Databricks Connection Properties

Use the Databricks connection to run mappings on a Databricks cluster.
A Databricks connection is a cluster type connection. You can create and manage a Databricks connection in the Administrator tool or the Developer tool. You can use infacmd to create a Databricks connection. Configure properties in the Databricks connection to enable communication between the Data Integration Service and the Databricks cluster.
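For example, a minimal infacmd sketch that creates the connection. The domain, user, password, and connection names are placeholders, and the connection type identifier and option name shown are assumptions; verify them against the Command Reference for your release:
infacmd.sh isp createConnection -dn MyDomain -un Administrator -pd MyPassword -cn MyDatabricksConn -cid MyDatabricksConn -ct DATABRICKS -o "clusterConfigId=MyClusterConfig"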
The following list describes the general connection properties for the Databricks connection:
Name
The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID
String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
Optional. The description of the connection. The description cannot exceed 4,000 characters.
Connection Type
Choose Databricks.
Cluster Configuration
Name of the cluster configuration associated with the Databricks environment. Required if you do not configure the cloud provisioning configuration.
Cloud Provisioning Configuration
Name of the cloud provisioning configuration associated with a Databricks cloud platform. Required if you do not configure the cluster configuration.
Staging Directory
The directory where the Databricks Spark engine stages run-time files. If you specify a directory that does not exist, the Data Integration Service creates it at run time. If you do not provide a directory path, the run-time staging files are written to /<cluster staging directory>/DATABRICKS.
Advanced Properties
List of advanced properties that are unique to the Databricks environment.
You can configure run-time properties for the Databricks environment in the Data Integration Service and in the Databricks connection. You can override a property configured at a high level by setting the value at a lower level. For example, if you configure a property in the Data Integration Service custom properties, you can override it in the Databricks connection (see the sketch after the list). The Data Integration Service processes property overrides based on the following priorities:
  1. Databricks connection advanced properties
  2. Data Integration Service custom properties
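For example, the following sketch overrides a value that is also set in the Data Integration Service custom properties. Adding this line to the advanced properties of a Databricks connection changes the JSON parser mode only for mappings that use that connection (the property itself is described in the next section):
infaspark.json.parser.mode=PERMISSIVE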
Informatica recommends that you consult third-party documentation, Informatica documentation, or Informatica Global Customer Support before you change these property values. If you change a value without understanding the property, you might experience performance degradation or other unexpected results.

Advanced Properties

Configure the following properties in the Advanced Properties of the Databricks configuration section:
infaspark.json.parser.mode
Specifies how the parser handles corrupt JSON records. You can set the value to one of the following modes (illustrated in the example after the list):
  • DROPMALFORMED. The parser ignores all corrupted records. Default mode.
  • PERMISSIVE. The parser accepts non-standard fields as nulls in corrupted records.
  • FAILFAST. The parser generates an exception when it encounters a corrupted record, and the Spark application goes down.
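As an illustration, consider a hypothetical input file in which the second record is truncated:
{"id": 1, "name": "alpha"}
{"id": 2, "name":
With DROPMALFORMED, the parser returns only the first record. With PERMISSIVE, the parser also returns the second record and substitutes nulls for the fields it cannot read. With FAILFAST, the parser raises an exception on the second record and the Spark application goes down.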
infaspark.json.parser.multiLine
Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.
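For example, a hypothetical file such as the following stores one record across several lines; set infaspark.json.parser.multiLine=true so that the parser can read it:
{
  "id": 1,
  "name": "alpha"
}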
infaspark.flatfile.writer.nullValue
When the Databricks Spark engine writes to a target, it converts null values to empty strings (""). For example, 12, AB,"",23p09udj.
The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an empty string to a non-string column, the mapping fails with a type mismatch.
To allow the Databricks Spark engine to convert the empty strings back to null values and write them to the target, configure the property in the Databricks connection.
Set to: TRUE
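In the advanced properties list, the setting follows the same name=value format as the other properties:
infaspark.flatfile.writer.nullValue=true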
infaspark.pythontx.exec
Required to run a Python transformation on the Databricks Spark engine. Set to the location of the Python executable binary on the worker nodes in the Databricks cluster.
When you provision the cluster at run time, set this property in the Databricks cloud provisioning configuration. Otherwise, set on the Databricks connection.
For example, set to:
infaspark.pythontx.exec=/databricks/python3/bin/python3
infaspark.pythontx.executorEnv.PYTHONHOME
Required to run a Python transformation on the Databricks Spark engine. Set to the location of the Python installation directory on the worker nodes in the Databricks cluster.
When you provision the cluster at run time, set this property in the Databricks cloud provisioning configuration. Otherwise, set on the Databricks connection.
For example, set to:
infaspark.pythontx.executorEnv.PYTHONHOME=/databricks/python3
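When you use both Python transformation properties together, the connection advanced properties contain both entries, for example:
infaspark.pythontx.exec=/databricks/python3/bin/python3
infaspark.pythontx.executorEnv.PYTHONHOME=/databricks/python3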
