User Guide

10.4.1

Spark Advanced Properties

Spark advanced properties are a list of advanced or custom properties that are unique to the Spark engine. Each property contains a name and a value. You can add or edit advanced properties.

Configure the following properties in the Advanced Properties of the Spark configuration section:

To edit the property in the text box, use the following format with &: to separate each name-value pair:

<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
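The name-value format above can be sketched in a few lines of Python. This is only an illustration of the `&:` separator rule, not an Informatica utility: the helper names `format_advanced_properties` and `parse_advanced_properties` are hypothetical, and the sketch assumes that no property value itself contains `&:`.

```python
SEPARATOR = "&:"  # separator between name-value pairs, as described above


def format_advanced_properties(props):
    """Join name-value pairs into the single-line text box format."""
    return SEPARATOR.join(f"{name}={value}" for name, value in props.items())


def parse_advanced_properties(text):
    """Split the text box format back into a dict of name-value pairs."""
    pairs = {}
    for item in text.split(SEPARATOR):
        if not item:
            continue
        # Split on the first '=' only, because values such as Java options
        # can contain '=' themselves.
        name, _, value = item.partition("=")
        pairs[name.strip()] = value.strip()
    return pairs


example = format_advanced_properties({
    "spark.executor.cores": "1",
    "spark.executor.memory": "3G",
})
print(example)  # spark.executor.cores=1&:spark.executor.memory=3G
```

Splitting on the first `=` matters for properties such as the extra Java options, where the value is itself a list of `-Dname=value` settings.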

infasjs.env.spark.context-settings.passthrough.spark.dynamicAllocation.executorIdleTimeout: Maximum time that a Spark Jobserver executor node can be idle before it is removed. Increase the value to assist in debugging data preview jobs that use the Spark engine.
You can specify the time in seconds, minutes, or hours using the suffix s, m, or h, respectively. If you do not specify a time unit, the property uses milliseconds.

If you disable dynamic resource allocation, this property is not used.

Default is 120s.
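To make the unit rule concrete, the following sketch converts an idle timeout value to milliseconds. It handles only the suffixes named above (s, m, h, or none) and is an illustration rather than the parser that the Spark engine uses. The function name `idle_timeout_to_ms` is hypothetical.

```python
import re

# Multipliers for the suffixes described above. An empty suffix means
# the value is already in milliseconds.
_UNIT_TO_MS = {"": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}


def idle_timeout_to_ms(value):
    """Convert a value such as '120s', '2m', '1h', or '500' to milliseconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MS[unit]


print(idle_timeout_to_ms("120s"))  # 120000, the default expressed in milliseconds
```

Under this rule, a value of 120 with no suffix means 120 milliseconds, not the 120-second default.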

infasjs.env.spark.jobserver.max-jobs-per-context: Maximum number of Spark jobs that can run simultaneously on a Spark context. If you increase the value of this property, you might need to allocate more resources by increasing spark.executor.cores and spark.executor.memory.
Default is 10.

infasjs.env.spark.jobserver.sparkJobTimeoutInMinutes: Maximum time in minutes that a Spark job can run on a Spark context before the Spark Jobserver cancels the job. Increase the value to assist in debugging data preview jobs that use the Spark engine.
Default is 15.

infaspark.class.log.level.map: Logging level for specific classes in the Spark driver or executor. When you configure this property, it overrides the tracing level you set for the mapping.
Set the value of this property to a JSON string in the following format: {"<fully qualified class name>":"<log level>"}

Join multiple class logging level statements with a comma. You can use the following logging levels: FATAL, WARN, INFO, DEBUG, ALL.

For example, set to:
infaspark.class.log.level.map={"org.apache.spark.deploy.yarn.ApplicationMaster":"TRACE","org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider":"DEBUG"}
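One way to avoid quoting mistakes in the JSON value is to generate it. The sketch below builds the property value from a Python dict with the standard `json` module. The helper name `class_log_level_value` is hypothetical, and the class names are taken from the example above.

```python
import json


def class_log_level_value(levels):
    """Serialize a {class name: log level} dict as a compact JSON string."""
    # Compact separators produce {"a":"b","c":"d"} with no spaces,
    # which matches the format shown above.
    return json.dumps(levels, separators=(",", ":"))


value = class_log_level_value({
    "org.apache.spark.deploy.yarn.ApplicationMaster": "DEBUG",
    "org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider": "INFO",
})
print(f"infaspark.class.log.level.map={value}")
```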

infaspark.driver.cluster.mode.extraJavaOptions: List of extra Java options for the Spark driver that runs inside the cluster. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.

For example, set to:

infaspark.driver.cluster.mode.extraJavaOptions= -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true -Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf -Djava.security.auth.login.config=/<path to JAAS config>/kafka_client_jaas.config

To configure the property for a specific user, you can include the following lines of code:

infaspark.driver.cluster.mode.extraJavaOptions = -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500 -Djava.security.krb5.conf=/etc/krb5.conf

infaspark.driver.log.level: Logging level for the Spark driver logs. When you configure this property, it overrides the tracing level you set for the mapping.
Set the value to one of the following levels: FATAL, WARN, INFO, DEBUG, ALL.

infaspark.executor.extraJavaOptions: List of extra Java options for the Spark executor. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.

For example, set to:

infaspark.executor.extraJavaOptions= -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true -Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf -Djava.security.auth.login.config=/<path to JAAS config>/kafka_client_jaas.config

To configure the property for a specific user, you can include the following lines of code:

infaspark.executor.extraJavaOptions = -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500 -Djava.security.krb5.conf=/etc/krb5.conf

infaspark.executor.log.level: Logging level for the Spark executor logs. When you configure this property, it overrides the tracing level you set for the mapping.
Set the value to one of the following levels: FATAL, WARN, INFO, DEBUG, ALL.

infaspark.flatfile.writer.nullValue: When the Databricks Spark engine writes to a target, it converts null values to empty strings (""). For example, 12, AB,"",23p09udj. The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an empty string to a non-string column, the mapping fails with a type mismatch.
To allow the Databricks Spark engine to convert the empty strings back to null values and write to the target, configure the property in the Databricks Spark connection.

Set to: TRUE

infaspark.json.parser.mode: Specifies how the parser handles corrupt JSON records. You can set the value to one of the following modes:

DROPMALFORMED. The parser ignores all corrupted records. Default mode.
PERMISSIVE. The parser accepts non-standard fields as nulls in corrupted records.
FAILFAST. The parser generates an exception when it encounters a corrupted record and the Spark application goes down.
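The difference between the three modes can be illustrated with a small pure-Python analogy. This is not Spark's parser and does not call any Spark API. It only mimics the behavior described above on a list of JSON lines, and it uses `None` as a stand-in for the null values that PERMISSIVE mode produces. The function name `parse_records` is hypothetical.

```python
import json

MODES = ("DROPMALFORMED", "PERMISSIVE", "FAILFAST")


def parse_records(lines, mode="DROPMALFORMED"):
    """Parse JSON lines, treating corrupted records according to the mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            if mode == "DROPMALFORMED":
                continue           # ignore the corrupted record
            if mode == "PERMISSIVE":
                rows.append(None)  # keep a placeholder with no usable values
                continue
            raise                  # FAILFAST: stop at the first corrupted record
    return rows


lines = ['{"id": 1}', '{"id": ', '{"id": 3}']  # the second record is truncated
print(parse_records(lines, "DROPMALFORMED"))
print(parse_records(lines, "PERMISSIVE"))
```

In this analogy, DROPMALFORMED returns two rows, PERMISSIVE returns three rows with a placeholder in the middle, and FAILFAST raises an exception on the second line.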

infaspark.json.parser.multiLine: Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.

infaspark.pythontx.exec: Required to run a Python transformation on the Spark engine for Data Engineering Integration. The location of the Python executable binary on the worker nodes in the Hadoop cluster.

For example, set to:
infaspark.pythontx.exec=/usr/bin/python3.4

If you use the installation of Python on the Data Integration Service machine, set the value to the Python executable binary in the Informatica installation directory on the Data Integration Service machine.

For example, set to:
infaspark.pythontx.exec=INFA_HOME/services/shared/spark/python/lib/python3.4

infaspark.pythontx.executorEnv.LD_PRELOAD: Required to run a Python transformation on the Spark engine for Data Engineering Streaming. The location of the Python shared library in the Python installation folder on the Data Integration Service machine.

For example, set to:

infaspark.pythontx.executorEnv.LD_PRELOAD=INFA_HOME/services/shared/spark/python/lib/libpython3.6m.so

infaspark.pythontx.executorEnv.PYTHONHOME: Required to run a Python transformation on the Spark engine for Data Engineering Integration and Data Engineering Streaming. The location of the Python installation directory on the worker nodes in the Hadoop cluster.

For example, set to:
infaspark.pythontx.executorEnv.PYTHONHOME=/usr

If you use the installation of Python on the Data Integration Service machine, use the location of the Python installation directory on the Data Integration Service machine.

For example, set to:
infaspark.pythontx.executorEnv.PYTHONHOME=INFA_HOME/services/shared/spark/python/

infaspark.pythontx.submit.lib.JEP_HOME: Required to run a Python transformation on the Spark engine for Data Engineering Streaming. The location of the Jep package in the Python installation folder on the Data Integration Service machine.

For example, set to:
infaspark.pythontx.submit.lib.JEP_HOME=INFA_HOME/services/shared/spark/python/lib/python3.6/site-packages/jep/

infaspark.useHiveWarehouseAPI: Enables the Hive Warehouse Connector. Set to TRUE.
For example,
infaspark.useHiveWarehouseAPI=true

spark.authenticate: Enables authentication for the Spark service on Hadoop. Required for Spark encryption.

Set to TRUE.

For example,
spark.authenticate=TRUE

spark.authenticate.enableSaslEncryption: Enables encrypted communication when SASL authentication is enabled. Required if Spark encryption uses SASL authentication.

Set to TRUE.

For example,
spark.authenticate.enableSaslEncryption=TRUE

spark.datasource.hive.warehouse.load.staging.dir: Directory for the temporary HDFS files used for batch writes to Hive. Required when you enable the Hive Warehouse Connector.
For example, set to:
/tmp

spark.datasource.hive.warehouse.metastoreUri: URI for the Hive metastore. Required when you enable the Hive Warehouse Connector. Use the value for hive.metastore.uris from the hive_site_xml cluster configuration properties.
For example, set the value to:
thrift://mycluster-1.com:9083

spark.driver.cores: Indicates the number of cores that each driver uses to run jobs on the Spark engine.
Set to:
spark.driver.cores=1

spark.driver.extraJavaOptions: List of extra Java options for the Spark driver.
When you write date/time data within a complex data type to a Hive target using a Hortonworks HDP 3.1 cluster, append the following value to the property:
-Duser.timezone=UTC

spark.driver.memory: Indicates the amount of driver process memory that the Spark engine uses to run jobs.
Recommended value: Allocate at least 256 MB for every data source.

Set to:
spark.driver.memory=3G
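The sizing guideline above reduces to simple arithmetic. The sketch below assumes the guideline is read as a per-source sum, which is one interpretation of the recommendation and not a formula stated by the product. The function name `min_driver_memory_mb` is hypothetical.

```python
MB_PER_SOURCE = 256  # recommended minimum per data source, from the text above


def min_driver_memory_mb(source_count):
    """Return the minimum driver memory in MB for a number of data sources."""
    if source_count < 0:
        raise ValueError("source_count must not be negative")
    return source_count * MB_PER_SOURCE


# Under this reading, the 3G example value (3072 MB) covers up to 12 sources.
print(min_driver_memory_mb(12))  # 3072
```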

spark.executor.cores: Indicates the number of cores that each executor process uses to run tasklets on the Spark engine.
Set to:
spark.executor.cores=1

spark.executor.extraJavaOptions: List of extra Java options for the Spark executor.
When you write date/time data within a complex data type to a Hive target using a Hortonworks HDP 3.1 cluster, append the following value to the property:
-Duser.timezone=UTC

spark.executor.instances: Indicates the number of executor instances that run tasklets on the Spark engine.
Set to:
spark.executor.instances=1

spark.executor.memory: Indicates the amount of memory that each executor process uses to run tasklets on the Spark engine.
Set to:
spark.executor.memory=3G

spark.hadoop.hive.llap.daemon.service.hosts: Application name for the LLAP service. Required when you enable the Hive Warehouse Connector. Use the value for hive.llap.daemon.service.hosts from the hive_site_xml cluster configuration properties.

spark.hadoop.hive.zookeeper.quorum: Zookeeper hosts used by Hive LLAP. Required when you enable the Hive Warehouse Connector. Use the value for hive.zookeeper.quorum from the hive_site_xml cluster configuration properties.

spark.hadoop.validateOutputSpecs: Validates if the HBase table exists. Required for streaming mappings to write to an HBase target in an Amazon EMR cluster. Set the value to false.

spark.scheduler.maxRegisteredResourcesWaitingTime: The number of milliseconds to wait for resources to register before scheduling a task. Default is 30000. Decrease the value to reduce delays before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.

Set to 15000.

For example,
spark.scheduler.maxRegisteredResourcesWaitingTime=15000

spark.scheduler.minRegisteredResourcesRatio: The minimum ratio of registered resources to acquire before task scheduling begins. Default is 0.8. Decrease the value to reduce any delay before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.

Set to: 0.5

For example,
spark.scheduler.minRegisteredResourcesRatio=0.5

spark.shuffle.encryption.enabled: Enables encrypted communication when authentication is enabled. Required for Spark encryption.

Set to TRUE.

For example,
spark.shuffle.encryption.enabled=TRUE

spark.sql.hive.hiveserver2.jdbc.url: URL for HiveServer2 Interactive. Required to use the Hive Warehouse Connector. Use the value in Ambari for HiveServer2 JDBC URL.
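Several properties in this list are described as required when you enable the Hive Warehouse Connector. As a rough aid, the following sketch checks a dict of advanced properties for those entries. It is a hypothetical helper and not part of the product: the function name `missing_hwc_properties` is an assumption, and the list of required names is collected from the descriptions above.

```python
# Properties that the descriptions above mark as required when the
# Hive Warehouse Connector is enabled.
HWC_REQUIRED = (
    "spark.datasource.hive.warehouse.load.staging.dir",
    "spark.datasource.hive.warehouse.metastoreUri",
    "spark.hadoop.hive.llap.daemon.service.hosts",
    "spark.hadoop.hive.zookeeper.quorum",
    "spark.sql.hive.hiveserver2.jdbc.url",
)


def missing_hwc_properties(props):
    """Return the required connector properties that are absent or empty."""
    enabled = props.get("infaspark.useHiveWarehouseAPI", "").lower() == "true"
    if not enabled:
        return []  # connector is not enabled, so nothing is required
    return [name for name in HWC_REQUIRED if not props.get(name)]


props = {
    "infaspark.useHiveWarehouseAPI": "true",
    "spark.datasource.hive.warehouse.load.staging.dir": "/tmp",
    "spark.datasource.hive.warehouse.metastoreUri": "thrift://mycluster-1.com:9083",
}
for name in missing_hwc_properties(props):
    print(f"missing: {name}")
```

A check like this can run before you paste the properties into the connection, so that a missing LLAP or ZooKeeper setting is caught before the mapping runs.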
