Rules and Guidelines for Mappings on the Spark Engine

Consider the following run-time differences on the Spark engine:

Mapping Validation

Consider the following rules and guidelines for mapping validation:
  • Mapping validation fails if you configure SYSTIMESTAMP with a variable value, such as a port name. The function can take either no argument or a single argument that specifies the precision to which you want to retrieve the timestamp value.
  • Mapping validation fails if an output port contains a Timestamp with Time Zone data type.

Optimization

  • Set the optimizer level to none or minimal if a mapping validates but fails to run. If you set the optimizer level to use cost-based or semi-join optimization methods, the Data Integration Service ignores those methods at run time and uses the default optimizer level.
  • The run-time engine does not honor the early projection optimization method in all cases. If the Data Integration Service removes the links between unused ports, the run-time engine might reconnect the ports.
  • When you use the auto optimizer level, the early selection optimization method is enabled if the mapping contains any data source that supports pushing filters to the source on the Spark or Databricks Spark engine. For more information about optimizer levels, see the Informatica Developer Mapping Guide.
  • When the Spark engine runs a mapping, it processes jobs on the cluster using HiveServer2 in the following cases:
    • The mapping writes to a target that is a Hive table bucketed on fields of type char or varchar.
    • The mapping reads from or writes to Hive transaction-enabled tables.
    • The mapping reads from or writes to Hive tables where column-level security is enabled.
    • The mapping writes to a Hive target and is configured to create or replace the table at run time.

High Precision

Consider the following rules and guidelines for mappings that run in high precision or low precision mode:
  • When you use the TO_DECIMAL or TO_DECIMAL38 function in a mapping that runs in high precision mode, you must specify a scale argument. If the mapping runs in low precision mode, the Spark engine ignores the scale argument and returns a double.
  • If you enable high precision in a streaming mapping, the Spark engine runs the mapping in low precision mode.
  • If the mapping contains a complex port with an element of a decimal data type, the Spark engine runs the mapping in low precision mode.
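The TO_DECIMAL scale rule above can be sketched in Python. This is an illustrative analog, not Informatica code: `to_decimal` is a hypothetical helper that mimics the described behavior, where high precision mode requires a scale argument and low precision mode ignores it and returns a double.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal(value, scale=None, high_precision=True):
    """Sketch of the TO_DECIMAL behavior described above (not Informatica's
    implementation): scale is required in high precision mode; in low
    precision mode the scale argument is ignored and a double is returned."""
    if not high_precision:
        # Low precision mode: return a double, ignoring scale.
        return float(value)
    if scale is None:
        raise ValueError("high precision mode requires a scale argument")
    return Decimal(str(value)).quantize(Decimal(1).scaleb(-scale),
                                        rounding=ROUND_HALF_UP)

print(to_decimal("12.3456", scale=2))                        # 12.35
print(to_decimal("12.3456", scale=2, high_precision=False))  # 12.3456 (scale ignored)
```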

Overflow Values

The Spark engine and the Data Integration Service process overflow values differently. As a result, mapping results can vary between the native and Hadoop environments when the Spark engine processes an overflow.

Null and Invalid Values and Rejected Rows

Consider the following guidelines when mappings pass null or invalid values for rows:
  • The Spark run-time engine drops rejected rows even if you configure the mapping to forward rejected rows. The rejected rows are not written to the session log file.
  • If an expression results in numerical errors, such as division by zero or SQRT of a negative number, the Spark engine returns null. In the native environment, the same expression results in a row error.
  • The Hadoop environment treats "/n" values as null values. If an aggregate function contains empty or NULL values, the Hadoop environment includes these values while performing an aggregate calculation.
  • The Spark engine writes null values for rows when invalid values are passed in the following situations:
    • The terms argument in the PV, FV, PMT, and RATE finance functions passes a value of 0. The value of terms must be an integer greater than 0.
    • The month argument in the MAKE_DATE_TIME function passes an invalid value. The value of month must be from 1 to 12.
    In the native environment, the Data Integration Service rejects the row and does not write it to the target.
  • If data overflow occurs, the Spark engine returns null. For example, if you use the expression TO_DECIMAL(12.34, 2) in a port that has a precision of 3 and a scale of 2, the return value is null. The null value is propagated through the mapping. The mapping might overwrite it with a default value, detect it with the IS_NULL function, and write the null value to the target.
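The null-on-error semantics above can be illustrated with a small Python sketch. The helpers `spark_like_eval` and `to_decimal_port` are hypothetical names introduced here to mirror the described behavior, where numerical errors and overflow produce null instead of a row error.

```python
import math
from decimal import Decimal

def spark_like_eval(fn):
    """Evaluate an expression the way the text describes the Spark engine:
    numerical errors and data overflow yield null (None) rather than a
    row error. Illustrative sketch, not Informatica code."""
    try:
        return fn()
    except (ZeroDivisionError, ValueError, ArithmeticError):
        return None

def to_decimal_port(value, precision, scale):
    """Return the value if it fits a dec(precision, scale) port; raise on overflow."""
    q = Decimal(str(value)).quantize(Decimal(1).scaleb(-scale))
    if len(q.as_tuple().digits) > precision:
        raise OverflowError("value does not fit dec(%d,%d)" % (precision, scale))
    return q

print(spark_like_eval(lambda: 1 / 0))                           # None (native: row error)
print(spark_like_eval(lambda: math.sqrt(-1)))                   # None (native: row error)
print(spark_like_eval(lambda: to_decimal_port("12.34", 3, 2)))  # None: 12.34 needs precision 4
```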

Data Conversions

The Spark engine and the Data Integration Service process data type conversions differently. As a result, mapping results can vary between the native and Hadoop environment when the Spark engine performs a data type conversion. Consider the following processing variations for Spark:
  • The results of arithmetic operations on floating point types, such as Decimal, can vary between the native environment and a Hadoop environment. The difference between the results can increase across multiple operations.
  • When the number of fractional digits in a double or decimal value exceeds the scale that is configured in a decimal port, the Spark engine trims trailing digits, rounding the value if necessary.
  • If you use Hive 2.0 or higher, the Spark engine guarantees scale values. For example, when the Spark engine processes the decimal 1.1234567 with scale 9 using Hive 2.0, the output is 1.123456700.
  • The Spark engine cannot process dates to the nanosecond. It can return a precision for date/time data up to the microsecond.
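The trimming and guaranteed-scale behaviors above can be sketched with Python's decimal module. `fit_to_scale` is a hypothetical helper mirroring the description, not Informatica internals: excess fractional digits are trimmed with rounding, and shorter values are padded to the guaranteed scale.

```python
from decimal import Decimal, ROUND_HALF_UP

def fit_to_scale(value, scale):
    """Sketch of the behavior described above: trim fractional digits that
    exceed the port's scale (rounding if necessary), and pad shorter values
    out to the guaranteed scale."""
    return Decimal(str(value)).quantize(Decimal(1).scaleb(-scale),
                                        rounding=ROUND_HALF_UP)

print(fit_to_scale("1.23456789", 4))  # 1.2346 (trailing digits trimmed, rounded)
print(fit_to_scale("1.1234567", 9))   # 1.123456700 (padded to the guaranteed scale)
```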

Scale

The Spark engine and the Data Integration Service process scale differently. The Data Integration Service allows scale to differ between rows of decimal data while the Spark engine uses a fixed scale for each row. Because the scale is fixed, arithmetic operations can result in data overflow.
For example, the arithmetic operation dec(38,0) / dec(10,0) outputs a decimal dec(38,6) on the Spark engine. The operation might result in data overflow based on whether the result can be represented as a decimal dec(38,6).
The following table shows the decimal values of dec(10,0) that result in data overflow for several decimal values of dec(38,0):
    Decimal Value of dec(38,0)    Decimal Values of dec(10,0) that Result in Data Overflow
    Less than 10^32               None
    10^32                         Decimals with an absolute value less than or equal to 1
    10^33                         Decimals with an absolute value less than or equal to 10
    10^34                         Decimals with an absolute value less than or equal to 100
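The overflow cases in the table follow from the fixed dec(38,6) result type. A Python sketch can check whether a quotient fits in 38 significant digits with 6 fractional digits; `fits_dec38_6` is a hypothetical helper written for this illustration.

```python
from decimal import Decimal, localcontext

def fits_dec38_6(a, b):
    """Check whether a / b can be represented as dec(38,6), mirroring the
    fixed-scale division rule described above (illustrative sketch)."""
    with localcontext() as ctx:
        ctx.prec = 60  # head-room so the check itself does not overflow
        q = (Decimal(a) / Decimal(b)).quantize(Decimal("0.000001"))
    # dec(38,6) holds at most 38 significant digits, 6 of them fractional.
    return len(q.as_tuple().digits) <= 38

print(fits_dec38_6(10**31, 1))    # True: dividend below 10^32, no overflow
print(fits_dec38_6(10**32, 1))    # False: overflow, matching the table
print(fits_dec38_6(10**33, 100))  # True: divisor above 10 avoids overflow for 10^33
```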

Function

Consider the following rules and guidelines for various functions:
  • Avoid including single and nested functions in an Aggregator transformation. The Data Integration Service fails the mapping in the native environment. It can push the processing to the Hadoop environment, but you might get unexpected results. Informatica recommends creating multiple transformations to perform the aggregation.
  • The Spark METAPHONE function uses phonetic encoders from the org.apache.commons.codec.language library. When the Spark engine runs a mapping, the METAPHONE function can produce an output that is different from the output in the native environment. The following table shows some examples:

    String    Data Integration Service    Spark Engine
    Might     MFT                         MT
    High      HF                          H
  • If you use the TO_DATE function on the Spark engine to process a string written in ISO standard format, you must add T to the date string and "T" to the format string. The following expression shows an example that uses the TO_DATE function to convert a string written in the ISO standard format YYYY-MM-DDTHH24:MI:SS:

    TO_DATE('2017-11-03T12:45:00','YYYY-MM-DD"T"HH24:MI:SS')

    The following table shows how the function converts the string:

    ISO Standard Format    Return Value
    2017-11-03T12:45:00    Nov 03 2017 12:45:00
  • The UUID4 function is supported only when used as an argument in UUID_UNPARSE or ENC_BASE64.
  • The UUID_UNPARSE function is supported only when the argument is UUID4( ).
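The literal-T handling in the TO_DATE example has a close analog in Python's datetime parsing, shown here as an illustration: the T that separates date and time in the ISO string must also appear in the format string.

```python
from datetime import datetime

# Python analog of the TO_DATE example above: the literal 'T' in the ISO
# timestamp appears in the format string as well. Informatica's syntax
# quotes it as "T"; strptime takes it as a literal character.
value = datetime.strptime("2017-11-03T12:45:00", "%Y-%m-%dT%H:%M:%S")
print(value.strftime("%b %d %Y %H:%M:%S"))  # Nov 03 2017 12:45:00
```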

Other Rules and Guidelines

You cannot preview data for a transformation that is configured for windowing.


Updated March 31, 2021