Table of Contents

  1. Preface
  2. Introduction to Informatica Big Data Management
  3. Connections
  4. Mappings in the Hadoop Environment
  5. Mapping Objects in the Hadoop Environment
  6. Processing Hierarchical Data on the Spark Engine
  7. Stateful Computing on the Spark Engine
  8. Monitoring Mappings in the Hadoop Environment
  9. Mappings in the Native Environment
  10. Profiles
  11. Native Environment Optimization
  12. Data Type Reference
  13. Complex File Data Object Properties
  14. Function Reference
  15. Parameter Reference

Transformation Support on the Spark Engine

Some restrictions and guidelines apply to processing transformations on the Spark engine.
The following entries describe rules and guidelines for the transformations that are supported on the Spark engine. Transformations that are not listed are not supported.
Aggregator
Mapping validation fails in the following situations:
  • The transformation contains stateful variable ports.
  • The transformation contains unsupported functions in an expression.
When a mapping contains an Aggregator transformation with an input/output port that is not a group by port, the transformation might not return the last row of each group with the result of the aggregation. Because Hadoop execution is distributed, the engine might not be able to determine the actual last row of each group.
Data Masking
Mapping validation fails in the following situations:
  • The transformation is configured for repeatable expression masking.
  • The transformation is configured for unique repeatable substitution masking.
Expression
Mapping validation fails in the following situations:
  • The transformation contains stateful variable ports.
  • The transformation contains unsupported functions in an expression.
If an expression results in a numerical error, such as division by zero or the square root of a negative number, the Spark engine returns an infinite or a NaN value. In the native environment, the expression returns null values and the rows do not appear in the output.
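The following standalone Java snippet is an illustration only, not Informatica expression code; it shows the IEEE 754 values that such operations produce, comparable to the values the Spark engine returns:

    public class NumericErrorDemo {
        public static void main(String[] args) {
            // Division by zero on floating-point values yields Infinity,
            // and the square root of a negative number yields NaN.
            System.out.println(1.0 / 0.0);      // Infinity
            System.out.println(Math.sqrt(-1));  // NaN
        }
    }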
Filter
Supported without restrictions.
Java
Mapping validation fails in the following situation:
  • You reference an unconnected Lookup transformation from an expression within a Java transformation.
To use external .jar files in a Java transformation, perform the following steps:
  1. Copy the external .jar files to the following location on the machine that runs the Data Integration Service, and then recycle the Data Integration Service:
    <Informatica installation directory>/services/shared/jars
  2. On the machine that hosts the Developer tool where you develop and run the mapping that contains the Java transformation:
    1. Copy external .jar files to a directory on the local machine.
    2. Edit the Java transformation to include an import statement pointing to the local .jar files, as in the sketch after these steps.
    3. Update the classpath in the Java transformation.
    4. Compile the transformation.
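For example, the following sketch shows the kind of code that steps 2.2 through 2.4 refer to. The GeoCoder class, the geocoder.jar file, and the port names are hypothetical; in a Java transformation, the import statement belongs on the Imports tab and the call belongs on the On Input Row tab:

    // Imports tab: import a class from the external .jar file.
    // com.example.geo.GeoCoder is a hypothetical class packaged in the
    // local geocoder.jar that the updated classpath points to.
    import com.example.geo.GeoCoder;

    // On Input Row tab: call the external library for each row.
    // in_address and out_region are assumed transformation ports.
    GeoCoder coder = new GeoCoder();
    out_region = coder.lookupRegion(in_address);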
To run user code directly on the Spark engine, the JDK version that the Data Integration Service uses must be compatible with the JRE version on the cluster. For best performance, create the environment variable DIS_JDK_HOME on the Data Integration Service in the Administrator tool. The environment variable contains the path to the JDK installation directory on the machine that runs the Data Integration Service, for example /usr/java/default.
The Partitionable property must be enabled in the Java transformation. The transformation cannot run in one partition.
For date/time values, the Spark engine supports precision up to the microsecond. If a date/time value contains nanoseconds, the trailing digits are truncated.
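A minimal Java sketch of this truncation, using java.sql.Timestamp to stand in for a date/time value:

    import java.sql.Timestamp;

    public class MicrosecondTruncation {
        public static void main(String[] args) {
            // A value with nanosecond precision.
            Timestamp ts = Timestamp.valueOf("2018-12-13 10:15:30.123456789");
            // Drop the digits beyond microseconds, as the Spark engine does.
            ts.setNanos((ts.getNanos() / 1000) * 1000);
            System.out.println(ts); // 2018-12-13 10:15:30.123456
        }
    }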
When you enable high precision and the Java transformation contains a field that is a decimal data type, a validation error occurs.
The following restrictions apply to the Transformation Scope property:
  • The value Transaction for transformation scope is not valid.
  • If you enable an input port for partition key, the transformation scope must be set to All Input.
  • The Stateless property must be enabled if the transformation scope is Row.
The Java code in the transformation cannot write output to standard output when you push transformation logic to Hadoop. The Java code can write output to standard error, which appears in the log files.
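For example, on the On Input Row tab you might log diagnostics as follows; the rowCount variable is a hypothetical name:

    // Standard error is captured in the Hadoop log files.
    System.err.println("Processed rows: " + rowCount);
    // Standard output is lost when the logic runs on Hadoop:
    // System.out.println("Processed rows: " + rowCount);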
Joiner
Mapping validation fails in the following situations:
  • Case sensitivity is disabled.
  • The join condition is of binary data type or contains binary expressions.
Lookup
Mapping validation fails in the following situations:
  • Case sensitivity is disabled.
  • The lookup condition in the Lookup transformation contains a binary data type.
  • The lookup is a data object.
  • The cache is configured to be shared, named, persistent, dynamic, or uncached. The cache must be a static cache.
The mapping fails in the following situation:
  • The transformation is unconnected and used with a Joiner or Java transformation.
When you choose to return the first, last, or any value on multiple matches, the Lookup transformation returns any value.
If you configure the transformation to report an error on multiple matches, the Spark engine drops the duplicate rows and does not include the rows in the logs.
Normalizer
Supported without restrictions.
Rank
Mapping validation fails in the following situations:
  • Case sensitivity is disabled.
  • The rank port is of binary data type.
Router
Supported without restrictions.
Sorter
Mapping validation fails in the following situations:
  • Case sensitivity is disabled.
The Data Integration Service logs a warning and ignores the Sorter transformation in the following situations:
  • There is a type mismatch between the target and the Sorter transformation sort keys.
  • The transformation contains sort keys that are not connected to the target.
  • The Write transformation is not configured to maintain row order.
  • The transformation is not directly upstream from the Write transformation.
The Data Integration Service treats null values as high even if you configure the transformation to treat null values as low.
Union
Supported without restrictions.
Update Strategy
The Update Strategy transformation is supported only on Hadoop distributions that support Hive ACID.
Mapping validation fails in the following situations:
  • The Update Strategy transformation is connected to more than one target.
  • The Update Strategy transformation is not located immediately before the target.
  • The Update Strategy target is not a Hive target.
  • The Update Strategy transformation target is an external ACID table.
  • The target does not contain a connected primary key.
  • The Hive target property to truncate the target table at run time is enabled.
  • The Hive target property to create or replace the target table at run time is enabled.
The mapping fails in the following situations:
  • The target table is not enabled for transactions.
  • The target table is not bucketed and stored in ORC format.
The Update Strategy transformation does not forward rejected rows to the next transformation.
To use a Hive target table with an Update Strategy transformation, you must create the Hive target table with the following clause in the Hive Data Definition Language: TBLPROPERTIES ("transactional"="true").
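For reference, the following Java sketch creates a Hive target table that satisfies the requirements above: bucketed, stored as ORC, and created with the transactional table property. The connection URL, credentials, and table definition are placeholders, and the standard Hive JDBC driver is assumed to be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateAcidHiveTarget {
        public static void main(String[] args) throws Exception {
            // Placeholder HiveServer2 URL and credentials.
            String url = "jdbc:hive2://hiveserver2-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement()) {
                // A bucketed ORC table with the transactional property set,
                // as required for an Update Strategy target.
                stmt.execute(
                    "CREATE TABLE customer_target ("
                    + "  customer_id INT,"
                    + "  name STRING)"
                    + " CLUSTERED BY (customer_id) INTO 4 BUCKETS"
                    + " STORED AS ORC"
                    + " TBLPROPERTIES (\"transactional\"=\"true\")");
            }
        }
    }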
To use an Update Strategy transformation with a Hive target, verify that the following properties are configured in the hive-site.xml configuration set associated with the Hadoop connection:
  • hive.support.concurrency = true
  • hive.enforce.bucketing = true
  • hive.exec.dynamic.partition.mode = nonstrict
  • hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  • hive.compactor.initiator.on = true
  • hive.compactor.worker.threads = 1
If the Update Strategy transformation receives multiple update rows for the same primary key value, the transformation selects one random row to update the target.
If multiple Update Strategy transformations write to different instances of the same target, the target data might be unpredictable.
The Spark engine executes operations in the following order: deletes, updates, inserts. It does not process rows in the same order as the Update Strategy transformation receives them.
Hive targets always perform Update as Update operations. Hive targets do not support Update Else Insert or Update as Insert.


Updated December 13, 2018