Rules and Guidelines for Mappings on the Spark Engine

Consider the following run-time differences on the Spark engine:

Mapping Validation

Consider the following rules and guidelines for mapping validation:
  • Mapping validation fails if you configure SYSTIMESTAMP with a variable value, such as a port name. The function can take either no argument or a single argument that specifies the precision to which you want to retrieve the timestamp value.
  • Mapping validation fails if an output port contains a Timestamp with Time Zone data type.

Optimization

  • Set the optimizer level to none or minimal if a mapping validates but fails to run. If you set the optimizer level to use cost-based or semi-join optimization methods, the Data Integration Service ignores those methods at run time and uses the default optimizer level.
  • The run-time engine does not honor the early projection optimization method in all cases. If the Data Integration Service removes the links between unused ports, the run-time engine might reconnect the ports.
  • When you use the auto optimizer level, the early selection optimization method is enabled if the mapping contains any data source that supports pushing filters to the source on the Spark or Databricks Spark engine. For more information about optimizer levels, see the Informatica Developer Mapping Guide.
  • When the Spark engine runs a mapping, it processes jobs on the cluster using HiveServer2 in the following cases:
    • The mapping writes to a target that is a Hive table bucketed on fields of type char or varchar.
    • The mapping reads from or writes to Hive transaction-enabled tables.
    • The mapping reads from or writes to Hive tables where column-level security is enabled.
    • The mapping writes to a Hive target and is configured to create or replace the table at run time.

High Precision

Consider the following rules and guidelines for mappings that run in high precision or low precision mode:
  • When you use the TO_DECIMAL or TO_DECIMAL38 function in a mapping that runs in high precision mode, you must specify a scale argument. If the mapping runs in low precision mode, the Spark engine ignores the scale argument and returns a double.
  • If you enable high precision in a streaming mapping, the Spark engine runs the mapping in low precision mode.
  • If the mapping contains a complex port with an element of a decimal data type, the Spark engine runs the mapping in low precision mode.
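The TO_DECIMAL scale rule above can be sketched in Python. This is an illustrative analog, not Informatica code: `to_decimal` is a hypothetical helper that mimics the described behavior, where high precision mode requires a scale argument and low precision mode ignores it and returns a double.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_decimal(value, scale=None, high_precision=True):
    """Sketch of the TO_DECIMAL behavior described above (not Informatica's
    implementation): scale is required in high precision mode; in low
    precision mode the scale argument is ignored and a double is returned."""
    if not high_precision:
        # Low precision mode: return a double, ignoring scale.
        return float(value)
    if scale is None:
        raise ValueError("high precision mode requires a scale argument")
    return Decimal(str(value)).quantize(Decimal(1).scaleb(-scale),
                                        rounding=ROUND_HALF_UP)

print(to_decimal("12.3456", scale=2))                        # 12.35
print(to_decimal("12.3456", scale=2, high_precision=False))  # 12.3456 (scale ignored)
```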

Overflow Values

The Spark engine and the Data Integration Service process overflow values differently. As a result, mapping results can vary between the native and Hadoop environments when the Spark engine processes an overflow.

Null and Invalid Values and Rejected Rows

Consider the following guidelines when mappings pass null or invalid values for rows:
  • The Spark run-time engine drops rejected rows even if you configure the mapping to forward rejected rows. The rejected rows are not written to the session log file.
  • If an expression results in numerical errors, such as division by zero or SQRT of a negative number, the Spark engine returns null. In the native environment, the same expression results in a row error.
  • The Hadoop environment treats "/n" values as null values. If an aggregate function contains empty or NULL values, the Hadoop environment includes these values while performing an aggregate calculation.
  • The Spark engine writes null values for rows when invalid values are passed in the following situations:
    • The terms argument in the PV, FV, PMT, and RATE finance functions passes a value of 0. The value of terms must be an integer greater than 0.
    • The month argument in the MAKE_DATE_TIME function passes an invalid value. The value of month must be from 1 to 12.
    In the native environment, the Data Integration Service rejects the row and does not write it to the target.
  • If data overflow occurs, the Spark engine returns null. For example, if you use the expression TO_DECIMAL(12.34, 2) in a port that has a precision of 3 and a scale of 2, the return value is null. The null value is propagated through the mapping. The mapping might overwrite it with a default value, detect it with the IS_NULL function, and write the null value to the target.
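The null-on-error semantics above can be illustrated with a small Python sketch. The helpers `spark_like_eval` and `to_decimal_port` are hypothetical names introduced here to mirror the described behavior, where numerical errors and overflow produce null instead of a row error.

```python
import math
from decimal import Decimal

def spark_like_eval(fn):
    """Evaluate an expression the way the text describes the Spark engine:
    numerical errors and data overflow yield null (None) rather than a
    row error. Illustrative sketch, not Informatica code."""
    try:
        return fn()
    except (ZeroDivisionError, ValueError, ArithmeticError):
        return None

def to_decimal_port(value, precision, scale):
    """Return the value if it fits a dec(precision, scale) port; raise on overflow."""
    q = Decimal(str(value)).quantize(Decimal(1).scaleb(-scale))
    if len(q.as_tuple().digits) > precision:
        raise OverflowError("value does not fit dec(%d,%d)" % (precision, scale))
    return q

print(spark_like_eval(lambda: 1 / 0))                           # None (native: row error)
print(spark_like_eval(lambda: math.sqrt(-1)))                   # None (native: row error)
print(spark_like_eval(lambda: to_decimal_port("12.34", 3, 2)))  # None: 12.34 needs precision 4
```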

Data Conversions

The Spark engine and the Data Integration Service process data type conversions differently. As a result, mapping results can vary between the native and Hadoop environment when the Spark engine performs a data type conversion. Consider the following processing variations for Spark:
  • The results of arithmetic operations on floating point types, such as Decimal, can vary between the native environment and a Hadoop environment. The difference between the results can increase across multiple operations.
  • When the number of fractional digits in a double or decimal value exceeds the scale that is configured in a decimal port, the Spark engine trims trailing digits, rounding the value if necessary.
  • If you use Hive 2.0 or higher, the Spark engine guarantees scale values. For example, when the Spark engine processes the decimal 1.1234567 with scale 9 using Hive 2.0, the output is 1.123456700.
  • The Spark engine cannot process dates to the nanosecond. It can return a precision for date/time data up to the microsecond.
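The trimming and guaranteed-scale behaviors above can be sketched with Python's decimal module. `fit_to_scale` is a hypothetical helper mirroring the description, not Informatica internals: excess fractional digits are trimmed with rounding, and shorter values are padded to the guaranteed scale.

```python
from decimal import Decimal, ROUND_HALF_UP

def fit_to_scale(value, scale):
    """Sketch of the behavior described above: trim fractional digits that
    exceed the port's scale (rounding if necessary), and pad shorter values
    out to the guaranteed scale."""
    return Decimal(str(value)).quantize(Decimal(1).scaleb(-scale),
                                        rounding=ROUND_HALF_UP)

print(fit_to_scale("1.23456789", 4))  # 1.2346 (trailing digits trimmed, rounded)
print(fit_to_scale("1.1234567", 9))   # 1.123456700 (padded to the guaranteed scale)
```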

Scale

The Spark engine and the Data Integration Service process scale differently. The Data Integration Service allows scale to differ between rows of decimal data while the Spark engine uses a fixed scale for each row. Because the scale is fixed, arithmetic operations can result in data overflow.
For example, the arithmetic operation dec(38,0) / dec(10,0) outputs a decimal dec(38,6) on the Spark engine. The operation might result in data overflow based on whether the result can be represented as a decimal dec(38,6).
The following table shows the decimal values of dec(10,0) that result in data overflow for several decimal values of dec(38,0):
    Decimal Value of dec(38,0)    Decimal Values of dec(10,0) that Result in Data Overflow
    Less than 10^32               None
    10^32                         Decimals with an absolute value less than or equal to 1
    10^33                         Decimals with an absolute value less than or equal to 10
    10^34                         Decimals with an absolute value less than or equal to 100
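The overflow cases in the table follow from the fixed dec(38,6) result type. A Python sketch can check whether a quotient fits in 38 significant digits with 6 fractional digits; `fits_dec38_6` is a hypothetical helper written for this illustration.

```python
from decimal import Decimal, localcontext

def fits_dec38_6(a, b):
    """Check whether a / b can be represented as dec(38,6), mirroring the
    fixed-scale division rule described above (illustrative sketch)."""
    with localcontext() as ctx:
        ctx.prec = 60  # head-room so the check itself does not overflow
        q = (Decimal(a) / Decimal(b)).quantize(Decimal("0.000001"))
    # dec(38,6) holds at most 38 significant digits, 6 of them fractional.
    return len(q.as_tuple().digits) <= 38

print(fits_dec38_6(10**31, 1))    # True: dividend below 10^32, no overflow
print(fits_dec38_6(10**32, 1))    # False: overflow, matching the table
print(fits_dec38_6(10**33, 100))  # True: divisor above 10 avoids overflow for 10^33
```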

Function

Consider the following rules and guidelines for various functions:
  • Avoid including single and nested functions in an Aggregator transformation. The Data Integration Service fails the mapping in the native environment. It can push the processing to the Hadoop environment, but you might get unexpected results. Informatica recommends creating multiple transformations to perform the aggregation.
  • The Spark METAPHONE function uses phonetic encoders from the org.apache.commons.codec.language library. When the Spark engine runs a mapping, the METAPHONE function can produce an output that is different from the output in the native environment. The following table shows some examples:

    String    Data Integration Service    Spark Engine
    Might     MFT                         MT
    High      HF                          H
  • If you use the TO_DATE function on the Spark engine to process a string written in ISO standard format, you must add T to the date string and "T" to the format string. The following expression shows an example that uses the TO_DATE function to convert a string written in the ISO standard format YYYY-MM-DDTHH24:MI:SS:

    TO_DATE('2017-11-03T12:45:00','YYYY-MM-DD"T"HH24:MI:SS')

    The following table shows how the function converts the string:

    ISO Standard Format    Return Value
    2017-11-03T12:45:00    Nov 03 2017 12:45:00
  • The UUID4 function is supported only when used as an argument in UUID_UNPARSE or ENC_BASE64.
  • The UUID_UNPARSE function is supported only when the argument is UUID4( ).
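The literal-T handling in the TO_DATE example has a close analog in Python's datetime parsing, shown here as an illustration: the T that separates date and time in the ISO string must also appear in the format string.

```python
from datetime import datetime

# Python analog of the TO_DATE example above: the literal 'T' in the ISO
# timestamp appears in the format string as well. Informatica's syntax
# quotes it as "T"; strptime takes it as a literal character.
value = datetime.strptime("2017-11-03T12:45:00", "%Y-%m-%dT%H:%M:%S")
print(value.strftime("%b %d %Y %H:%M:%S"))  # Nov 03 2017 12:45:00
```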

Other Rules and Guidelines

You cannot preview data for a transformation that is configured for windowing.


Updated March 31, 2021