Big Data Management User Guide


Rules and Guidelines for Spark Engine Processing

Some restrictions and guidelines apply to processing Informatica functions on the Spark engine.

When you push a mapping to the Hadoop environment, the engine that processes the mapping uses a set of rules different from the Data Integration Service. As a result, the mapping results can vary based on the rules that the engine uses. This topic contains some processing differences that Informatica discovered through internal testing and usage. Informatica does not test all the rules of the third-party engines and cannot provide an extensive list of the differences.

Consider the following rules and guidelines for function and data type processing on the Spark engine:

The Spark engine and the Data Integration Service process overflow values differently. As a result, mapping results can vary between the native and Hadoop environments when the Spark engine processes an overflow. Consider the following processing variation for Spark:

If an expression results in a numerical error, such as division by zero or SQRT of a negative number, the Spark engine returns an infinite or NaN value. In the native environment, the expression returns null values and the rows do not appear in the output.
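The difference can be sketched in Python. This is only an illustration of the infinite and NaN values mentioned above, not Informatica or Spark code. The function names `spark_style_divide`, `spark_style_sqrt`, and `to_native_style` are hypothetical, and the sketch assumes IEEE 754 floating point behavior for the Spark side.

```python
import math
from typing import Optional


def spark_style_divide(numerator: float, denominator: float) -> float:
    """Divide with IEEE 754 semantics: return inf or NaN instead of failing."""
    if denominator == 0.0:
        if numerator == 0.0 or math.isnan(numerator):
            return math.nan
        # The sign of the result follows the signs of both operands.
        return math.copysign(math.inf, numerator) * math.copysign(1.0, denominator)
    return numerator / denominator


def spark_style_sqrt(value: float) -> float:
    """Return NaN for a negative input instead of raising an error."""
    return math.nan if value < 0 else math.sqrt(value)


def to_native_style(value: float) -> Optional[float]:
    """Map inf and NaN to None, mirroring the null result described above."""
    return None if math.isinf(value) or math.isnan(value) else value
```

A helper such as `to_native_style` might be applied to exported results before you compare output from the two environments, so that infinite and NaN values line up with the null values from the native run.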

The Spark engine and the Data Integration Service process data type conversions differently. As a result, mapping results can vary between the native and Hadoop environments when the Spark engine performs a data type conversion. Consider the following processing variations for Spark:

The results of arithmetic operations on floating point types, such as Decimal, can vary up to 0.1 percent between the native environment and a Hadoop environment.

The Spark engine ignores the scale argument of the TO_DECIMAL function. The function returns a value with the same scale as the input value.

When the scale of a double or decimal value is smaller than the configured scale, the Spark engine trims the trailing zeros.

The Spark engine cannot process dates to the nanosecond. It can return a precision for date/time data up to the microsecond.

The Spark engine does not support high precision. If you enable high precision, the Spark engine processes data in low-precision mode.

If you use Hive 2.3, the Spark engine guarantees scale values.

For example, when the Spark engine processes the decimal 1.1234567 with scale 9 using Hive 2.3, the output is 1.123456700. If you do not use Hive 2.3, the output is 1.1234567.
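The scale and tolerance variations above can be reproduced outside the engine with a short Python sketch. It uses the standard `decimal` module to show what a guaranteed scale and trimmed trailing zeros look like, and `math.isclose` to compare two results within 0.1 percent. The function names are hypothetical, and the sketch makes no claim about how the Spark engine or Hive implements these rules.

```python
import math
from decimal import Decimal


def with_fixed_scale(value: str, scale: int) -> str:
    """Pad or round a decimal string to exactly `scale` fractional digits."""
    quantum = Decimal(1).scaleb(-scale)  # scale 9 gives Decimal('1E-9')
    return format(Decimal(value).quantize(quantum), "f")


def without_trailing_zeros(value: str) -> str:
    """Trim trailing fractional zeros from a decimal string."""
    return format(Decimal(value).normalize(), "f")


def within_tolerance(native: float, hadoop: float, rel_tol: float = 0.001) -> bool:
    """Check whether two results agree to within 0.1 percent by default."""
    return math.isclose(native, hadoop, rel_tol=rel_tol)


print(with_fixed_scale("1.1234567", 9))       # 1.123456700
print(without_trailing_zeros("1.123456700"))  # 1.1234567
```

When you validate a mapping across environments, comparing numeric columns with a relative tolerance rather than strict equality avoids false mismatches that come only from these processing differences.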

The Hadoop environment treats "/n" values as null values. If an aggregate function contains empty or NULL values, the Hadoop environment includes these values while performing an aggregate calculation.

Mapping validation fails if you configure SYSTIMESTAMP with a variable value, such as a port name. The function can either include no argument or the precision to which you want to retrieve the timestamp value.

Mapping validation fails if an output port contains a Timestamp with Time Zone data type.

Avoid including single and nested functions in an Aggregator transformation. The Data Integration Service fails the mapping in the native environment. It can push the processing to the Hadoop environment, but you might get unexpected results. Informatica recommends creating multiple transformations to perform the aggregation.

You cannot preview data for a transformation that is configured for windowing.

The Spark METAPHONE function uses phonetic encoders from the org.apache.commons.codec.language library. When the Spark engine runs a mapping, the METAPHONE function can produce an output that is different from the output in the native environment. The following table shows some examples:

String	Data Integration Service	Spark Engine
Might	MFT	MT
High	HF	H

If you use the TO_DATE function on the Spark engine to process a string written in ISO standard format, you must add T to the date string and "T" to the format string. The following expression shows an example that uses the TO_DATE function to convert a string written in the ISO standard format YYYY-MM-DDTHH24:MI:SS:

TO_DATE('2017-11-03T12:45:00','YYYY-MM-DD"T"HH24:MI:SS')

The following table shows how the function converts the string:

ISO Standard Format	Return Value
2017-11-03T12:45:00	Nov 03 2017 12:45:00
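The same idea, a literal T in both the date string and the format string, can be checked with a short Python analogy. This is not Informatica expression syntax. It uses `datetime.strptime` from the standard library, and the constant and function names are hypothetical. The month abbreviation in the rendered value assumes the default C locale.

```python
from datetime import datetime

# The literal T in the format must match the literal T in the input string.
ISO_INPUT_FORMAT = "%Y-%m-%dT%H:%M:%S"
# Mirrors the layout of the return value shown in the table above.
DISPLAY_FORMAT = "%b %d %Y %H:%M:%S"


def parse_iso_timestamp(text: str) -> datetime:
    """Parse a string such as 2017-11-03T12:45:00 into a datetime."""
    return datetime.strptime(text, ISO_INPUT_FORMAT)


def render(timestamp: datetime) -> str:
    """Render a datetime in the display layout used in the table."""
    return timestamp.strftime(DISPLAY_FORMAT)


print(render(parse_iso_timestamp("2017-11-03T12:45:00")))  # Nov 03 2017 12:45:00
```

If the input string uses a space instead of T, `strptime` raises a `ValueError` with this format, which loosely parallels the requirement that the date string and the format string agree.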

The UUID4 function is supported only when used as an argument in UUID_UNPARSE or ENC_BASE64.

The UUID_UNPARSE function is supported only when the argument is UUID4( ).
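As a rough analogy for the two supported combinations, the following Python sketch generates a version 4 UUID and converts it either to its text form or to Base64. It uses only the standard `uuid` and `base64` modules. The function names are hypothetical, and the assumption that ENC_BASE64 encodes the 16 raw bytes of the UUID is not confirmed by this topic.

```python
import base64
import uuid


def new_uuid_text() -> str:
    """Analogous to UUID_UNPARSE(UUID4()): a 36-character text UUID."""
    return str(uuid.uuid4())


def new_uuid_base64() -> str:
    """Analogous to ENC_BASE64(UUID4()), assuming the raw bytes are encoded."""
    return base64.b64encode(uuid.uuid4().bytes).decode("ascii")
```

In both cases the UUID is generated and converted in a single expression, which matches the rule that UUID4 is used only as an argument to one of these functions.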
