Hadoop Integration Guide



Spark Advanced Properties

Spark advanced properties are advanced or custom properties that are unique to the Spark engine. Each property contains a name and a value. You can add or edit advanced properties.

Configure the following properties in the Advanced Properties of the Spark configuration section:

To edit the property in the text box, use the following format with &: to separate each name-value pair:

<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
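The format above can be sketched in Python as follows. This is a minimal illustration, not part of the product: the function names format_advanced_properties and parse_advanced_properties are hypothetical helpers, and the sketch assumes that no property value contains the &: separator itself.

```python
SEPARATOR = "&:"


def format_advanced_properties(props):
    """Join name-value pairs as name1=value1&:name2=value2..."""
    return SEPARATOR.join(f"{name}={value}" for name, value in props.items())


def parse_advanced_properties(text):
    """Split a string in the same format back into a dict."""
    props = {}
    for pair in text.split(SEPARATOR):
        if not pair.strip():
            continue  # skip empty segments, for example a trailing separator
        # Split on the first "=" only, because values such as Java options
        # can contain "=" characters themselves.
        name, _, value = pair.partition("=")
        props[name.strip()] = value.strip()
    return props
```

For example, format_advanced_properties({"spark.executor.cores": "1", "spark.executor.memory": "3G"}) produces spark.executor.cores=1&:spark.executor.memory=3G.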

spark.authenticate: Enables authentication for the Spark service on Hadoop. Required for Spark encryption.

Set to TRUE.

For example,
spark.authenticate=TRUE

spark.authenticate.enableSaslEncryption: Enables encrypted communication when SASL authentication is enabled. Required if Spark encryption uses SASL authentication.

Set to TRUE.

For example,
spark.authenticate.enableSaslEncryption=TRUE

spark.executor.cores: Indicates the number of cores that each executor process uses to run tasklets on the Spark engine.
Set to:
spark.executor.cores=1

spark.executor.instances: Indicates the number of executor instances that run tasklets on the Spark engine.
Set to:
spark.executor.instances=1

spark.executor.memory: Indicates the amount of memory that each executor process uses to run tasklets on the Spark engine.
Set to:
spark.executor.memory=3G
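To see how the three executor properties combine, the following sketch estimates the total cores and executor memory that a job requests. It assumes that spark.executor.instances is the number of executor processes, as in standard Spark, and it ignores memory overhead that the cluster manager may add. The helper names are hypothetical.

```python
def parse_memory_gb(value):
    """Convert a memory string such as '3G' or '512M' to gigabytes."""
    units = {"K": 1 / (1024 * 1024), "M": 1 / 1024, "G": 1, "T": 1024}
    number, unit = value[:-1], value[-1].upper()
    return float(number) * units[unit]


def total_executor_resources(props):
    """Estimate the cores and memory requested across all executors."""
    instances = int(props["spark.executor.instances"])
    cores = int(props["spark.executor.cores"])
    memory_gb = parse_memory_gb(props["spark.executor.memory"])
    return {"cores": instances * cores, "memory_gb": instances * memory_gb}
```

With the values shown above (one instance, one core, 3G), the estimate is one core and 3 GB in total.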

infaspark.driver.cluster.mode.extraJavaOptions: List of extra Java options for the Spark driver that runs inside the cluster. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.

For example, set to:

infaspark.driver.cluster.mode.extraJavaOptions= -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true -Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf -Djava.security.auth.login.config=/<path to JAAS config>/kafka_client_jaas.config

To configure the property for a specific user, you can include the following lines of code:

infaspark.driver.cluster.mode.extraJavaOptions = -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500 -Djava.security.krb5.conf=/etc/krb5.conf

infaspark.executor.extraJavaOptions: List of extra Java options for the Spark executor. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.

For example, set to:

infaspark.executor.extraJavaOptions= -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true -Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf -Djava.security.auth.login.config=/<path to JAAS config>/kafka_client_jaas.config

To configure the property for a specific user, you can include the following lines of code:

infaspark.executor.extraJavaOptions = -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500 -Djava.security.krb5.conf=/etc/krb5.conf
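Because the driver and executor examples for a Kerberos-enabled Kafka cluster share the same Java options, one way to keep them in sync is to build both values from a single list. The sketch below is an illustration only: the function names are hypothetical, and the two paths are arguments that you supply for your environment.

```python
def kafka_kerberos_java_options(krb5_conf, jaas_config):
    """Build the Java options string from the Kerberos example above."""
    options = [
        "-Djava.security.egd=file:/dev/./urandom",
        "-XX:MaxMetaspaceSize=256M",
        "-Djavax.security.auth.useSubjectCredsOnly=true",
        f"-Djava.security.krb5.conf={krb5_conf}",
        f"-Djava.security.auth.login.config={jaas_config}",
    ]
    return " ".join(options)


def driver_and_executor_properties(krb5_conf, jaas_config):
    """Return both properties with identical option strings."""
    options = kafka_kerberos_java_options(krb5_conf, jaas_config)
    return {
        "infaspark.driver.cluster.mode.extraJavaOptions": options,
        "infaspark.executor.extraJavaOptions": options,
    }
```

For example, driver_and_executor_properties("/etc/krb5.conf", "/etc/kafka/kafka_client_jaas.config") returns both properties with the same value. The JAAS path in this call is an assumed location, not one that the guide specifies.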

infaspark.flatfile.writer.nullValue: When the Databricks Spark engine writes to a target, it converts null values to empty strings (""). For example, 12,AB,"",23p09udj. The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an empty string to a non-string column, the mapping fails with a type mismatch.
To allow the Databricks Spark engine to convert the empty strings back to null values and write to the target, configure the following advanced property in the Databricks Spark connection:

infaspark.flatfile.writer.nullValue=true

spark.hadoop.validateOutputSpecs: Validates whether the HBase table exists. Required for streaming mappings to write to an HBase target in an Amazon EMR cluster. Set the value to false.

infaspark.json.parser.mode: Specifies how the parser handles corrupt JSON records. You can set the value to one of the following modes:

DROPMALFORMED. The parser ignores all corrupted records. Default mode.
PERMISSIVE. The parser accepts non-standard fields as nulls in corrupted records.
FAILFAST. The parser generates an exception when it encounters a corrupted record and the Spark application goes down.
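The difference between the three modes can be illustrated with a simplified simulation in plain Python. This is not the parser that the Spark engine uses. The parse_records function and its fields argument are hypothetical, and the sketch assumes that a corrupted record is any line that is not a valid JSON object.

```python
import json

MODES = ("DROPMALFORMED", "PERMISSIVE", "FAILFAST")


def parse_records(lines, fields, mode="DROPMALFORMED"):
    """Parse JSON lines into rows, handling corrupted lines according to mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("record is not a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "DROPMALFORMED":
                continue  # ignore the corrupted record
            if mode == "PERMISSIVE":
                rows.append({field: None for field in fields})  # keep it as nulls
                continue
            raise  # FAILFAST: stop at the first corrupted record
        rows.append({field: record.get(field) for field in fields})
    return rows
```

Given one valid line and one truncated line, DROPMALFORMED returns one row, PERMISSIVE returns two rows with the second row set to nulls, and FAILFAST raises an exception.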

infaspark.json.parser.multiLine: Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.

infaspark.pythontx.exec: Required to run a Python transformation on the Spark engine for Big Data Management. The location of the Python executable binary on the worker nodes in the Hadoop cluster.

For example, set to:
infaspark.pythontx.exec=/usr/bin/python3.4

If you use the installation of Python on the Data Integration Service machine, set the value to the Python executable binary in the Informatica installation directory on the Data Integration Service machine.

For example, set to:
infaspark.pythontx.exec=INFA_HOME/services/shared/spark/python/lib/python3.4

infaspark.pythontx.executorEnv.PYTHONHOME: Required to run a Python transformation on the Spark engine for Big Data Management and Big Data Streaming. The location of the Python installation directory on the worker nodes in the Hadoop cluster.

If the Python installation directory on the worker nodes is in a directory such as usr/lib/python, set the property to the following value:
infaspark.pythontx.executorEnv.PYTHONHOME=usr/lib/python

If you use the installation of Python on the Data Integration Service machine, use the location of the Python installation directory on the Data Integration Service machine.

For example, set the property to the following value:
infaspark.pythontx.executorEnv.PYTHONHOME=INFA_HOME/services/shared/spark/python/

infaspark.pythontx.executorEnv.LD_PRELOAD: Required to run a Python transformation on the Spark engine for Big Data Streaming. The location of the Python shared library in the Python installation folder on the Data Integration Service machine.

For example, set to:

infaspark.pythontx.executorEnv.LD_PRELOAD=INFA_HOME/services/shared/spark/python/lib/libpython3.6m.so

infaspark.pythontx.submit.lib.JEP_HOME: Required to run a Python transformation on the Spark engine for Big Data Streaming. The location of the Jep package in the Python installation folder on the Data Integration Service machine.

For example, set to:
infaspark.pythontx.submit.lib.JEP_HOME=INFA_HOME/services/shared/spark/python/lib/python3.6/site-packages/jep/
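The Python transformation properties differ by product. The following sketch summarizes the requirements stated in the descriptions above and reports which properties are still missing from a configuration. The names REQUIRED_PYTHON_PROPERTIES and missing_python_properties are hypothetical and exist only for this illustration.

```python
# Requirements as stated in the property descriptions above.
REQUIRED_PYTHON_PROPERTIES = {
    "Big Data Management": [
        "infaspark.pythontx.exec",
        "infaspark.pythontx.executorEnv.PYTHONHOME",
    ],
    "Big Data Streaming": [
        "infaspark.pythontx.executorEnv.PYTHONHOME",
        "infaspark.pythontx.executorEnv.LD_PRELOAD",
        "infaspark.pythontx.submit.lib.JEP_HOME",
    ],
}


def missing_python_properties(props, product):
    """Return the required property names that are absent or empty."""
    return [
        name
        for name in REQUIRED_PYTHON_PROPERTIES[product]
        if not props.get(name)
    ]
```

For example, a configuration that sets only infaspark.pythontx.exec is missing infaspark.pythontx.executorEnv.PYTHONHOME for Big Data Management.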

spark.shuffle.encryption.enabled: Enables encrypted communication when authentication is enabled. Required for Spark encryption.

Set to TRUE.

For example,
spark.shuffle.encryption.enabled=TRUE
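The properties that this section marks as required for Spark encryption can be combined into one Advanced Properties string. The sketch below is a hedged illustration: the function name spark_encryption_properties and the uses_sasl flag are hypothetical, and the &: separator follows the format described earlier in this section.

```python
SEPARATOR = "&:"


def spark_encryption_properties(uses_sasl=True):
    """Build the encryption-related name-value pairs as one string."""
    props = {"spark.authenticate": "TRUE"}
    if uses_sasl:
        # Needed only when Spark encryption uses SASL authentication.
        props["spark.authenticate.enableSaslEncryption"] = "TRUE"
    props["spark.shuffle.encryption.enabled"] = "TRUE"
    return SEPARATOR.join(f"{name}={value}" for name, value in props.items())
```

With SASL authentication, the result is spark.authenticate=TRUE&:spark.authenticate.enableSaslEncryption=TRUE&:spark.shuffle.encryption.enabled=TRUE.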

spark.scheduler.maxRegisteredResourcesWaitingTime: The number of milliseconds to wait for resources to register before scheduling a task. Default is 30000. Decrease the value to reduce delays before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.

Set to 15000.

For example,
spark.scheduler.maxRegisteredResourcesWaitingTime=15000

spark.scheduler.minRegisteredResourcesRatio: The minimum ratio of registered resources to acquire before task scheduling begins. Default is 0.8. Decrease the value to reduce any delay before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.

Set to 0.5.

For example,
spark.scheduler.minRegisteredResourcesRatio=0.5
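The two scheduler properties work together: task scheduling can begin when the registered resources reach the minimum ratio or when the waiting time elapses, whichever comes first. The following sketch models that rule. It is a simplified illustration rather than Spark's scheduler code, and the function name and its arguments are hypothetical.

```python
def can_start_scheduling(registered, requested, waited_ms,
                         min_ratio=0.8, max_wait_ms=30000):
    """Return True when task scheduling may begin.

    The defaults match the default values described above.
    """
    if requested <= 0:
        return True  # nothing to wait for
    ratio_reached = registered / requested >= min_ratio
    timed_out = waited_ms >= max_wait_ms
    return ratio_reached or timed_out
```

For example, with 5 of 10 requested resources registered after 1000 milliseconds, scheduling does not begin under the defaults but does begin with min_ratio=0.5. With only 2 of 10 registered, scheduling begins after 15000 milliseconds when max_wait_ms=15000, instead of after 30000 milliseconds.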
