Spark session properties

When you create a mapping task that is based on a mapping in advanced mode, you can configure optional Spark session properties.

The default values for the Spark session properties are based on best practices and the average computational requirements of typical mapping tasks. If the default values do not fit the requirements of a specific mapping task, reconfigure the properties to override the defaults.

To get an optimal set of Spark session properties for a mapping task, see CLAIRE Tuning.
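Most of the properties in the list below are standard Apache Spark configuration options. For reference only, the following PySpark sketch applies a few of the documented defaults to an open-source SparkSession. In a mapping task you set these values in the task's Spark session properties rather than in code; the application name and local master here are assumptions made for the sketch.

from pyspark.sql import SparkSession

# Reference sketch only: a few of the properties described below, applied
# to an open-source SparkSession with their documented default values.
# The app name and local master are assumptions for local experimentation.
spark = (
    SparkSession.builder
    .appName("spark-session-properties-sketch")
    .master("local[*]")
    .config("spark.executor.memory", "6g")                        # default is 6G
    .config("spark.executor.cores", "2")                          # default is 2
    .config("spark.sql.shuffle.partitions", "100")                # default is 100
    .config("spark.sql.autoBroadcastJoinThreshold", "256000000")  # default, in bytes
    .getOrCreate()
)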
The following list describes the Spark session properties:

infaspark.sql.forcePersist
  Indicates whether data persists in memory to avoid repeated read operations. For example, the Router transformation can avoid repeated read operations on output groups.
  Default is false.

spark.driver.extraJavaOptions
  Additional JVM options for the Spark driver process.
  Default is -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500.

spark.driver.maxResultSize
  Maximum total size of the serialized results of all partitions for each Spark action.
  Default is 4G.

spark.driver.memory
  Amount of memory for the Spark driver process.
  Default is 4G.

spark.dynamicAllocation.maxExecutors
  Maximum number of Spark executors when dynamic allocation is enabled.
  Default is 1000. The value is calculated automatically.

spark.executor.cores
  Number of cores that each Spark executor uses.
  Default is 2.

spark.executor.extraJavaOptions
  Additional JVM options for Spark executors.
  Default is -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500.

spark.executor.memory
  Amount of memory for each Spark executor.
  Default is 6G.

spark.memory.fraction
  Fraction of the heap that is allocated to the Spark engine. When set to 1, the Spark engine uses the full heap space except for 300 MB of reserved memory. See the worked example after this list.
  Default is 0.6.

spark.memory.storageFraction
  Fraction of the Spark engine memory that is used for storage rather than for processing data.
  Default is 0.5.

spark.rdd.compress
  Indicates whether to compress serialized RDD partitions.
  Default is false.

spark.reducer.maxSizeInFlight
  Maximum size of the data that each reduce task can receive from a map task while shuffling data. The size acts as a network buffer to make sure that the reduce task has enough memory for the shuffled data.
  Default is 48M.

spark.shuffle.file.buffer
  Size of the in-memory buffer that each map task uses to write the intermediate shuffle output.
  Default is 32K.

spark.sql.autoBroadcastJoinThreshold
  Threshold in bytes for using a broadcast join. When the Spark engine uses a broadcast join, the Spark driver sends the data to the Spark executors that run on the advanced cluster, which avoids shuffling and results in better performance.
  In some situations, such as when a mapping task processes columnar formats or delimited files, a broadcast join can cause memory issues at the Spark driver level. To resolve the issues, try reducing the broadcast join threshold to 10 MB (10485760 bytes), increasing the Spark driver memory, or disabling broadcast join.
  Default is 256000000. To disable broadcast join, set the value to -1.

spark.sql.broadcastTimeout
  Timeout in seconds for broadcast joins.
  Default is 300.

spark.sql.shuffle.partitions
  Number of partitions that Spark uses to shuffle data when it processes joins or aggregations.
  Default is 100.

spark.custom.property
  Custom Spark session properties. Use &: to separate custom properties. See the example after this list.
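To make the interaction between spark.memory.fraction and spark.memory.storageFraction concrete, here is a minimal sketch of the arithmetic, assuming the default 6G executor heap and the default fractions above. The 300 MB value matches the reserved memory described for spark.memory.fraction.

# Sketch of the unified memory arithmetic, assuming the defaults above.
heap_mb = 6 * 1024        # spark.executor.memory = 6G
reserved_mb = 300         # reserved memory that the Spark engine never uses
fraction = 0.6            # spark.memory.fraction
storage_fraction = 0.5    # spark.memory.storageFraction

unified_mb = (heap_mb - reserved_mb) * fraction  # heap available to the Spark engine
storage_mb = unified_mb * storage_fraction       # portion used for storage (caching)
execution_mb = unified_mb - storage_mb           # portion used for processing data

print(f"{unified_mb:.0f} MB unified, {storage_mb:.0f} MB storage, {execution_mb:.0f} MB execution")
# 3506 MB unified, 1753 MB storage, 1753 MB execution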
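And here is a hypothetical spark.custom.property value that passes two standard Spark properties not covered by the list above. The property names are examples only; the &: separator is the one described for spark.custom.property.

# Hypothetical spark.custom.property value: two example Spark properties
# joined with the "&:" separator.
custom_property = (
    "spark.sql.adaptive.enabled=true"
    "&:spark.serializer=org.apache.spark.serializer.KryoSerializer"
)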
