Spark session properties

When you create a mapping task that is based on a mapping in advanced mode, you can configure optional Spark session properties.

The default values for the Spark session properties are based on best practices and the average computational requirements of typical mapping tasks. If the default values do not fit the requirements of a specific mapping task, reconfigure the properties to override the defaults.

To get an optimal set of Spark session properties for a mapping task, see CLAIRE Tuning.
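Most of the properties in the list below are standard Apache Spark configuration options. For reference only, the following PySpark sketch applies a few of the documented defaults to an open-source SparkSession. In a mapping task you set these values in the task's Spark session properties rather than in code; the application name and local master here are assumptions made for the sketch.

from pyspark.sql import SparkSession

# Reference sketch only: a few of the properties described below, applied
# to an open-source SparkSession with their documented default values.
# The app name and local master are assumptions for local experimentation.
spark = (
    SparkSession.builder
    .appName("spark-session-properties-sketch")
    .master("local[*]")
    .config("spark.executor.memory", "6g")                        # default is 6G
    .config("spark.executor.cores", "2")                          # default is 2
    .config("spark.sql.shuffle.partitions", "100")                # default is 100
    .config("spark.sql.autoBroadcastJoinThreshold", "256000000")  # default, in bytes
    .getOrCreate()
)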
The following list describes the Spark session properties:

infaspark.sql.forcePersist
  Indicates whether data persists in memory to avoid repeated read operations. For example, the Router transformation can avoid repeated read operations on output groups.
  Default is false.

spark.driver.extraJavaOptions
  Additional JVM options for the Spark driver process.
  Default is -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500.

spark.driver.maxResultSize
  Maximum total size of the serialized results of all partitions for each Spark action.
  Default is 4G.

spark.driver.memory
  Amount of memory for the Spark driver process.
  Default is 4G.

spark.dynamicAllocation.maxExecutors
  Maximum number of Spark executors when dynamic allocation is enabled.
  Default is 1000. The value is calculated automatically.

spark.executor.cores
  Number of cores that each Spark executor uses.
  Default is 2.

spark.executor.extraJavaOptions
  Additional JVM options for Spark executors.
  Default is -Djava.security.egd=file:/dev/./urandom -XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500.

spark.executor.memory
  Amount of memory for each Spark executor.
  Default is 6G.

spark.memory.fraction
  Fraction of the heap that is allocated to the Spark engine. When set to 1, the Spark engine uses the full heap space except for 300 MB of reserved memory. See the worked example after this list.
  Default is 0.6.

spark.memory.storageFraction
  Fraction of the Spark engine memory that is used for storage rather than for processing data.
  Default is 0.5.

spark.rdd.compress
  Indicates whether to compress serialized RDD partitions.
  Default is false.

spark.reducer.maxSizeInFlight
  Maximum size of the data that each reduce task can receive from a map task while shuffling data. The size acts as a network buffer to make sure that the reduce task has enough memory for the shuffled data.
  Default is 48M.

spark.shuffle.file.buffer
  Size of the in-memory buffer that each map task uses to write the intermediate shuffle output.
  Default is 32K.

spark.sql.autoBroadcastJoinThreshold
  Threshold in bytes for using a broadcast join. When the Spark engine uses a broadcast join, the Spark driver sends the data to the Spark executors that run on the advanced cluster, which avoids shuffling and results in better performance.
  In some situations, such as when a mapping task processes columnar formats or delimited files, a broadcast join can cause memory issues at the Spark driver level. To resolve the issues, try reducing the broadcast join threshold to 10 MB (10485760 bytes), increasing the Spark driver memory, or disabling broadcast join.
  Default is 256000000. To disable broadcast join, set the value to -1.

spark.sql.broadcastTimeout
  Timeout in seconds for broadcast joins.
  Default is 300.

spark.sql.shuffle.partitions
  Number of partitions that Spark uses to shuffle data when it processes joins or aggregations.
  Default is 100.

spark.custom.property
  Custom Spark session properties. Use &: to separate custom properties. See the example after this list.
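To make the interaction between spark.memory.fraction and spark.memory.storageFraction concrete, here is a minimal sketch of the arithmetic, assuming the default 6G executor heap and the default fractions above. The 300 MB value matches the reserved memory described for spark.memory.fraction.

# Sketch of the unified memory arithmetic, assuming the defaults above.
heap_mb = 6 * 1024        # spark.executor.memory = 6G
reserved_mb = 300         # reserved memory that the Spark engine never uses
fraction = 0.6            # spark.memory.fraction
storage_fraction = 0.5    # spark.memory.storageFraction

unified_mb = (heap_mb - reserved_mb) * fraction  # heap available to the Spark engine
storage_mb = unified_mb * storage_fraction       # portion used for storage (caching)
execution_mb = unified_mb - storage_mb           # portion used for processing data

print(f"{unified_mb:.0f} MB unified, {storage_mb:.0f} MB storage, {execution_mb:.0f} MB execution")
# 3506 MB unified, 1753 MB storage, 1753 MB execution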
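And here is a hypothetical spark.custom.property value that passes two standard Spark properties not covered by the list above. The property names are examples only; the &: separator is the one described for spark.custom.property.

# Hypothetical spark.custom.property value: two example Spark properties
# joined with the "&:" separator.
custom_property = (
    "spark.sql.adaptive.enabled=true"
    "&:spark.serializer=org.apache.spark.serializer.KryoSerializer"
)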
