Tuning and Sizing Guidelines for Data Engineering Integration (10.4.x)

Spark Configuration

Configure properties for the Spark engine in the Hadoop connection. Use the following properties to tune job execution on the Spark engine:
spark.executor.memory
Amount of memory to use per executor process. Specify a value with a size unit suffix "k", "m", "g", or "t". For example, 512k or 1g. Default: 1 GB.
spark.executor.cores
The number of cores to use on each executor. Default: 1.
infaspark.shuffle.max.partitions
Sets the number of shuffle partitions to the maximum number of partitions seen across all input sources. Default: 10000.
Recommended value: Allocate approximately 8 dynamic shuffle partitions for each gigabyte of shuffle data. For example, for 400 GB of shuffle data, set this value to 3200 (400 x 8). See the sketch after this property list.
For columnar formats such as ORC on Hortonworks or Parquet on Cloudera, you might set this property to a lower value.
If less than approximately 250 GB of data is shuffled mid-stream, you can reduce infaspark.shuffle.max.partitions to 1000 for better performance.
spark.driver.memory
The amount of memory for the driver process. Default: 4 GB. The driver requires more memory as the number of data sources and data nodes increases.
Recommended value: Allocate at least 256 MB for every data source that participates in a map join. For example, if a mapping has eight data sources, set the driver memory to at least 2 GB (8 x 256 MB).
spark.driver.maxResultSize
Limits the total size, in bytes, of the serialized results of all partitions for each Spark action. Set this value to at least 1M, or to 0 for unlimited. Jobs are aborted if the total size exceeds this limit. Default: 1 GB.
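The following Python sketch expresses the two sizing rules above as simple helpers. The function names are hypothetical, for illustration only; they are not part of any Informatica or Spark API.

    # Sizing rules of thumb from this section, expressed as simple helpers.
    # The helper names are hypothetical, for illustration only.

    def recommended_shuffle_partitions(shuffle_data_gb: float) -> int:
        # Approximately 8 dynamic shuffle partitions per GB of shuffle data.
        # For mid-stream shuffles under ~250 GB, 1000 partitions often
        # perform better than the computed value.
        if shuffle_data_gb < 250:
            return 1000
        return int(8 * shuffle_data_gb)

    def min_driver_memory_mb(map_join_sources: int) -> int:
        # Allocate at least 256 MB per data source in a map join.
        # The driver default is 4 GB; raise it when this floor exceeds it.
        return 256 * map_join_sources

    # Worked examples from the text:
    assert recommended_shuffle_partitions(400) == 3200   # 400 GB x 8
    assert min_driver_memory_mb(8) == 2048               # 8 x 256 MB = 2 GB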
The following table lists the tuning recommendations for the sandbox, basic, standard, and advanced deployment types:

Property                            Sandbox   Basic   Standard   Advanced (Default)
----------------------------------  --------  ------  ---------  ------------------
spark.executor.memory               2 GB      4 GB    6 GB       6 GB
spark.executor.cores                2         2       2          2
infaspark.shuffle.max.partitions    800       800     4000       10000
(8 per GB of shuffle data)
spark.driver.memory                 1 GB      2 GB    4 GB       4 GB+ (default)
spark.driver.maxResultSize          1 GB      1 GB    2 GB       4 GB+ (default)
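As a concrete illustration, the standard-deployment column might translate into the Spark properties below. In Data Engineering Integration you set these as advanced properties on the Hadoop connection rather than in application code; the PySpark form is only a sketch that groups the property names and values together.

    # Illustrative only: the "Standard Deployment" column expressed as Spark
    # properties. In practice, set these on the Hadoop connection in
    # Data Engineering Integration, not in application code. Note that
    # driver settings take effect only if supplied before the driver JVM
    # starts (for example, in spark-defaults.conf or on spark-submit).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("standard-deployment-sizing")  # hypothetical app name
        .config("spark.executor.memory", "6g")
        .config("spark.executor.cores", "2")
        .config("infaspark.shuffle.max.partitions", "4000")
        .config("spark.driver.memory", "4g")
        .config("spark.driver.maxResultSize", "2g")
        .getOrCreate()
    )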
