Tuning and Sizing Guidelines for Data Engineering Integration (10.4.x)

Spark Configuration

Configure properties for the Spark engine in the Hadoop connection. Use the following properties to tune job execution on the Spark engine:
spark.executor.memory
Amount of memory to use per executor process. Specify a value with a size unit suffix "k", "m", "g", or "t". For example, 512k or 1g. Default: 1 GB.
spark.executor.cores
The number of cores to use on each executor. Default: 1.
infaspark.shuffle.max.partitions
Sets the number of shuffle partitions to the maximum number of partitions seen across all input sources. Default: 10000.
Recommended value: Allocate approximately 8 dynamic shuffle partitions for each gigabyte of shuffle data. For example, for 400 GB of shuffle data, set this value to 3200 (400 x 8). See the sketch after this property list.
For columnar formats such as ORC on Hortonworks or Parquet on Cloudera, you might set this property to a lower value.
If less than approximately 250 GB of data is shuffled mid-stream, you can reduce infaspark.shuffle.max.partitions to 1000 for better performance.
spark.driver.memory
The amount of memory for the driver process. Default: 4 GB. The driver requires more memory as the number of data sources and data nodes increases.
Recommended value: Allocate at least 256 MB for every data source that participates in a map join. For example, if a mapping has eight data sources, set the driver memory to at least 2 GB (8 x 256 MB).
spark.driver.maxResultSize
Limits the total size, in bytes, of the serialized results of all partitions for each Spark action. Set this value to at least 1M, or to 0 for unlimited. Jobs are aborted if the total size exceeds this limit. Default: 1 GB.
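The following Python sketch expresses the two sizing rules above as simple helpers. The function names are hypothetical, for illustration only; they are not part of any Informatica or Spark API.

    # Sizing rules of thumb from this section, expressed as simple helpers.
    # The helper names are hypothetical, for illustration only.

    def recommended_shuffle_partitions(shuffle_data_gb: float) -> int:
        # Approximately 8 dynamic shuffle partitions per GB of shuffle data.
        # For mid-stream shuffles under ~250 GB, 1000 partitions often
        # perform better than the computed value.
        if shuffle_data_gb < 250:
            return 1000
        return int(8 * shuffle_data_gb)

    def min_driver_memory_mb(map_join_sources: int) -> int:
        # Allocate at least 256 MB per data source in a map join.
        # The driver default is 4 GB; raise it when this floor exceeds it.
        return 256 * map_join_sources

    # Worked examples from the text:
    assert recommended_shuffle_partitions(400) == 3200   # 400 GB x 8
    assert min_driver_memory_mb(8) == 2048               # 8 x 256 MB = 2 GB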
The following table lists the tuning recommendations for the sandbox, basic, standard, and advanced deployment types:

Property                            Sandbox   Basic   Standard   Advanced (Default)
----------------------------------  --------  ------  ---------  ------------------
spark.executor.memory               2 GB      4 GB    6 GB       6 GB
spark.executor.cores                2         2       2          2
infaspark.shuffle.max.partitions    800       800     4000       10000
(8 per GB of shuffle data)
spark.driver.memory                 1 GB      2 GB    4 GB       4 GB+ (default)
spark.driver.maxResultSize          1 GB      1 GB    2 GB       4 GB+ (default)
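As a concrete illustration, the standard-deployment column might translate into the Spark properties below. In Data Engineering Integration you set these as advanced properties on the Hadoop connection rather than in application code; the PySpark form is only a sketch that groups the property names and values together.

    # Illustrative only: the "Standard Deployment" column expressed as Spark
    # properties. In practice, set these on the Hadoop connection in
    # Data Engineering Integration, not in application code. Note that
    # driver settings take effect only if supplied before the driver JVM
    # starts (for example, in spark-defaults.conf or on spark-submit).
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("standard-deployment-sizing")  # hypothetical app name
        .config("spark.executor.memory", "6g")
        .config("spark.executor.cores", "2")
        .config("infaspark.shuffle.max.partitions", "4000")
        .config("spark.driver.memory", "4g")
        .config("spark.driver.maxResultSize", "2g")
        .getOrCreate()
    )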
