Sizing Guidelines and Performance Tuning for Big Data Streaming 10.2.1

Tune Spark Parameters

Tune the Spark parameters in the Hadoop connection.
You can configure the following parameters based on the input data rate, mapping complexity, and concurrency of mappings:
spark.executor.cores
The number of cores to use on each executor.
Recommended value: Specify 3 to 4 cores for each executor. Specifying a higher number of cores might lead to performance degradation.
spark.executor.memory
The amount of memory to use for each executor process.
Recommended value: Specify a value of 8 GB.
spark.driver.memory
The amount of memory to use for the driver process.
Recommended value: Specify a value of 8 GB.
spark.driver.cores
The number of cores to use for each driver process.
Recommended value: Specify 8 cores.
spark.executor.instances
The total number of executors to start. This number depends on the number of machines in the cluster, the memory allocated, and the number of cores per machine.
Configure the number of executor instances based on the following deployment types (a rough resource estimate follows the list):
  • Sandbox deployment. 4
  • Small deployment. 14
  • Medium deployment. 27
  • Large deployment. 262
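As a rough check of what these executor counts imply for cluster capacity, the following estimates multiply the instance counts above by the per-executor recommendations given earlier in this section (4 cores per executor, the upper end of the recommended range, and 8 GB per executor). These totals are approximations, not exact cluster requirements:
  • Sandbox deployment. 4 executors x 4 cores = 16 executor cores; 4 x 8 GB = 32 GB executor memory
  • Small deployment. 14 executors x 4 cores = 56 executor cores; 14 x 8 GB = 112 GB executor memory
  • Medium deployment. 27 executors x 4 cores = 108 executor cores; 27 x 8 GB = 216 GB executor memory
  • Large deployment. 262 executors x 4 cores = 1048 executor cores; 262 x 8 GB = 2096 GB executor memory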
spark.sql.shuffle.partitions
The total number of partitions used for a SQL shuffle operation.
Recommended value: Specify a value equal to the total number of executor cores allocated, up to a maximum of 200.
Configure the partitions based on the following deployment types (the arithmetic behind these values is shown after the list):
  • Sandbox deployment. 16
  • Small deployment. 56
  • Medium deployment. 108
  • Large deployment. 200
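These partition counts follow from the executor sizing above: total executor cores = executor instances x cores per executor (assuming 4 cores per executor, the upper end of the recommended range), capped at 200.
  • Sandbox deployment. 4 executors x 4 cores = 16 partitions
  • Small deployment. 14 executors x 4 cores = 56 partitions
  • Medium deployment. 27 executors x 4 cores = 108 partitions
  • Large deployment. 262 executors x 4 cores = 1048, capped at the maximum of 200 partitions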
spark.kryo.registrationRequired
Indicates whether registration with Kryo is required.
Recommended value: True
spark.kryo.classesToRegister
The comma-separated list of custom class names to register with Kryo if you use Kryo serialization.
Specify the following value for all deployment types:
org.apache.spark.sql.catalyst.expressions.GenericRow, [Ljava.lang.Object;, org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema, org.apache.spark.sql.types.StructType, [Lorg.apache.spark.sql.types.StructField;, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StringType$, org.apache.spark.sql.types.Metadata, scala.collection.immutable.Map$EmptyMap$, [Lorg.apache.spark.sql.catalyst.InternalRow;, scala.reflect.ClassTag$$anon$1, java.lang.Class
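As an illustration, the recommended settings for a small deployment might be entered as the following key=value pairs in the Spark advanced properties of the Hadoop connection. The values mirror the recommendations above; the exact entry format depends on your Hadoop connection configuration, and spark.kryo.classesToRegister must contain the full class list shown earlier:
  spark.executor.cores=4
  spark.executor.memory=8g
  spark.driver.memory=8g
  spark.driver.cores=8
  spark.executor.instances=14
  spark.sql.shuffle.partitions=56
  spark.kryo.registrationRequired=true
  spark.kryo.classesToRegister=<full class list shown above>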
ExecutionContextOptions.Spark.StreamingDropEmptyBatches
To prevent Spark from creating jobs and tasks when there are no messages to be processed in a batch, set this parameter to true. You can configure this property on the Custom Properties tab of the Data Integration Service.
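For example, the custom property might be specified as the following name/value pair; the exact entry format depends on how custom properties are defined for your Data Integration Service:
  ExecutionContextOptions.Spark.StreamingDropEmptyBatches=true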
