Sizing Guidelines and Performance Tuning for Big Data Streaming 10.2.1

3. Tune the Spark Engine

To optimize Big Data Streaming performance, tune the Spark parameters. To tune the parameters globally, configure the Hadoop connection advanced properties for the Spark engine in the Developer tool or the Administrator tool. To tune the parameters for specific mappings, configure the execution parameters of the streaming mapping run-time properties in the Developer tool.
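In the Hadoop connection advanced properties, each Spark parameter is entered as a name-value pair. The following entry is illustrative only; adjust the value for your deployment:
  spark.executor.cores=4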
You can configure the following parameters based on the input data rate, mapping complexity, and concurrency of mappings:
spark.executor.cores
The number of cores that each executor process uses to run tasklets.
Recommended value: Specify 3 to 4 cores for each executor. Specifying a higher number of cores might lead to performance degradation.
spark.executor.instances
The number of executor processes that the Spark engine uses to run tasklets.
Configure the number of executor instances based on the following deployment types:
  • Sandbox deployment. 4
  • Small deployment. 14
  • Medium deployment. 27
  • Large deployment. 262
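For example, a medium deployment that follows these guidelines and the recommendation of 4 cores for each executor might use the following settings. The pairing is illustrative; adjust it to your cluster capacity:
  spark.executor.instances=27
  spark.executor.cores=4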
For a mapping with multiple targets, you can configure the spark.executor.instances and spark.executor.cores parameters such that the product of their values equals the total number of cores required to process the targets in the mapping. For example, consider a mapping with two targets, Target1 and Target2. If Target1 requires 10 cores and Target2 requires 15 cores, the mapping requires a total of 25 cores. Configure the spark.executor.instances and spark.executor.cores parameters so that the product of their values equals 25.
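Continuing the example, one hypothetical combination whose product equals 25 is 5 executor instances with 5 cores each:
  spark.executor.instances=5
  spark.executor.cores=5
Note that 5 cores for each executor slightly exceeds the 3 to 4 cores recommended above. An alternative that stays within that guideline is 7 instances with 4 cores each, which provides 28 cores.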
spark.executor.memory
The amount of memory that each executor process uses to run tasklets.
Recommended value: Specify a value of 8 GB.
spark.driver.memory
The driver process memory that the Spark engine uses to run mapping jobs.
Recommended value: Specify a value of 8 GB.
spark.driver.cores
The number of cores to use for each driver process.
Recommended value: Specify 8 cores.
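Taken together, the recommended executor memory, driver memory, and driver core values can be entered as the following name-value pairs:
  spark.executor.memory=8G
  spark.driver.memory=8G
  spark.driver.cores=8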
spark.sql.shuffle.partitions
The number of partitions that the Spark engine uses to shuffle data to process joins or aggregations in a mapping job.
Recommended value: Specify a value equal to the total number of executor cores allocated, up to a maximum of 200.
Configure the partitions based on the following deployment types:
  • Sandbox deployment. 16
  • Small deployment. 56
  • Medium deployment. 108
  • Large deployment. 200
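These values equal the total executor cores implied by the deployment sizes listed earlier, assuming 4 cores for each executor. For example, a medium deployment provides 27 x 4 = 108 executor cores, and a large deployment exceeds the cap, so it uses the maximum of 200. A medium deployment would therefore set:
  spark.sql.shuffle.partitions=108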
spark.kryo.registrationRequired
Indicates whether registration with Kryo is required.
Recommended value: True
spark.kryo.classesToRegister
The comma-separated list of custom class names to register with Kryo if you use Kryo serialization.
Specify the following value for all deployment types:
org.apache.spark.sql.catalyst.expressions.GenericRow, [Ljava.lang.Object;, org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema, org.apache.spark.sql.types.StructType, [Lorg.apache.spark.sql.types.StructField;, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StringType$, org.apache.spark.sql.types.Metadata, scala.collection.immutable.Map$EmptyMap$, [Lorg.apache.spark.sql.catalyst.InternalRow;, scala.reflect.ClassTag$$anon$1, java.lang.Class
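For example, the two Kryo parameters can be entered together as follows. The class list is abbreviated here for readability; specify the full list shown above:
  spark.kryo.registrationRequired=true
  spark.kryo.classesToRegister=org.apache.spark.sql.catalyst.expressions.GenericRow,[Ljava.lang.Object;,...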
infaspark.lookup.persist.enabled
Enables lookup data caching.
To increase the performance of a cached Hive lookup, set this parameter to true.
infaspark.lookup.repartition.partitions
Partitions lookup data for better performance.
To increase the performance of a cached Hive lookup, set this parameter to any value greater than 1.
infaspark.lookup.hbase.batchsize
The number of records to fetch at a time from the HBase lookup database.
Default value: 1500
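As an illustrative sketch, a mapping that performs cached Hive lookups and HBase lookups might combine the lookup parameters as follows. The repartition value of 8 is an arbitrary example; any value greater than 1 partitions the lookup data:
  infaspark.lookup.persist.enabled=true
  infaspark.lookup.repartition.partitions=8
  infaspark.lookup.hbase.batchsize=1500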
ExecutionContextOptions.Spark.StreamingDropEmptyBatches
To prevent Spark from creating jobs and tasks when there are no messages to be processed in a batch, set this parameter to true. You can configure this property on the Custom Properties tab of the Data Integration Service.
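For example, on the Custom Properties tab of the Data Integration Service, enter:
  ExecutionContextOptions.Spark.StreamingDropEmptyBatches=true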
