Sizing Guidelines and Performance Tuning for Big Data Streaming 10.2.1

3. Tune the Spark Engine

To optimize Big Data Streaming performance, tune the Spark parameters. To tune the parameters globally, configure the Hadoop connection advanced properties for the Spark engine in the Developer tool or the Administrator tool. To tune the parameters for specific mappings, configure the execution parameters of the streaming mapping run-time properties in the Developer tool.
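In the Hadoop connection advanced properties, each Spark parameter is entered as a name-value pair. The following entry is illustrative only; adjust the value for your deployment:
  spark.executor.cores=4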
You can configure the following parameters based on the input data rate, mapping complexity, and concurrency of mappings:
spark.executor.cores
The number of cores that each executor process uses to run tasklets.
Recommended value: Specify 3 to 4 cores for each executor. Specifying a higher number of cores might lead to performance degradation.
spark.executor.instances
The number of executor processes that the Spark engine uses to run tasklets.
Configure the number of executor instances based on the following deployment types:
  • Sandbox deployment. 4
  • Small deployment. 14
  • Medium deployment. 27
  • Large deployment. 262
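For example, a medium deployment that follows these guidelines and the recommendation of 4 cores for each executor might use the following settings. The pairing is illustrative; adjust it to your cluster capacity:
  spark.executor.instances=27
  spark.executor.cores=4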
For a mapping with multiple targets, you can configure the spark.executor.instances and spark.executor.cores parameters such that the product of their values equals the total number of cores required to process the targets in the mapping. For example, consider a mapping with two targets, Target1 and Target2. If Target1 requires 10 cores and Target2 requires 15 cores, the mapping requires a total of 25 cores. Configure the spark.executor.instances and spark.executor.cores parameters so that the product of their values equals 25.
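Continuing the example, one hypothetical combination whose product equals 25 is 5 executor instances with 5 cores each:
  spark.executor.instances=5
  spark.executor.cores=5
Note that 5 cores for each executor slightly exceeds the 3 to 4 cores recommended above. An alternative that stays within that guideline is 7 instances with 4 cores each, which provides 28 cores.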
spark.executor.memory
The amount of memory that each executor process uses to run tasklets.
Recommended value: Specify a value of 8 GB.
spark.driver.memory
The driver process memory that the Spark engine uses to run mapping jobs.
Recommended value: Specify a value of 8 GB.
spark.driver.cores
The number of cores to use for each driver process.
Recommended value: Specify 8 cores.
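Taken together, the recommended executor memory, driver memory, and driver core values can be entered as the following name-value pairs:
  spark.executor.memory=8G
  spark.driver.memory=8G
  spark.driver.cores=8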
spark.sql.shuffle.partitions
The number of partitions that the Spark engine uses to shuffle data to process joins or aggregations in a mapping job.
Recommended value: Specify a value equal to the total number of executor cores allocated, up to a maximum of 200.
Configure the partitions based on the following deployment types:
  • Sandbox deployment. 16
  • Small deployment. 56
  • Medium deployment. 108
  • Large deployment. 200
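These values equal the total executor cores implied by the deployment sizes listed earlier, assuming 4 cores for each executor. For example, a medium deployment provides 27 x 4 = 108 executor cores, and a large deployment exceeds the cap, so it uses the maximum of 200. A medium deployment would therefore set:
  spark.sql.shuffle.partitions=108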
spark.kryo.registrationRequired
Indicates whether registration with Kryo is required.
Recommended value: True
spark.kryo.classesToRegister
The comma-separated list of custom class names to register with Kryo if you use Kryo serialization.
Specify the following value for all deployment types:
org.apache.spark.sql.catalyst.expressions.GenericRow, [Ljava.lang.Object;, org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema, org.apache.spark.sql.types.StructType, [Lorg.apache.spark.sql.types.StructField;, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StringType$, org.apache.spark.sql.types.Metadata, scala.collection.immutable.Map$EmptyMap$, [Lorg.apache.spark.sql.catalyst.InternalRow;, scala.reflect.ClassTag$$anon$1, java.lang.Class
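For example, the two Kryo parameters can be entered together as follows. The class list is abbreviated here for readability; specify the full list shown above:
  spark.kryo.registrationRequired=true
  spark.kryo.classesToRegister=org.apache.spark.sql.catalyst.expressions.GenericRow,[Ljava.lang.Object;,...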
infaspark.lookup.persist.enabled
Enables lookup data caching.
To increase the performance of a cached Hive lookup, set this parameter to true.
infaspark.lookup.repartition.partitions
Partitions lookup data for better performance.
To increase the performance of a cached Hive lookup, set this parameter to any value greater than 1.
infaspark.lookup.hbase.batchsize
The number of records to fetch at a time from the HBase lookup database.
Default value: 1500
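As an illustrative sketch, a mapping that performs cached Hive lookups and HBase lookups might combine the lookup parameters as follows. The repartition value of 8 is an arbitrary example; any value greater than 1 partitions the lookup data:
  infaspark.lookup.persist.enabled=true
  infaspark.lookup.repartition.partitions=8
  infaspark.lookup.hbase.batchsize=1500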
ExecutionContextOptions.Spark.StreamingDropEmptyBatches
To prevent Spark from creating jobs and tasks when there are no messages to be processed in a batch, set this parameter to true. You can configure this property on the Custom Properties tab of the Data Integration Service.
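For example, on the Custom Properties tab of the Data Integration Service, enter:
  ExecutionContextOptions.Spark.StreamingDropEmptyBatches=true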
