Sizing Guidelines and Performance Tuning for Big Data Streaming 10.2.1


Lookup Transformation

Consider the following restrictions when you optimize the Lookup transformation to run on the Spark engine:

Data skew: Data skew refers to the uneven distribution of data. Spark engine optimization might lead to data skew among executors because of the location of the data. To avoid this issue, set the spark.shuffle.reduceLocality.enabled property to false.
When the spark.shuffle.reduceLocality.enabled property is set to false, shuffle behavior changes: Spark does not compute locality preferences for reduce tasks, so the scheduler places those tasks without regard to where the map output resides.
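
The setting above can be sketched as a plain name=value entry. This is a minimal illustration only: the helper name render_spark_properties is hypothetical, and where the resulting line is entered (for example, in the Spark advanced properties of the connection or the mapping run-time properties) is an assumption that depends on your environment.

```python
def render_spark_properties(props):
    """Render a dict of Spark properties as sorted name=value lines.

    Hypothetical helper for illustration; it is not part of any product API.
    """
    lines = []
    for name, value in sorted(props.items()):
        # Spark expects lowercase true/false for boolean properties.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{name}={value}")
    return lines


# Disable reduce-task locality preferences to reduce skew among executors.
skew_props = {"spark.shuffle.reduceLocality.enabled": False}
skew_lines = render_spark_properties(skew_props)
print("\n".join(skew_lines))
```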

Inefficient lookup partitioning: Mapping performance might degrade because of inefficient lookup partitioning and caching. To configure cache partitioning for a Lookup transformation, perform the following steps:
1. Set the value of the infaspark.lookup.repartition.partitions property equal to the number of source topic partitions. For example, if a Kafka topic has 18 partitions, set the value of the infaspark.lookup.repartition.partitions property to 18.
2. Set the value of the infaspark.lookup.persist.enabled property to true.
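
The two steps above can be sketched as a small helper that derives both property values from the partition count of the source topic. The function name lookup_cache_properties is hypothetical, and the sketch assumes that you already know the partition count; it does not query Kafka.

```python
def lookup_cache_properties(source_topic_partitions):
    """Build the lookup cache partitioning properties described above.

    Hypothetical helper: the partition count is supplied by the caller.
    """
    is_int = isinstance(source_topic_partitions, int)
    if not is_int or isinstance(source_topic_partitions, bool):
        raise ValueError("partition count must be an integer")
    if source_topic_partitions < 1:
        raise ValueError("partition count must be at least 1")
    return {
        # Step 1: match the number of source topic partitions.
        "infaspark.lookup.repartition.partitions": str(source_topic_partitions),
        # Step 2: enable persistence of the lookup cache.
        "infaspark.lookup.persist.enabled": "true",
    }


# Example from the text: a Kafka topic with 18 partitions.
lookup_props = lookup_cache_properties(18)
for name, value in lookup_props.items():
    print(f"{name}={value}")
```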

Data duplication: Avoid data duplication in the lookup source. If the lookup data is unique, configure the Lookup transformation to return all rows on multiple matches.
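
One way to confirm that a lookup source is free of duplicates is to count the lookup keys in a sample of the data before you run the mapping. The following is a minimal sketch under stated assumptions: the function name find_duplicate_keys, the sample rows, and the customer_id key column are all hypothetical, and the check runs on rows already loaded into memory rather than on the lookup source itself.

```python
from collections import Counter


def find_duplicate_keys(rows, key_fields):
    """Return each lookup key that occurs more than once, with its count.

    rows is a list of dicts; key_fields names the lookup condition columns.
    """
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample of lookup source rows.
sample_rows = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
    {"customer_id": 1, "region": "EMEA"},
]
duplicates = find_duplicate_keys(sample_rows, ["customer_id"])
print(duplicates)
```

An empty result suggests that the sampled lookup data is unique on the chosen key columns.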