Data Integration Performance Tuning

Optimizing Joiner transformations

Joiner transformations can slow performance because they need additional space at run time to hold intermediary results.

Use the following guidelines to improve performance of a Joiner transformation:

Designate the master group as the source with fewer duplicate key values: When Data Integration processes a sorted Joiner transformation, it caches rows for one hundred unique keys at a time. If the master group contains many rows with the same key value, Data Integration must cache more rows, which can slow performance.
Designate the master group as the source with fewer rows: The Joiner transformation compares each row of the detail group against the master group. The fewer rows in the master group, the fewer join comparisons occur, which speeds up the join.
Perform joins in a database or data warehouse when possible: Performing a join in a database is faster than performing a join in the mapping. The type of database join you use can affect performance. Normal joins are faster than outer joins and result in fewer rows. In some cases, you cannot perform the join in the database, such as when you join tables from two different databases or flat file systems.
Join sorted data when possible: To improve mapping performance, configure the Joiner transformation to use sorted input. When you configure the Joiner transformation to use sorted data, Data Integration improves performance by minimizing disk input and output. You see the greatest performance improvement when you work with large data sets. For an unsorted Joiner transformation, designate the source with fewer rows as the master group.
Join the largest data set last: In a mapping that has multiple Joiner transformations, join the largest data set in the most downstream transformation.
Set the broadcast join threshold: Mappings in advanced mode perform a broadcast join for data sets that are smaller than the value set in the Spark session property spark.sql.autoBroadcastJoinThreshold. The mapping broadcasts the data set to all Spark executors across all advanced cluster nodes, which reduces shuffle overhead and improves performance.

Run CLAIRE Tuning to get a recommendation for the broadcast join threshold.
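
The size check behind the broadcast join threshold can be pictured with a small sketch. This is an illustration only, not how Data Integration or Spark implements the decision: the helper name choose_join_strategy and its parameters are hypothetical, and the real Spark planner also weighs join type, hints, and table statistics. The default value shown is the one documented for spark.sql.autoBroadcastJoinThreshold in Apache Spark (10485760 bytes, with -1 disabling automatic broadcast); treating a data set exactly at the threshold as eligible is an assumption of this sketch.

```python
# Illustrative only: mimics the size comparison behind a broadcast join decision.
# Apache Spark documents 10485760 bytes (10 MB) as the default for
# spark.sql.autoBroadcastJoinThreshold; -1 disables automatic broadcasting.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join_strategy(estimated_size_bytes, threshold_bytes=DEFAULT_THRESHOLD_BYTES):
    """Return "broadcast" or "shuffle" for the smaller side of a join.

    Hypothetical helper: the name and signature are assumptions made
    for this example, not part of any product API.
    """
    if threshold_bytes < 0:
        # A negative threshold turns automatic broadcasting off.
        return "shuffle"
    if estimated_size_bytes <= threshold_bytes:
        # Small enough to copy to every executor, so no shuffle is needed.
        return "broadcast"
    return "shuffle"


# Example: a 5 MB lookup table against the default and a disabled threshold.
small_table_bytes = 5 * 1024 * 1024
strategy_default = choose_join_strategy(small_table_bytes)
strategy_disabled = choose_join_strategy(small_table_bytes, threshold_bytes=-1)
```

Raising the threshold lets larger data sets qualify for broadcast, at the cost of more memory on each executor, which is why a tuned recommendation is preferable to an arbitrary value.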
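
The guidelines on choosing the master group for a sorted join can be illustrated with a simplified merge join. The sketch below is an assumption-laden model, not the Joiner transformation's actual algorithm: the function sorted_join is hypothetical, it caches master rows for one key at a time rather than one hundred unique keys, and it supports only a normal (inner) join on inputs that are already sorted by key. It reports the largest number of master rows held in the cache at once.

```python
from itertools import groupby
from operator import itemgetter


def sorted_join(master, detail):
    """Inner-join two lists of (key, value) pairs that are already sorted by key.

    Returns (joined_rows, peak_cached_master_rows). Hypothetical helper
    written for illustration only.
    """
    master_groups = groupby(master, key=itemgetter(0))
    joined = []
    peak_cache = 0
    m_key, cache = None, []
    exhausted = False

    for d_key, d_val in detail:
        # Advance the master side until its current key reaches the detail key.
        while not exhausted and (m_key is None or m_key < d_key):
            try:
                m_key, rows = next(master_groups)
                # All master rows that share this key are cached together.
                cache = [value for _, value in rows]
                peak_cache = max(peak_cache, len(cache))
            except StopIteration:
                exhausted = True
                cache = []
        if not exhausted and m_key == d_key:
            joined.extend((d_key, m_val, d_val) for m_val in cache)
    return joined, peak_cache


# Sample data (invented for the example): customers have unique keys,
# orders repeat key 1 three times.
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "o1"), (1, "o2"), (1, "o3"), (2, "o4")]

rows_a, peak_a = sorted_join(master=customers, detail=orders)
rows_b, peak_b = sorted_join(master=orders, detail=customers)
```

Both calls produce four joined rows, but the peak cache is 1 row when customers is the master and 3 rows when orders is the master, which mirrors the advice to pick the source with fewer duplicate key values as the master group.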
