Data Quality Performance Tuning Guide

Match

To optimize performance in a Match transformation, you must understand the concepts that underpin match analysis.

Single-Source Field Matching

Single-source field match analysis compares data from every record in a data set with every other record. The analysis generates a numerical score for every pair of records that it compares. To reduce processing time, the transformation uses one or more Key fields to organize the input records into groups prior to match analysis. You select the Key fields. The number of record pairs created depends on the number of records within a group.
The number of pairs in a group can be calculated by the following formula:
n x (n - 1) / 2
where n is the number of records in the group.
Group size has a significant impact on performance. For example, applying the formula above to a group of 2,000 records produces 1,999,000 record pairs. Applying the formula to a group of 5,000 records produces 12,497,500 record pairs, or over six times as many.
For optimal performance, avoid groups of more than 10,000 records. Group sizes should be meaningful, so that you do not miss possible matches, but they should not be larger than necessary.
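The growth in record pairs can be checked with a short calculation. The following is a minimal sketch, assuming the pair count for a group of n records is n x (n - 1) / 2, which reproduces the figures quoted above. The function name and the constant are illustrative and are not part of the product.

```python
def pair_count(n: int) -> int:
    """Return the number of record pairs compared within one group of n records."""
    if n < 0:
        raise ValueError("group size cannot be negative")
    # Every record is paired once with every other record in the group.
    return n * (n - 1) // 2


# Guideline from the text: groups above this size are not recommended.
RECOMMENDED_MAX_GROUP = 10_000

for size in (2_000, 5_000, RECOMMENDED_MAX_GROUP):
    print(f"{size:>6} records -> {pair_count(size):>12,} pairs")
```

Because the pair count grows with the square of the group size, halving a group size cuts the comparisons for that group to roughly a quarter.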
If you perform matching on a large data set, the Match transformation may not be able to store all comparison pairs in memory, and some pairs will be written to disk. The Cache Size property on the transformation determines the amount of memory available.
The following image shows the property:
You can view the Cache Size property on the Match Output tab in the Match transformation.
A cache size value below 65536 is measured in megabytes, and any higher value is measured in bytes.
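The unit rule can be illustrated with a small helper. This is a sketch only: the function name is hypothetical, a megabyte is assumed to be 1024 x 1024 bytes, and a value of exactly 65536 is assumed to be read as bytes, because the text does not state either point.

```python
MB_THRESHOLD = 65536          # values below this are read as megabytes, per the text
BYTES_PER_MB = 1024 * 1024    # assumption: binary megabytes


def cache_size_in_bytes(value: int) -> int:
    """Interpret a Cache Size property value and return the size in bytes."""
    if value < 0:
        raise ValueError("cache size cannot be negative")
    if value < MB_THRESHOLD:
        # Small values are a count of megabytes.
        return value * BYTES_PER_MB
    # Larger values are already a count of bytes (assumed to include 65536 itself).
    return value


print(cache_size_in_bytes(400))          # 400 MB expressed in bytes
print(cache_size_in_bytes(200_000_000))  # already in bytes, returned unchanged
```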
The Cache Directory property identifies a storage area for the temporary files that match analysis creates. To improve performance, configure the cache directory on the smallest, fastest disk available.
Where possible, do not use pass-through ports on the Match transformation, especially in large data sets. The pass-through ports take up valuable memory or disk space. To reunite the ports with the matched records, you can use a Joiner transformation that reads the sequence ID values.
The Match transformation can generate Link Score and Driver Score values that represent the degrees of similarity between different pairs of records in a cluster of matching records.
For optimum performance, choose Link Scores and not Driver Scores. Choosing Driver Scores will greatly decrease the performance of your match mapping, as Driver Scores write more information to disk.
Selecting the Filter Exact Match property improves match performance considerably if the data contains a large number of exactly matched pairs. Otherwise, the option has a negligible performance impact.
The following image shows the Filter Exact Match property:
You can view the Filter Exact Match property on the Advanced tab in the Match transformation.

Dual-Source Field Matching

Many of the principles of single-source matching also apply in dual-source matching. However, in dual-source matching, the Match transformation compares each record in one data set with every record in the other data set.
The following formula calculates the number of pairs:
n x m
where n is the number of records in group 1 in data set 1 and m is the number of records in group 1 in data set 2.
For example, if data set 1 includes a group with 3,000 rows and the same group exists in data set 2 with 2,000 rows, match analysis will generate 6,000,000 record pairs.
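The dual-source count can be extended to several groups with a short sketch. It assumes that records are compared only when the same group key exists in both data sets, and that the pair count for a group is the product of its sizes in each data set. The function name and the dictionaries of group sizes are illustrative.

```python
def dual_source_pairs(groups_1: dict, groups_2: dict) -> int:
    """Return the total record pairs across the groups shared by two data sets.

    Each argument maps a group key to the number of records in that group.
    """
    # Assumption: a group produces comparisons only if it exists in both data sets.
    shared_keys = groups_1.keys() & groups_2.keys()
    return sum(groups_1[key] * groups_2[key] for key in shared_keys)


data_set_1 = {"group 1": 3_000}
data_set_2 = {"group 1": 2_000}
print(f"{dual_source_pairs(data_set_1, data_set_2):,} record pairs")
```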

Identity Matching

The use of groups in identity matching is optional but recommended. As is the case in field matching, very large group sizes will result in considerably slower performance.
To significantly improve identity matching performance, increase the number of execution instances on the transformation. When you increase the number of execution instances, the Data Integration Service splits the workload over multiple threads. The availability of execution instances depends on the number of processor cores on the Data Integration Service machine.
The performance improvement will not be linear. The complete matching process cannot be split over multiple threads. Part of the process must be completed in a single thread.
For optimal performance with identity matching, set your execution instances to the number of processor cores minus 1.
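The guidance above can be sketched in a few lines. The first function applies the "cores minus 1" rule. The second uses Amdahl's law, a general model that is not taken from the product documentation, to show why the improvement is not linear when part of the work stays on a single thread. The function names and the serial fraction are hypothetical, and os.cpu_count() reports the cores of the machine that runs the script, which must be the Data Integration Service machine for the result to apply.

```python
import os
from typing import Optional


def recommended_instances(cores: Optional[int] = None) -> int:
    """Return the number of processor cores minus 1, with a floor of 1."""
    if cores is None:
        # Falls back to 1 if the core count cannot be determined.
        cores = os.cpu_count() or 1
    return max(1, cores - 1)


def estimated_speedup(serial_fraction: float, instances: int) -> float:
    """Estimate speedup with Amdahl's law for a given single-threaded fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / instances)


instances = recommended_instances(8)
# Hypothetical example: 20% of the process cannot be split across threads.
print(instances, round(estimated_speedup(0.2, instances), 2))
```

With any nonzero serial fraction, the estimated speedup stays below the number of execution instances, which matches the behavior described above.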

Disk Space Required

Consider the following factors for disk space sizing in field and identity matching:
Field matching
The following formula is a guide to the quantity of disk space in MB required to run field matching on a data set, generating only the link score:
d x n x 0.0000025
where d = the sum of the Match transformation input port precisions, n = the number of records, and 0.0000025 = the memory required per character.
If the mapping has dual sources, n in the above formula represents the total of the two sources.
The result above will double when the driver score is required.
Identity matching
The following formula is a guide to the quantity of disk space in MB required to run identity matching on a data set, generating only the link score:
d x n x 0.000005
where d = the sum of the Match transformation input port precisions, n = the number of records, and 0.000005 = the memory required per character.
If the mapping has dual sources, n in the above formula represents the total of the two sources.
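The sizing guidance can be turned into a rough estimator. This sketch assumes that the estimate is the product of the three quantities defined above (d, n, and the per-character constant), which is an interpretation of the definitions and not a confirmed formula. The function names are illustrative. Treat the results as a guide only.

```python
FIELD_MB_PER_CHAR = 0.0000025     # per-character constant quoted for field matching
IDENTITY_MB_PER_CHAR = 0.000005   # per-character constant quoted for identity matching


def field_match_disk_mb(d: int, n: int, driver_score: bool = False) -> float:
    """Estimate disk space in MB for field matching.

    d is the sum of the input port precisions and n is the number of records
    (the total of both sources in a dual-source mapping).
    """
    estimate = d * n * FIELD_MB_PER_CHAR  # assumed form: d x n x constant
    # The text states that the result doubles when the driver score is required.
    return estimate * 2 if driver_score else estimate


def identity_match_disk_mb(d: int, n: int) -> float:
    """Estimate disk space in MB for identity matching with link scores only."""
    return d * n * IDENTITY_MB_PER_CHAR  # assumed form: d x n x constant


# Hypothetical example: port precisions total 200 characters, 1,000,000 records.
print(round(field_match_disk_mb(200, 1_000_000), 1))
print(round(field_match_disk_mb(200, 1_000_000, driver_score=True), 1))
print(round(identity_match_disk_mb(200, 1_000_000), 1))
```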

Match Performance Analysis

The Match Performance Analysis feature, available as a right-click option on the Match transformation, is valuable in estimating how long a mapping with a Match transformation may take to complete.
The report presents a profile of the data and includes a table that describes the composition of the groups.
The Developer tool shows the first 16,000 groups. To see the full makeup of the data, export the report to a file.
The following image shows a match performance analysis report:
The match performance analysis report shows a series of measurements for the match analysis operation, including the maximum and minimum group sizes in the operation.
Modify the minimum and maximum group sizes to evaluate the likely effect of different group sizes on the mapping performance.
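A similar group profile can be approximated outside the Developer tool when you want to test a candidate Key field before you build the mapping. The following sketch is not the product's report. It assumes that you can extract the key value of each record into a list, and that single-source pairs per group follow n x (n - 1) / 2. All names are illustrative.

```python
from collections import Counter


def profile_groups(keys, max_group_size=10_000):
    """Summarize the groups that a candidate Key field would create.

    keys is an iterable with one key value per record.
    """
    sizes = Counter(keys)
    return {
        "groups": len(sizes),
        "min_size": min(sizes.values(), default=0),
        "max_size": max(sizes.values(), default=0),
        # Single-source comparisons summed over every group.
        "total_pairs": sum(n * (n - 1) // 2 for n in sizes.values()),
        # Groups that exceed the chosen size limit.
        "oversized": sorted(key for key, n in sizes.items() if n > max_group_size),
    }


# Hypothetical key values, for example the first letters of a surname field.
sample_keys = ["SMI"] * 3 + ["JON"] * 2 + ["LEE"]
print(profile_groups(sample_keys, max_group_size=2))
```

Running the profile with different candidate keys shows how each choice changes the largest group and the total number of comparisons.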
