When you run a mapping with a Deduplicate transformation, the transformation analyzes the identity data in each input record. The transformation generates a set of percentage scores that represent the degrees of similarity between the input records. If two or more records match one another with scores that exceed a given threshold, the transformation considers the records to be duplicates.
The deduplicate asset that you add to the transformation specifies the comparison criteria for the deduplication operation, including the threshold score that duplicate records must meet.
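The threshold idea can be illustrated with a minimal sketch. It is not the transformation's actual algorithm: the `similarity` function is a stand-in built on Python's `difflib`, the names `find_duplicate_pairs` and `threshold` are hypothetical, and treating a score equal to the threshold as a match is an assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a percentage similarity between two strings (stand-in scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_pairs(records, threshold):
    """Return (i, j, score) for each pair of records whose score meets the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        # Assumption: a score equal to the threshold counts as a match.
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


names = ["John Smith", "Jon Smith", "Maria Garcia"]
pairs = find_duplicate_pairs(names, threshold=85.0)
# Only the first two names pair up at this threshold.
print(pairs)
```

In the product, the comparison criteria and the threshold come from the deduplicate asset rather than from code like this.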
Consolidation is an optional process that the deduplicate asset can specify for the transformation. During consolidation, the transformation evaluates the sets of matching records that the deduplication process identifies. The transformation selects or constructs a preferred version of the records in each set.
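The difference between selecting and constructing a preferred record can be sketched as follows. Both rules shown here (most populated fields, first non-empty value per field) are hypothetical examples chosen for illustration. The rules that a real deduplicate asset applies are defined in the asset itself.

```python
def select_preferred(record_set):
    """Select the existing record with the most populated fields (hypothetical rule)."""
    return max(record_set, key=lambda r: sum(1 for v in r.values() if v))


def construct_preferred(record_set):
    """Construct a new record from the first non-empty value of each field (hypothetical rule)."""
    merged = {}
    for record in record_set:
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged


# A set of records that deduplication has already identified as matching.
matching_set = [
    {"name": "Jon Smith", "email": "", "phone": ""},
    {"name": "John Smith", "email": "jsmith@example.com", "phone": ""},
    {"name": "", "email": "", "phone": "555-0100"},
]

selected = select_preferred(matching_set)
constructed = construct_preferred(matching_set)
print(selected)
print(constructed)
```

Selection returns one of the input records unchanged, while construction can combine values that no single input record holds together.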
A Data Quality user configures the deduplication and consolidation processes in the deduplicate asset. For more information about the criteria that the asset defines, contact the Data Quality user.
Rules and guidelines for deduplication and consolidation
When you add a Deduplicate transformation to a mapping, consider the following rules and guidelines:
Mapping fields for identity analysis
The deduplicate asset that you add to the transformation specifies a type of identity, such as a person name or an organization name. The asset identifies the identity type as the objective of the deduplication operations. The type of identity on the asset defines the types of information that the transformation expects to find in the input fields.
You must map the appropriate input fields on the transformation to the target fields that the transformation indicates. You can also map the optional input fields to other fields.
Scores and thresholds
The Deduplicate transformation calculates a score for each possible pair of records in the input data. The transformation returns the scores for the records within each set of matching duplicate records. It does not return the scores for records that do not belong in the same set.
The transformation represents the relationships between the records in a matching set as a link score and a driver score.
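The following sketch illustrates the idea of scoring every pair but reporting only the scores inside a matching set. It assumes that sets are formed by linking any two records whose score meets the threshold, which may differ from how the transformation builds its sets, and it does not model link scores or driver scores. All function names and the sample scores are hypothetical.

```python
def build_match_sets(n_records, scored_pairs, threshold):
    """Group record indexes into sets by linking pairs that meet the threshold."""
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)

    sets = {}
    for r in range(n_records):
        sets.setdefault(find(r), []).append(r)
    # Keep only sets that contain at least two records.
    return [members for members in sets.values() if len(members) > 1]


def scores_within_sets(match_sets, scored_pairs):
    """Keep only the scores for pairs whose records share a matching set."""
    set_of = {r: idx for idx, members in enumerate(match_sets) for r in members}
    return [
        (i, j, s)
        for i, j, s in scored_pairs
        if i in set_of and j in set_of and set_of[i] == set_of[j]
    ]


# Hypothetical scores for every pair among four records.
scored_pairs = [
    (0, 1, 92.0), (0, 2, 40.0), (0, 3, 20.0),
    (1, 2, 35.0), (1, 3, 25.0), (2, 3, 88.0),
]
match_sets = build_match_sets(4, scored_pairs, threshold=85.0)
reported = scores_within_sets(match_sets, scored_pairs)
print(match_sets)
print(reported)
```

Six scores are calculated, but only the two that connect records in the same set are reported.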
Sequence ID and group key fields
On the Field Mapping tab, the transformation adds a group key field and a sequence ID field to the fields that the asset specifies. The group key field is mandatory. The sequence ID field is mandatory in advanced mode.
The group key is a data value that allows the transformation to sort the input records into subsets and to perform discrete duplicate analyses on each subset. When you select a suitable group key, you reduce the time that the mapping takes to run without reducing the quality of the mapping results. If you do not want to divide the input records into groups, add a field to the input data that contains a single or constant value and select the field as the group key.
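A short sketch shows why a group key reduces the run time. It counts the pairwise comparisons needed when records are compared only within their own group. The field names `postal_code` and `all_records` and the helper names are hypothetical, and the count is a simplification of whatever work the transformation actually performs.

```python
from collections import defaultdict


def group_by_key(records, key_field):
    """Sort records into subsets that share the same group key value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_field]].append(record)
    return groups


def count_comparisons(groups):
    """Count the record pairs compared when analysis stays within each group."""
    return sum(len(g) * (len(g) - 1) // 2 for g in groups.values())


records = [
    {"name": "John Smith", "postal_code": "10001", "all_records": 1},
    {"name": "Jon Smith", "postal_code": "10001", "all_records": 1},
    {"name": "J. Smith", "postal_code": "10001", "all_records": 1},
    {"name": "Maria Garcia", "postal_code": "94105", "all_records": 1},
    {"name": "Maria Garcia", "postal_code": "94105", "all_records": 1},
    {"name": "M. Garcia", "postal_code": "94105", "all_records": 1},
]

# Grouping on postal code: two groups of three records each.
grouped = count_comparisons(group_by_key(records, "postal_code"))
# Grouping on a constant field: one group that holds every record.
ungrouped = count_comparisons(group_by_key(records, "all_records"))
print(grouped, ungrouped)  # 6 15
```

The saving grows quickly with the size of the data set, because the number of pairs in a group rises with the square of the group size. A group key is only suitable if true duplicates are likely to share the same key value, since records in different groups are never compared.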
The sequence ID values determine the order in which the transformation reads the input records. If your input records do not contain a sequence ID field, the transformation reads the records in the order in which they appear in the input data set.
Metadata fields
On the Output Fields tab, the transformation adds fields that display the score values for pairs of matching records. The fields also identify the set of matching records to which each record belongs. If the deduplicate asset specifies a consolidation process, the metadata fields specify a preferred record for each record set. The transformation identifies the preferred record as the