Transformations

Deduplication and consolidation operations

When you run a mapping with a Deduplicate transformation, the transformation analyzes the identity data in each input record. The transformation generates a set of percentage scores that represent the degrees of similarity between the input records. If two or more records match one another with scores that exceed a given threshold, the transformation considers the records to be duplicates.
The deduplicate asset that you add to the transformation specifies the comparison criteria for the deduplication operation, including the threshold score that duplicate records must meet.
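The threshold idea can be sketched in a few lines of Python. This is only an illustration of pairwise scoring against a threshold, not the algorithm that the Deduplicate transformation uses: the similarity measure (`difflib.SequenceMatcher`), the sample records, the `THRESHOLD` value, and the function names are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records keyed by record ID (illustrative only).
records = {
    1: "John Smith",
    2: "Jon Smith",
    3: "Mary Jones",
}

# Assumed percentage threshold; in the product, the deduplicate asset sets it.
THRESHOLD = 85.0


def similarity(a: str, b: str) -> float:
    """Return a percentage score, using difflib as a stand-in metric."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_pairs(recs, threshold):
    """Score every pair of records and keep the pairs that reach the threshold."""
    pairs = []
    for (id_a, a), (id_b, b) in combinations(recs.items(), 2):
        score = similarity(a, b)
        # Treating the boundary as inclusive is an assumption of this sketch.
        if score >= threshold:
            pairs.append((id_a, id_b, round(score, 1)))
    return pairs


duplicate_pairs = find_duplicate_pairs(records, THRESHOLD)
print(duplicate_pairs)
```

In this sketch only records 1 and 2 are reported as duplicates, because the other pairs score well below the assumed threshold.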
Consolidation is an optional process that the deduplicate asset can specify for the transformation. During consolidation, the transformation evaluates the sets of matching records that the deduplication process identifies. The transformation selects or constructs a preferred version of the records in each set.
A Data Quality user configures the deduplication and consolidation processes in the deduplicate asset. For more information about the criteria that the asset defines, contact the Data Quality user.
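The consolidation step described above, in which a preferred version is either selected from or constructed out of a set of matching records, might look like the following Python sketch. The rule used here (prefer the record with the most populated fields, then fill its gaps from the other records) is an assumption chosen for illustration. The actual consolidation rules come from the deduplicate asset, and the field names and function names are hypothetical.

```python
EMPTY = (None, "")


def completeness(record: dict) -> int:
    """Count the populated fields in a record."""
    return sum(1 for value in record.values() if value not in EMPTY)


def select_survivor(match_set):
    """Select the most complete record; the first record wins a tie."""
    return max(match_set, key=completeness)


def construct_survivor(match_set):
    """Start from the selected record and fill its empty fields from the others."""
    survivor = dict(select_survivor(match_set))
    for record in match_set:
        for field, value in record.items():
            if survivor.get(field) in EMPTY and value not in EMPTY:
                survivor[field] = value
    return survivor


# One hypothetical set of matching records.
match_set = [
    {"name": "J. Smith", "email": "", "phone": ""},
    {"name": "John Smith", "email": "jsmith@example.com", "phone": ""},
    {"name": "Jon Smith", "email": "", "phone": "555-0100"},
]

print(select_survivor(match_set))
print(construct_survivor(match_set))
```

Selecting returns an existing record unchanged, whereas constructing can combine values from several records in the set.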

Rules and guidelines for deduplication and consolidation

When you add a Deduplicate transformation to a mapping, consider the following rules and guidelines:
Mapping fields for identity analysis
The deduplicate asset that you add to the transformation specifies a type of identity, such as a person name or an organization name. The asset identifies the identity type as the objective of the deduplication operations. The type of identity on the asset defines the types of information that the transformation expects to find in the input fields.
You must map the appropriate input fields on the transformation to the target fields that the transformation indicates. You can optionally map the optional input fields to other fields.
Scores and thresholds
The Deduplicate transformation calculates a score for each possible pair of records in the input data. The transformation returns the scores for the records within each set of matching duplicate records. It does not return the scores for records that do not belong in the same set.
The transformation represents the relationships between the records in a matching set as a link score and a driver score.
SequenceId and GroupKey fields
On the Field Mapping tab, the transformation adds a GroupKey field and a SequenceId field to the fields that the asset specifies. The GroupKey field is mandatory. The SequenceId field is mandatory in advanced mode.
A group key is a data value that allows the transformation to sort input records into subsets and to perform discrete duplicate analyses on each subset. When you select a suitable group key, you reduce the time that the mapping takes to run without reducing the quality of the mapping results. For more information about groups, see Groups in duplicate analysis.
Sequence ID values determine the order in which the transformation reads the input records. If your input records do not contain a field that can provide data to the SequenceId field, the transformation reads the records in the order in which they appear in the input data set.
Metadata fields
On the Output Fields tab, the transformation adds fields that display the score values for pairs of matching records. The fields also identify the set of matching records to which each record belongs. If the deduplicate asset specifies a consolidation process, the metadata fields specify a preferred record for each record set. The transformation identifies the preferred record as the survivor record.
Use the fields to understand the mapping results.
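To see why a group key can shorten the run time, consider the following Python sketch of the GroupKey and SequenceId guidance above. It sorts the records by sequence ID, splits them into subsets by group key, and generates candidate pairs only within each subset. The sample data, the two-letter group key values, and the function name are assumptions for illustration. The sketch does not reproduce how the transformation itself reads or compares records.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input records; the GroupKey values are illustrative.
records = [
    {"SequenceId": 3, "GroupKey": "SM", "name": "Jon Smith"},
    {"SequenceId": 1, "GroupKey": "SM", "name": "John Smith"},
    {"SequenceId": 2, "GroupKey": "JO", "name": "Mary Jones"},
    {"SequenceId": 4, "GroupKey": "JO", "name": "Marie Jones"},
    {"SequenceId": 5, "GroupKey": "SM", "name": "Jane Smythe"},
]


def candidate_pairs(recs):
    """Return the record pairs to compare, restricted to matching group keys."""
    groups = defaultdict(list)
    # Read the records in SequenceId order, then bucket them by GroupKey.
    for rec in sorted(recs, key=lambda r: r["SequenceId"]):
        groups[rec["GroupKey"]].append(rec)
    pairs = []
    for members in groups.values():
        pairs.extend(
            (a["SequenceId"], b["SequenceId"])
            for a, b in combinations(members, 2)
        )
    return pairs


pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2

print(f"{len(pairs)} comparisons with grouping, {all_pairs} without")
```

With five records, grouping cuts the number of comparisons from 10 to 4 in this example. The saving grows with the size of the data set, but records with different group keys are never compared, so the group key should keep likely duplicates in the same subset.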
