Deduplication and consolidation operations

When you run a mapping with a Deduplicate transformation, the transformation analyzes the identity data in each input record. The transformation generates a set of percentage scores that represent the degrees of similarity between the input records. If two or more records match one another with scores that exceed a given threshold, the transformation considers the records to be duplicates.
The deduplicate asset that you add to the transformation specifies the comparison criteria for the deduplication operation, including the threshold score that duplicate records must meet.
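The scoring and threshold step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the algorithm that the Deduplicate transformation uses: the similarity metric (`difflib.SequenceMatcher`), the function names, the sample names, and the threshold of 85 are all assumptions chosen for the example. The sketch treats a score that is equal to the threshold as a match.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two strings (illustrative metric only)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_pairs(records, threshold):
    """Score every pair of records and keep the pairs that meet the threshold."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity_percent(a, b)
        if score >= threshold:
            matches.append((i, j, round(score, 1)))
    return matches


# Hypothetical identity data; the threshold stands in for the value on the asset.
names = ["John Smith", "Jon Smith", "Mary Jones"]
pairs = find_duplicate_pairs(names, threshold=85.0)
print(pairs)
```

In this sample, only the first two names score high enough to count as duplicates of one another.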
Consolidation is an optional process that the deduplicate asset can specify for the transformation. During consolidation, the transformation evaluates the sets of matching records that the deduplication process identifies. The transformation selects or constructs a preferred version of the records in each set.
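To make the difference between selecting and constructing a preferred record concrete, the following Python sketch shows one possible version of each approach. The rules used here (most populated record, most common non-empty value per field) and all field names are hypothetical. The actual consolidation rules are whatever the deduplicate asset defines.

```python
from collections import Counter


def select_survivor(match_set):
    """Select the existing record with the most populated fields (hypothetical rule)."""
    return max(match_set, key=lambda rec: sum(1 for v in rec.values() if v))


def construct_survivor(match_set):
    """Build a record field by field from the most common non-empty value."""
    survivor = {}
    for field in match_set[0]:
        values = [rec[field] for rec in match_set if rec.get(field)]
        survivor[field] = Counter(values).most_common(1)[0][0] if values else ""
    return survivor


# One hypothetical set of matching records.
match_set = [
    {"name": "John Smith", "city": "Boston", "phone": ""},
    {"name": "John Smith", "city": "", "phone": "555-0100"},
    {"name": "Jon Smith", "city": "Boston", "phone": "555-0100"},
]
selected = select_survivor(match_set)
constructed = construct_survivor(match_set)
print(selected)
print(constructed)
```

Selection returns one of the input records unchanged, while construction can produce a record that does not appear in the input.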
A Data Quality user configures the deduplication and consolidation processes in the deduplicate asset. For more information about the criteria that the asset defines, contact the Data Quality user.

Rules and guidelines for deduplication and consolidation

When you add a Deduplicate transformation to a mapping, consider the following rules and guidelines:
Mapping fields for identity analysis
The deduplicate asset that you add to the transformation specifies a type of identity, such as a person name or an organization name. The asset identifies the identity type as the objective of the deduplication operations. The type of identity on the asset defines the types of information that the transformation expects to find in the input fields.
You must map the appropriate input fields on the transformation to the target fields that the transformation indicates. You can also map the optional input fields to other fields.
Scores and thresholds
The Deduplicate transformation calculates a score for each possible pair of records in the input data. The transformation returns the scores for the records within each set of matching duplicate records. It does not return the scores for records that do not belong in the same set.
The transformation represents the relationships between the records in a matching set as a link score and a driver score.
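The idea that scores are reported only for records in the same matching set can be illustrated with a small Python sketch. It assumes that sets are formed transitively from pairs that meet the threshold, which is a common clustering approach but is not confirmed here as the transformation's behavior. The pair scores are made-up sample values, the function names are hypothetical, and the sketch does not model link scores or driver scores.

```python
def build_match_sets(n_records, scored_pairs, threshold):
    """Union records whose pair score meets the threshold; return a set label per record."""
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the representative of x's set.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the chains short
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(j)] = find(i)
    return [find(x) for x in range(n_records)]


def scores_within_sets(scored_pairs, labels):
    """Keep only the scores for pairs whose records share a matching set."""
    return [(i, j, s) for i, j, s in scored_pairs if labels[i] == labels[j]]


# Made-up scores for every pair among four records.
scored_pairs = [
    (0, 1, 94.7), (0, 2, 88.0), (1, 2, 79.5),
    (0, 3, 12.0), (1, 3, 10.0), (2, 3, 15.0),
]
labels = build_match_sets(4, scored_pairs, threshold=85.0)
kept = scores_within_sets(scored_pairs, labels)
print(labels)
print(kept)
```

Under these assumptions, records 0, 1, and 2 form one set and record 3 stands alone, so every score that involves record 3 is dropped from the result.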
Sequence ID and group key fields
On the Field Mapping tab, the transformation adds a group key field and a sequence ID field to the fields that the asset specifies. The group key field is mandatory. The sequence ID field is mandatory in advanced mode.
The group key is a data value that allows the transformation to sort the input records into subsets and to perform discrete duplicate analyses on each subset. When you select a suitable group key, you reduce the time that the mapping takes to run without reducing the quality of the mapping results. If you do not want to divide the input records into groups, add a field to the input data that contains a single or constant value and select the field as the group key.
The sequence ID values determine the order in which the transformation reads the input records. If your input records do not contain a sequence ID field, the transformation reads the records in the order in which they appear in the input data set.
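The effect of the group key on the amount of comparison work can be sketched as follows. The field names `seq_id` and `group_key`, the sample records, and the function name are hypothetical, and the sketch only counts candidate pairs rather than scoring them. It orders the records by sequence ID, buckets them by group key, and pairs records only within a bucket.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs_by_group(records):
    """Sort by sequence ID, bucket by group key, and pair records only within a bucket."""
    groups = defaultdict(list)
    for rec in sorted(records, key=lambda r: r["seq_id"]):
        groups[rec["group_key"]].append(rec["seq_id"])
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical input that uses a postal code as the group key.
records = [
    {"seq_id": 3, "group_key": "02108", "name": "Jon Smith"},
    {"seq_id": 1, "group_key": "02108", "name": "John Smith"},
    {"seq_id": 2, "group_key": "94105", "name": "Mary Jones"},
    {"seq_id": 4, "group_key": "94105", "name": "M. Jones"},
]
grouped = candidate_pairs_by_group(records)

# A constant group key puts every record in one group, so all pairs are compared.
constant = candidate_pairs_by_group([{**r, "group_key": "ALL"} for r in records])
print(len(grouped), len(constant))
```

With the postal code as the group key, the four sample records produce two candidate pairs. With a constant group key, the same records produce six, which is every possible pair.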
Metadata fields
On the Output Fields tab, the transformation adds fields that display the score values for pairs of matching records. The fields also identify the set of matching records to which each record belongs. If the deduplicate asset specifies a consolidation process, the metadata fields specify a preferred record for each record set. The transformation identifies the preferred record as the survivor record.
Use the fields to understand the mapping results.
