Transformations

Deduplication and consolidation operations

When you run a mapping, the Deduplicate transformation generates a temporary index from the input records that it reads. The transformation analyzes the index to find pairs of similar records.
The transformation calculates a series of percentage scores that represent the degrees of similarity between the pairs of records that it finds. If two records match each other with a score that exceeds a given threshold, the transformation considers the records to be duplicates.
The deduplicate asset that you add to the transformation specifies the comparison criteria for the deduplication operation, including the threshold score that duplicate records must satisfy.
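The threshold test described above can be illustrated with a minimal Python sketch. It is not the transformation's actual algorithm: the `similarity` function uses Python's `difflib.SequenceMatcher` as a stand-in metric, the sketch compares every pair directly instead of building an index, and the function names and the threshold of 85 are hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a percentage score (0-100) for two strings.

    Stand-in metric only; the real comparison criteria come from
    the deduplicate asset.
    """
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_pairs(records, threshold):
    """Return (index_a, index_b, score) for each pair whose score exceeds the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score > threshold:
            pairs.append((i, j, round(score, 1)))
    return pairs


# Hypothetical input records and threshold.
names = ["John Smith", "Jon Smith", "Mary Jones"]
pairs = find_duplicate_pairs(names, threshold=85.0)
for i, j, score in pairs:
    print(names[i], "<->", names[j], score)
```

With this sample data, only the first two names score above the threshold, so they are the only pair reported as duplicates.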
Consolidation is an optional process that the deduplicate asset can specify for the transformation. During consolidation, the transformation evaluates the sets of matching records that the deduplication process identifies. The transformation selects or constructs a preferred version of the records in each matching set.
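As a rough illustration of the two consolidation outcomes, the following Python sketch either selects an existing record from a matching set or constructs a new one. The rules shown here (prefer the record with the most populated fields, or take each field from the first record that has a value) are assumptions for the example. The actual rules are whatever the deduplicate asset specifies.

```python
from typing import Dict, List

Record = Dict[str, str]


def populated_fields(record: Record) -> int:
    """Count the fields that contain a non-blank value."""
    return sum(1 for value in record.values() if value.strip())


def select_survivor(cluster: List[Record]) -> Record:
    """Select the most complete record; ties go to the earliest record."""
    return max(cluster, key=populated_fields)


def construct_survivor(cluster: List[Record]) -> Record:
    """Construct a record, taking each field from the first record that has a value."""
    merged = {field: "" for record in cluster for field in record}
    for record in cluster:
        for field, value in record.items():
            if value.strip() and not merged[field]:
                merged[field] = value
    return merged


# Hypothetical matching set of three records.
cluster = [
    {"name": "Jon Smith", "phone": "", "city": "Boston"},
    {"name": "John Smith", "phone": "555-0100", "city": "Boston"},
    {"name": "John Smith", "phone": "", "city": ""},
]
selected = select_survivor(cluster)
constructed = construct_survivor(cluster)
print(selected)
print(constructed)
```

Selection returns one of the input records unchanged, while construction can combine values from several records in the set.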
A Data Quality user configures the deduplication and consolidation processes in the deduplicate asset. For more information about the criteria that the asset defines, contact the Data Quality user.

Rules and guidelines for deduplication and consolidation

When you add a Deduplicate transformation to a mapping, consider the following rules and guidelines:
Mapping fields for identity analysis
The deduplicate asset that you add to the transformation specifies a type of identity, such as a person name or an organization name. The asset identifies the identity type as the objective of the deduplication operation. The type of identity defines the types of information that the transformation expects to find in the index.
You must map the appropriate input fields on the transformation to the target fields that the transformation indicates. You can optionally map additional input fields to other fields on the transformation.
Groups and sequence ID values
In duplicate analysis, a group is a set of records that contain identical values in a given field. At run time, the Deduplicate transformation analyzes the index data for records exclusively within each group and subsequently combines the results from each group into a single data set. Use the GroupKey field on the Field Mapping tab to define your groups. When you create groups on an appropriate field, you reduce the overall number of comparisons that the transformation must perform without any meaningful loss of accuracy in duplicate analysis.
The GroupKey field is mandatory. If you prefer not to sort your input data into groups, add a column to your data set that has the same value on every row and map the column to the GroupKey field.
Sequence ID values determine the order in which the transformation reads the input records. If your input records do not contain a field that can provide data to the SequenceId field, the transformation reads the records in the order in which they appear in the input data set. A SequenceId field is mandatory if you run a mapping in advanced mode.
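To see why grouping reduces the workload, consider a simplified model in which every record in a group is compared with every other record in the same group. This is an assumption made for illustration, because the transformation works from an index and might not compare every pair. Under that model, a group of n records needs n(n-1)/2 comparisons. The following Python sketch counts the comparisons for a hypothetical set of GroupKey values and for the same data with a single constant GroupKey value.

```python
from collections import Counter


def pair_count(n: int) -> int:
    """Number of distinct record pairs in a group of n records."""
    return n * (n - 1) // 2


def comparisons_by_group(group_keys) -> int:
    """Total pairwise comparisons when records are compared only within each group."""
    sizes = Counter(group_keys)
    return sum(pair_count(n) for n in sizes.values())


# Hypothetical GroupKey values for ten records.
keys = ["NY"] * 4 + ["CA"] * 3 + ["TX"] * 3
grouped = comparisons_by_group(keys)

# The same ten records with one constant GroupKey value on every row.
ungrouped = comparisons_by_group(["ALL"] * len(keys))

print(grouped, ungrouped)  # prints "12 45"
```

A constant GroupKey value satisfies the mandatory field but places all records in one group, so the comparison count is the same as for ungrouped data.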
Clusters and scores
When two or more records match each other, the transformation assigns them to the same matching set and adds an ID value to each record that identifies them as members of the set.
A set of matching records within a group is also known as a cluster, and the ID value that associates matching records together is the cluster ID.
The transformation represents the relationships between matching records with link score and driver score values in the output data set. The link score is the score between two records that identifies them as members of the same cluster. The driver score is the score between the first record added to a cluster and another record in the cluster.
Note that the transformation generates a single score for each pair of matching records that it finds. The link and driver scores describe different relationships between records. They do not come from different calculations.
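The relationship between cluster IDs, link scores, and driver scores can be sketched in Python as follows. This is a simplified model and not the transformation's implementation: it reads records in order, uses `difflib.SequenceMatcher` as a stand-in scoring function, and leaves both scores empty (`None`) for the first record in a cluster. The field names, the threshold, and that handling of the first record are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Single pairwise score (0-100); a stand-in for the asset's comparison criteria."""
    return round(100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio(), 1)


def cluster_records(records, threshold):
    """Assign each record a cluster ID, a link score, and a driver score."""
    clusters = []  # each cluster is a list of record indexes; the first index is the driver
    output = []
    for idx, rec in enumerate(records):
        best = None  # (link score, cluster ID) of the strongest match found so far
        for cid, members in enumerate(clusters):
            for m in members:
                s = score(rec, records[m])
                if s > threshold and (best is None or s > best[0]):
                    best = (s, cid)
        if best is None:
            # No match: the record starts a new cluster (assumed to carry no scores).
            clusters.append([idx])
            output.append({"record": rec, "cluster_id": len(clusters) - 1,
                           "link_score": None, "driver_score": None})
        else:
            link, cid = best
            driver = records[clusters[cid][0]]
            clusters[cid].append(idx)
            # The driver score reuses the same pairwise function against the first record.
            output.append({"record": rec, "cluster_id": cid,
                           "link_score": link, "driver_score": score(rec, driver)})
    return output


# Hypothetical input records and threshold.
rows = cluster_records(["John Smith", "Jon Smith", "Jon Smyth", "Mary Jones"], threshold=85.0)
for row in rows:
    print(row)
```

In this sample, the third record joins the cluster through its match with the second record, so its link score is higher than its driver score against the first record. Both values come from the same `score` function applied to different pairs.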
Metadata fields
On the Output Fields tab, the transformation adds fields that display the score values for pairs of matching records. The fields also identify the cluster to which each record belongs. If the deduplicate asset specifies a consolidation process, the metadata fields specify a preferred record for each cluster. The transformation identifies the preferred record as the survivor record.
Use the fields to understand the mapping results.
For more information about the metadata fields, see "Metadata fields on the Deduplicate transformation" and "Link scores and driver scores."
