Metadata fields on the Deduplicate transformation

The Deduplicate transformation includes a set of predefined fields that contain metadata for the deduplication and consolidation processes. The transformation creates the fields by default and populates the fields when the mapping runs.

Metadata fields on the Field Mapping tab

The Target Fields list on the Field Mapping tab includes the following metadata fields:
GroupKey
Contains the data values that the transformation uses to sort input records into groups for duplicate analysis.
SequenceId
Contains a unique identifier for each record that enters the transformation.
The transformation uses the sequence ID values to identify records in the Out_DriverId and Out_LinkId data. If you do not map the SequenceId field, the transformation uses the values in the Out_RowId field as unique identifiers for the records.

Metadata fields on the Output Fields tab

The Output Fields tab includes the following metadata fields:
Out_ClusterId
Contains the identifier of the cluster to which each record belongs.
In the deduplication process, a cluster is a set of records whose data values match each other to a degree that exceeds the duplicate threshold. Records in the same set are likely to represent the same identity. A set may contain a single record, as every unique record is a perfect match with itself.
Out_ClusterSize
Contains the number of records in the set to which the current record belongs. When a set contains a unique record, the cluster size is 1.
Out_DriverId
Contains the identifier of the driver record in each matching record set. The driver record is the record in the set with the lowest value on the SequenceId input field. If the transformation does not use the SequenceId field, the driver record is the record in the matching set with the lowest Out_RowId value.
Out_DriverScore
Contains the score that represents the degree of similarity between the current record and the driver record in the matching record set.
Out_IsSurvivor
Contains an identifier for the preferred record that a consolidation process specifies.
Out_LinkId
Contains the identifier of the record that matched with the current record and linked it to the matching record set.
Out_LinkScore
Contains the score between two records that results in the addition of a record to a matching record set. The Out_LinkId field identifies the record with which the current record shares the link score.
Out_RowId
Contains a unique identifier for each record in the mapping source data set.
The transformation uses the Out_RowId values to identify records if you do not map a field of unique identifiers to the SequenceId field.
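
The relationship between the cluster fields can be illustrated with a short Python sketch. It does not show how the transformation is implemented; it only applies the definitions above to invented data. The function name derive_cluster_metadata and the sample rows are hypothetical, and the sketch assumes that the SequenceId field is mapped.

```python
from collections import defaultdict

# Hypothetical (SequenceId, Out_ClusterId) pairs; the values are invented.
SAMPLE_ROWS = [(101, 1), (102, 1), (103, 2), (104, 1)]


def derive_cluster_metadata(rows):
    """Return, per SequenceId, the cluster size and driver ID implied by the definitions."""
    members_by_cluster = defaultdict(list)
    for sequence_id, cluster_id in rows:
        members_by_cluster[cluster_id].append(sequence_id)

    metadata = {}
    for cluster_id, members in members_by_cluster.items():
        driver_id = min(members)  # driver = lowest SequenceId in the cluster
        for sequence_id in members:
            metadata[sequence_id] = {
                "Out_ClusterId": cluster_id,
                "Out_ClusterSize": len(members),
                "Out_DriverId": driver_id,
            }
    return metadata


if __name__ == "__main__":
    for sequence_id, fields in sorted(derive_cluster_metadata(SAMPLE_ROWS).items()):
        print(sequence_id, fields)
```

In this sample, record 103 matches no other record, so it forms a cluster of size 1 and is its own driver record.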

Selecting metadata fields

The metadata fields can provide important information about the relationship between duplicate records. For example, the metadata includes the Out_LinkScore field, which represents the degree of similarity between two records as a numerical value. If you select the Out_LinkScore field, select the Out_LinkId field also. The Out_LinkId field identifies the other record in the pair of records that the Out_LinkScore value describes.
The Out_DriverScore value provides a benchmark for all records in a matching record set. The Out_DriverScore value is the score between the current record and the driver record, which is the record in the set with the lowest sequence ID or row ID value. The Out_DriverId field identifies the driver record. The record with the lowest ID is also the first record that the deduplication process added to the set.
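
As an illustration of why the Out_LinkScore and Out_LinkId fields belong together, a downstream step that reads both fields might pair them as in the following Python sketch. The row layout, the sample values, and the link_pairs helper are hypothetical. In particular, the sample shows the driver row linking to itself with a score of 1.0, which is an assumption made for the example and is not stated in this topic.

```python
# Hypothetical output rows; all identifiers and scores are invented.
# Assumption: the driver row (101) links to itself.
SAMPLE_OUTPUT = [
    {"SequenceId": 101, "Out_LinkId": 101, "Out_LinkScore": 1.0},
    {"SequenceId": 102, "Out_LinkId": 101, "Out_LinkScore": 0.93},
    {"SequenceId": 104, "Out_LinkId": 102, "Out_LinkScore": 0.88},
]


def link_pairs(rows):
    """Return (record ID, linked record ID, link score) for rows linked to another record."""
    return [
        (row["SequenceId"], row["Out_LinkId"], row["Out_LinkScore"])
        for row in rows
        if row["Out_LinkId"] != row["SequenceId"]  # skip self-links
    ]


if __name__ == "__main__":
    for record_id, linked_id, score in link_pairs(SAMPLE_OUTPUT):
        print(f"record {record_id} joined the set through record {linked_id} (score {score})")
```

Without the Out_LinkId value, the score of 0.88 on record 104 could not be attributed to the pair of records 104 and 102.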