Metadata fields on the Deduplicate transformation

The Deduplicate transformation includes a set of predefined fields that contain metadata for the deduplication and consolidation processes. The transformation creates the fields by default and populates the fields when the mapping runs.

Metadata fields on the Field Mapping tab

The Target Fields list on the Field Mapping tab includes the following metadata fields:
GroupKey
Contains the data values that the transformation uses to sort input records into groups for duplicate analysis.
SequenceId
Contains a unique identifier for each record that enters the transformation.
The transformation uses the sequence ID values to identify records in the Out_DriverId and Out_LinkId data. If you do not map the SequenceId field, the transformation uses the values in the Out_RowId field as unique identifiers for the records.

Metadata fields on the Output Fields tab

The Output Fields tab includes the following metadata fields:
Out_ClusterId
Contains the identifier of the cluster to which each record belongs.
In the deduplication process, a cluster is a set of records whose data values match each other to a degree that exceeds the duplicate threshold. Records in the same set are likely to represent the same identity. A set may contain a single record, as every unique record is a perfect match with itself.
Out_ClusterSize
Contains the number of records in the set to which the current record belongs. When a set contains a single, unique record, the cluster size is 1.
Out_DriverId
Contains the identifier of the driver record in each matching record set. The driver record is the record in the set with the lowest value on the SequenceId input field. If the transformation does not use the SequenceId field, the driver record is the record in the matching set with the lowest Out_RowId value.
Out_DriverScore
Contains the score that represents the degree of similarity between the current record and the driver record in the matching record set.
Out_IsSurvivor
Contains an identifier for the preferred record that a consolidation process specifies.
Out_LinkId
Contains the identifier of the record that matched with the current record and linked it to the matching record set.
Out_LinkScore
Contains the score between two records that results in the addition of a record to a matching record set. The Out_LinkId field identifies the record with which the current record shares the link score.
Out_RowId
Contains a unique identifier for each record in the mapping source data set.
The transformation uses the Out_RowId values to identify records if you do not map a field of unique identifiers to the SequenceId field.
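
The relationship between the cluster ID, cluster size, and driver ID definitions above can be illustrated with a short Python sketch. It is not how the transformation computes the values; it only recomputes them from a handful of output rows to show what the definitions imply. The row dictionaries, the sample values, and the summarize_clusters helper are hypothetical.

```python
from collections import defaultdict

def summarize_clusters(rows, use_sequence_id=True):
    """Recompute cluster size and driver ID from output rows (illustration only)."""
    # Per the field descriptions, records are identified by SequenceId when it
    # is mapped, and by Out_RowId otherwise.
    id_field = "SequenceId" if use_sequence_id else "Out_RowId"

    clusters = defaultdict(list)
    for row in rows:
        clusters[row["Out_ClusterId"]].append(row)

    summary = {}
    for cluster_id, members in clusters.items():
        summary[cluster_id] = {
            # Out_ClusterSize: number of records in the set.
            "size": len(members),
            # Out_DriverId: the record in the set with the lowest identifier.
            "driver_id": min(m[id_field] for m in members),
        }
    return summary

# Hypothetical output rows: two records in cluster 1, one unique record in cluster 2.
rows = [
    {"SequenceId": 101, "Out_RowId": 1, "Out_ClusterId": 1},
    {"SequenceId": 102, "Out_RowId": 2, "Out_ClusterId": 1},
    {"SequenceId": 103, "Out_RowId": 3, "Out_ClusterId": 2},
]

summary = summarize_clusters(rows)
summary_by_row_id = summarize_clusters(rows, use_sequence_id=False)
```

In this sample, cluster 1 has a size of 2 and a driver ID of 101, while the unique record in cluster 2 has a size of 1 and is its own driver.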

Selecting metadata fields

The metadata fields can provide important information about the relationship between duplicate records. For example, the metadata includes the Out_LinkScore field, which represents the degree of similarity between two records as a numerical value. If you select the Out_LinkScore field, select the Out_LinkId field also. The Out_LinkId field identifies the other record in the pair of records that the Out_LinkScore value describes.
The Out_DriverScore value provides a benchmark for all records in a matching record set. The Out_DriverScore value is the score between the current record and the driver record, which is the record in the set with the lowest sequence ID or row ID value and which the Out_DriverId field identifies. The record with the lowest ID is also the first record that the deduplication process added to the set.
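
As a rough illustration of why Out_LinkScore is read together with Out_LinkId, the following Python sketch pairs each record with the record that linked it to the set and lists the link score next to the driver score. The rows, the score values, and the describe_links helper are hypothetical. The sample leaves the link fields of the driver record empty purely as a placeholder, because the field descriptions do not state what the transformation writes there.

```python
def describe_links(rows, id_field="SequenceId"):
    """Pair each record with the record named in its Out_LinkId (illustration only)."""
    by_id = {row[id_field]: row for row in rows}
    pairs = []
    for row in rows:
        linked = by_id.get(row["Out_LinkId"])
        if linked is None:
            # No linked record available in this sample (placeholder for the driver).
            continue
        pairs.append(
            (row[id_field], linked[id_field], row["Out_LinkScore"], row["Out_DriverScore"])
        )
    return pairs

# Hypothetical rows for one matching record set. Record 103 joined the set
# through record 102, so its link score and driver score differ.
rows = [
    {"SequenceId": 101, "Out_DriverId": 101, "Out_LinkId": None,
     "Out_LinkScore": None, "Out_DriverScore": None},
    {"SequenceId": 102, "Out_DriverId": 101, "Out_LinkId": 101,
     "Out_LinkScore": 0.95, "Out_DriverScore": 0.95},
    {"SequenceId": 103, "Out_DriverId": 101, "Out_LinkId": 102,
     "Out_LinkScore": 0.91, "Out_DriverScore": 0.84},
]

pairs = describe_links(rows)
for record_id, link_id, link_score, driver_score in pairs:
    print(f"{record_id} linked via {link_id}: link={link_score}, driver={driver_score}")
```

Without the Out_LinkId column, the 0.91 score on record 103 could not be attributed to record 102 rather than to the driver record.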
