Table of Contents

  1. Preface
  2. Transformations
  3. Source transformation
  4. Target transformation
  5. Access Policy transformation
  6. Aggregator transformation
  7. B2B transformation
  8. Chunking transformation
  9. Cleanse transformation
  10. Data Masking transformation
  11. Data Services transformation
  12. Deduplicate transformation
  13. Expression transformation
  14. Filter transformation
  15. Hierarchy Builder transformation
  16. Hierarchy Parser transformation
  17. Hierarchy Processor transformation
  18. Input transformation
  19. Java transformation
  20. Java transformation API reference
  21. Joiner transformation
  22. Labeler transformation
  23. Lookup transformation
  24. Machine Learning transformation
  25. Mapplet transformation
  26. Normalizer transformation
  27. Output transformation
  28. Parse transformation
  29. Python transformation
  30. Rank transformation
  31. Router transformation
  32. Rule Specification transformation
  33. Sequence transformation
  34. Sorter transformation
  35. SQL transformation
  36. Structure Parser transformation
  37. Transaction Control transformation
  38. Union transformation
  39. Vector Embedding transformation
  40. Velocity transformation
  41. Verifier transformation
  42. Web Services transformation

Transformations

Groups in duplicate analysis

A duplicate analysis mapping can take time to run because of the number of data comparisons that the Deduplicate transformation must perform. The transformation compares each record with every other record, so the number of comparisons grows quadratically with the number of data values in the fields that you select.
The following table shows the number of calculations that a mapping performs on a single field:

Number of data values    Number of comparisons
10,000                   50 million
100,000                  5,000 million
1 million                500,000 million
To reduce the time that the mapping takes to run, you configure the Deduplicate transformation to assign the input records to groups.
A group is a set of records that contain identical values on a field that you specify. When you perform duplicate analysis on grouped data, the Deduplicate transformation analyzes the record data exclusively within each group and combines the results from each group into a single output data set. The field on which you group the data is the GroupKey field. When you choose an appropriate group key, you reduce the overall number of comparisons that the Deduplicate transformation must perform without any meaningful loss of accuracy in the mapping analysis. Select the GroupKey field in the Deduplicate transformation.
The following table shows the number of calculations that a mapping performs on a single field that you sort into ten groups:

Number of data values    Number of groups    Group size    Total number of comparisons (all groups)
10,000                   10                  1,000         5 million
100,000                  10                  10,000        500 million
1 million                10                  100,000       50,000 million
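The savings shown in the tables follow from the pairwise-comparison count: n values require n(n-1)/2 comparisons, and splitting the data into groups applies that formula to each smaller group instead of the whole set. A minimal sketch (the helper names are illustrative, not part of the product):

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unique record pairs among n values: n choose 2."""
    return n * (n - 1) // 2

def grouped_comparisons(n: int, groups: int) -> int:
    """Comparisons when n values split evenly into the given number of groups."""
    return groups * pairwise_comparisons(n // groups)

# 10,000 ungrouped values: roughly 50 million comparisons
print(pairwise_comparisons(10_000))     # 49995000
# The same values in 10 groups of 1,000: roughly 5 million comparisons
print(grouped_comparisons(10_000, 10))  # 4995000
```

The exact counts round to the figures in the tables above.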
Consider the following rules and guidelines when you organize data into groups:
  • The GroupKey field must contain a range of identical values, such as a city name or a state name in an address data set.
  • Do not select a group key that contains information that is relevant to the duplicate analysis. For example, do not select the index key field as the GroupKey field. The goal in group creation is to organize the data according to values whose duplicate nature is not relevant to the objectives of the analysis.
  • When you select a group key, consider whether the transformation can create groups that are a valid size relative to your input data. If the groups are too small, the match analysis might not find all of the duplicate records in the data set. If the groups are too large, the match analysis might return false duplicates.
  • If your data does not contain a suitable field for group keys, create a data column that the transformation can use to sort the records into the group sizes that you require. For example, for a data set that contains 1 million records, you might decide to create a column that repeats a series of values from 1 through 50. The records in each group will be distributed evenly in the data set and will allow the duplicate analysis to proceed on grouped data.
  • If you do not want to sort the records into groups, specify a GroupKey field that contains the same value in every record. If a suitable field does not exist, create the field. For example, create a column of data in which every value is Group1, and select the column as the GroupKey field. When the mapping runs, the Deduplicate transformation sorts the records by the GroupKey field values and therefore assigns every record to the same group.
  • Groups do not reorder the position of the records in the mapping data set.
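To illustrate the grouping idea outside the product, the sketch below assigns records to groups by a GroupKey field and then compares record pairs only within each group. The data, the field names, and the exact-name match test are all illustrative stand-ins for the transformation's actual match analysis:

```python
from collections import defaultdict
from itertools import combinations

# Illustrative records; "city" serves as the GroupKey field.
records = [
    {"id": 1, "city": "Austin", "name": "Ann Lee"},
    {"id": 2, "city": "Boston", "name": "Ann Lee"},
    {"id": 3, "city": "Austin", "name": "Ann Lee"},
    {"id": 4, "city": "Boston", "name": "Bo Chan"},
]

# Assign each record to a group by its GroupKey value.
groups = defaultdict(list)
for rec in records:
    groups[rec["city"]].append(rec)

# Compare pairs only within each group; an identical name stands in
# for the transformation's duplicate-match logic.
duplicates = [
    (a["id"], b["id"])
    for members in groups.values()
    for a, b in combinations(members, 2)
    if a["name"] == b["name"]
]
print(duplicates)  # [(1, 3)]
```

Records 1 and 2 share a name but fall into different groups, so they are never compared; this is why the group key must not carry information that is relevant to the duplicate analysis itself.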
