Transformations


Deduplication and consolidation operations

When you run a mapping, the Deduplicate transformation generates a temporary index from the input records that it reads. The transformation analyzes the index to find pairs of similar records.

The transformation calculates a series of percentage scores that represent the degrees of similarity between the pairs of records that it finds. If two records match each other with a score that exceeds a given threshold, the transformation considers the records to be duplicates.

The deduplicate asset that you add to the transformation specifies the comparison criteria for the deduplication operation, including the threshold score that duplicate records must satisfy.
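The scoring and threshold test described above can be illustrated with a minimal sketch. It is not the matching algorithm that the transformation uses: the `similarity` function is a stand-in built on Python's `difflib`, the names `find_duplicate_pairs`, `records`, and `threshold` are hypothetical, and the sketch assumes that a pair qualifies only when its score is strictly greater than the threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0-100 percentage score for two strings (stand-in metric)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_pairs(records, threshold):
    """Return (id_a, id_b, score) for each pair whose score exceeds the threshold."""
    pairs = []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = similarity(name_a, name_b)
        if score > threshold:  # assumption: strict comparison
            pairs.append((id_a, id_b, round(score, 1)))
    return pairs


# Hypothetical input records: (record ID, person name)
records = [(1, "John Smith"), (2, "Jon Smith"), (3, "Maria Garcia")]
pairs = find_duplicate_pairs(records, threshold=85.0)
print(pairs)
```

In this sketch, only records 1 and 2 score above the threshold, so they are the only pair reported as duplicates.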

Consolidation is an optional process that the deduplicate asset can specify for the transformation. During consolidation, the transformation evaluates the sets of matching records that the deduplication process identifies. The transformation selects or constructs a preferred version of the records in each matching set.
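As a rough illustration of the difference between selecting and constructing a preferred record, consider the following sketch. The strategies shown here are assumptions chosen for the example, not the rules that a deduplicate asset applies: `select_survivor` picks the existing record with the most populated fields, and `construct_survivor` builds a new record from the first non-empty value in each field. All function and field names are hypothetical.

```python
def select_survivor(cluster):
    """Select the existing record that has the most populated fields."""
    return max(cluster, key=lambda rec: sum(1 for v in rec.values() if v))


def construct_survivor(cluster):
    """Construct a record from the first non-empty value found for each field."""
    survivor = {}
    for rec in cluster:
        for field, value in rec.items():
            if value and not survivor.get(field):
                survivor[field] = value
    return survivor


# Hypothetical set of matching records
cluster = [
    {"name": "John Smith", "phone": "", "city": "Austin"},
    {"name": "Jon Smith", "phone": "555-0100", "city": ""},
]

selected = select_survivor(cluster)
constructed = construct_survivor(cluster)
print(selected)
print(constructed)
```

The selected record is one of the input records unchanged, while the constructed record combines values from both inputs.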

A Data Quality user configures the deduplication and consolidation processes in the deduplicate asset. For more information about the criteria that the asset defines, contact the Data Quality user.

Rules and guidelines for deduplication and consolidation

When you add a Deduplicate transformation to a mapping, consider the following rules and guidelines:

Mapping fields for identity analysis: The deduplicate asset that you add to the transformation specifies a type of identity, such as a person name or an organization name. The asset identifies the identity type as the objective of the deduplication operation. The type of identity defines the types of information that the transformation expects to find in the index.

You must map the appropriate input fields on the transformation to the target fields that the transformation indicates. You can optionally map additional input fields to other fields on the transformation.

Groups and sequence ID values: In duplicate analysis, a group is a set of records that contain identical values in a given field. At run time, the Deduplicate transformation analyzes the index data for records exclusively within each group and subsequently combines the results from each group into a single data set. Use the GroupKey field on the Field Mapping tab to define your groups. When you create groups on an appropriate field, you reduce the overall number of comparisons that the transformation must perform without any meaningful loss of accuracy in duplicate analysis.

The GroupKey field is mandatory. If you prefer not to sort your input data into groups, add a column to your data set that has the same value on every row and map the column to the GroupKey field.

Sequence ID values determine the order in which the transformation reads the input records. If your input records do not contain a field that can provide data to the SequenceId field, the transformation reads the records in the order in which they appear in the input data set. A SequenceId field is mandatory if you run a mapping in advanced mode.

Clusters and scores: When two or more records match each other, the transformation assigns them to the same matching set and adds an ID value to each record that identifies them as members of the set.

A set of matching records within a group is also known as a cluster, and the ID value that associates matching records together is the cluster ID.

The transformation represents the relationships between matching records with link score and driver score values in the output data set. The link score is the score between two records that identifies them as members of the same cluster. The driver score is the score between the first record added to a cluster and another record in the cluster.

Bear in mind that the transformation generates a single score for each pair of matching records that it finds. The link and driver scores define the types of relationship between different records and do not represent different calculations.

Metadata fields: On the Output Fields tab, the transformation adds fields that display the score values for pairs of matching records. The fields also identify the cluster to which each record belongs. If the deduplicate asset specifies a consolidation process, the metadata fields specify a preferred record for each cluster. The transformation identifies the preferred record as the survivor record.

Use the fields to understand the mapping results.

For more information about the metadata fields, see Metadata fields on the Deduplicate transformation and Link scores and driver scores.
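
The following sketch shows one way in which groups, clusters, link scores, and driver scores can relate to each other. It is an illustration under stated assumptions, not the transformation's implementation: the `score` function is a stand-in built on Python's `difflib`, the clustering strategy (join the cluster of the best-scoring earlier record) is assumed, the driver record of each cluster is given empty score values, and all names and sample records are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Stand-in percentage score; not the product's matching algorithm."""
    return round(100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio(), 1)


def cluster_records(records, threshold):
    """Cluster (sequence_id, group_key, name) tuples, one group at a time."""
    groups = defaultdict(list)
    for rec in sorted(records):  # read records in sequence ID order
        groups[rec[1]].append(rec)

    rows, next_cluster_id = [], 1
    for group_key in sorted(groups):
        clusters = []  # (cluster_id, [(seq_id, name), ...]); first member is the driver
        for seq_id, _, name in groups[group_key]:
            best = None  # (link_score, cluster_id, members)
            for cluster_id, members in clusters:
                for _, other in members:
                    s = score(name, other)
                    if s > threshold and (best is None or s > best[0]):
                        best = (s, cluster_id, members)
            if best is None:
                # No match within the group: the record starts a new cluster.
                clusters.append((next_cluster_id, [(seq_id, name)]))
                rows.append({"seq": seq_id, "group": group_key,
                             "cluster": next_cluster_id,
                             "link": None, "driver": None})
                next_cluster_id += 1
            else:
                link, cluster_id, members = best
                driver = score(name, members[0][1])  # score against the first record
                members.append((seq_id, name))
                rows.append({"seq": seq_id, "group": group_key,
                             "cluster": cluster_id,
                             "link": link, "driver": driver})
    return rows


# Hypothetical input: (SequenceId, GroupKey, person name)
records = [
    (1, "TX", "John Smith"),
    (2, "TX", "Jon Smith"),
    (3, "TX", "Jon Smyth"),
    (4, "NY", "John Smith"),
]
rows = cluster_records(records, threshold=85.0)
for row in rows:
    print(row)
```

In this sketch, record 3 joins the cluster because of its score against record 2 (the link score), while its score against record 1, the first record in the cluster, is reported as the driver score. Both values come from the same `score` function. Record 4 is never compared with record 1 because the two records have different group keys.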