
Link scores and driver scores

The deduplication process adds link score and driver score data to the Deduplicate transformation output. You can use the data to better understand the relationship between duplicate records.

The link score is the score between two records that identifies them as members of the same matching set. The score represents a link between a given record and the first record that it matches with a score above the threshold value. The link ID identifies the records to which a link score applies.

The link score and link ID values do not imply that a pair of records is the best match in the input data. The purpose of the link score and link ID values is to describe the composition of the matching record set.

The driver score is the score between the first record added to a matching record set and another record in the same set. The transformation uses the sequence ID or row ID values to identify the first record in the set. Driver scores provide a means to assess all records in the set against a single record.

Duplicate analysis generates a single set of scores for the input records. The driver scores and link scores represent the different relationships between the records and do not indicate different types of duplicate analysis. The driver score and link score assignments can depend on the order in which the records enter the transformation. A driver score for a given pair of records might be lower than the threshold value.

Example of link scores and driver scores

A Deduplicate transformation analyzes records with a column of surname data. The deduplicate asset defines a threshold value of 0.825 for duplicate records.

The following table shows the results that the transformation might return:

Surname	Sequence ID	ClusterId	ClusterSize	DriverId	DriverScore	LinkId	LinkScore
SMITH	1	1	2	1 - 6	1	1 - 1	1
SMYTH	2	2	2	1 - 3	0.83333	1 - 2	1
SMYTHE	3	2	2	1 - 3	1	1 - 2	0.83333
SMITT	4	3	1	1 - 4	1	1 - 4	1
SMITS	5	4	1	1 - 5	1	1 - 5	1
SMITH	6	1	2	1 - 6	1	1 - 1	1

The results provide the following information about the surname data:

SMITT and SMITS do not match any other record with a score that meets the threshold. The transformation determines that the records are unique in the data set. The transformation can assign score values of 1 to the records because each record matches itself uniquely.

SMITT and SMITS each have a ClusterSize value of 1, which indicates that they are the only record in their respective sets. To find unique records in the output, search for matching record sets that contain a single record.

SMITH and SMITH have a link score of 1. The transformation determines that the records are identical. The transformation adds the records to a single matching record set. The ClusterId value indicates that the records belong to the same set.

SMYTH and SMYTHE match with a score of 0.83333. The score exceeds the duplicate threshold of 0.825. Therefore, the transformation adds the records to a single matching record set.
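
The relationship between link scores and driver scores can be illustrated with a minimal Python sketch that replays the surname example. This is not the algorithm that the Deduplicate transformation uses, which this section does not describe. The sketch assumes a normalized edit-distance similarity (hypothetical `similarity` function), treats a score at or above the threshold as a match, and takes the first record added to a set as the driver. The function and field names are illustrative, and the sketch does not try to reproduce the ID formats or the driver assignments in the table.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character from b
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed stand-in score: 1 minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def score_records(values, threshold):
    """Assign cluster, link, and driver data to records in input order."""
    cluster_of = []  # cluster number for each record seen so far
    driver_of = {}   # cluster number -> index of the first record in the set
    rows = []
    for i, value in enumerate(values):
        # Link: the first earlier record that matches at or above the threshold.
        link, link_score = i, 1.0
        for j in range(i):
            score = similarity(value, values[j])
            if score >= threshold:
                link, link_score = j, score
                break
        if link == i:
            # No match found: the record starts a new set and is its driver.
            cluster = len(driver_of) + 1
            driver_of[cluster] = i
        else:
            cluster = cluster_of[link]
        cluster_of.append(cluster)
        driver = driver_of[cluster]
        rows.append({
            "sequence_id": i + 1,
            "cluster": cluster,
            "driver_id": driver + 1,
            "driver_score": similarity(value, values[driver]),
            "link_id": link + 1,
            "link_score": link_score,
        })
    return rows


if __name__ == "__main__":
    surnames = ["SMITH", "SMYTH", "SMYTHE", "SMITT", "SMITS", "SMITH"]
    for row in score_records(surnames, threshold=0.825):
        print(row)
```

Under these assumptions, the sketch groups the two SMITH records in one set and SMYTH with SMYTHE in another, and leaves SMITT and SMITS as single-record sets. If you lower the threshold so that SMYTH links to SMITH, SMYTHE still links to SMYTH, but its driver score against SMITH falls below the threshold. This illustrates why a driver score can be lower than the threshold value and why the scores can depend on input order.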
