Table of Contents

  1. Preface
  2. Introduction to Informatica MDM - Relate 360
  3. Linking Batch Data
  4. Tokenizing Batch Data
  5. Processing Streaming Data
  6. Creating Relationship Graph
  7. Loading Linked and Consolidated Data into Hive
  8. Searching Data
  9. Monitoring the Batch Jobs
  10. Troubleshooting
  11. Glossary

User Guide

Repository Tokenization Job

The repository tokenization job creates match tokens for the input data in HDFS and loads the tokenized data into the repository. The repository tokenization job uses the columns that you configure as index fields to generate the match tokens.
The repository tokenization job performs the tasks of an HDFS tokenization job and a load clustering job: it reads the input data in HDFS, creates tokenized data in HDFS, and loads the tokenized data into the repository. The tokenized data includes the input records and their match tokens. To incrementally update the tokenized data in the repository, use the repository update job.
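The idea behind match tokens can be illustrated with a toy sketch. Relate 360 generates match tokens with its own matching engine from the columns configured as index fields; the consonant-skeleton "token" below is purely an assumption for illustration, showing how similar values in the index fields collapse to the same key:

```python
def match_token(record, index_fields):
    """Build a toy match token from the configured index fields.

    Illustrative only: this is not Relate 360's token algorithm.
    """
    parts = []
    for field in index_fields:
        value = record.get(field, "").upper()
        # Drop vowels and non-letters so close spellings share a token.
        skeleton = "".join(c for c in value if c.isalpha() and c not in "AEIOUY")
        parts.append(skeleton)
    return "|".join(parts)

# Two spellings of the same surname yield the same token,
# so the records become candidates for matching.
print(match_token({"last_name": "Smith"}, ["last_name"]))  # SMTH
print(match_token({"last_name": "Smyth"}, ["last_name"]))  # SMTH
```

Records that share a token land in the same candidate set, which is what makes fuzzy matching feasible at batch scale.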
The following image shows how the repository tokenization job creates match tokens for the input data and loads the tokenized data into a repository:
Image: The repository tokenization job reads the input files in HDFS, creates match tokens in HDFS, and loads the tokenized data into the repository.
When you run the repository tokenization job, the job performs the following tasks:
  1. Reads the input files in HDFS.
  2. Generates match tokens for the input data.
  3. Writes the tokenized data to the output files in HDFS.
    The tokenized data includes input records and their match tokens.
    The number of output files depends on the number of reducers that you run.
  4. Loads the tokenized data into the repository.
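The four steps above can be sketched as plain Python. The real job runs as MapReduce on HDFS; the record layout, the hash partitioning, the `make_token` helper, and the in-memory "repository" list here are all assumptions made for illustration:

```python
def make_token(record):
    # Stand-in for real match-token generation on the index fields.
    return record["last_name"].upper()

def run_repository_tokenization(input_records, num_reducers, repository):
    # Steps 1-2: read each input record and generate its match token.
    tokenized = [dict(rec, match_token=make_token(rec)) for rec in input_records]
    # Step 3: write the tokenized data to one output partition per reducer,
    # so the number of output files equals the number of reducers.
    outputs = [[] for _ in range(num_reducers)]
    for rec in tokenized:
        outputs[hash(rec["match_token"]) % num_reducers].append(rec)
    # Step 4: load the tokenized data (input records plus tokens)
    # into the repository.
    for part in outputs:
        repository.extend(part)
    return outputs

records = [{"last_name": "Smith"}, {"last_name": "Jones"}, {"last_name": "Lee"}]
repo = []
parts = run_repository_tokenization(records, num_reducers=2, repository=repo)
print(len(parts), len(repo))  # 2 3
```

Partitioning by the token's hash mirrors how a reducer count shapes the output: each reducer writes one file, and records with the same token land in the same partition.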
