The repository tokenization job creates match tokens for the input data in HDFS and loads the tokenized data into the repository. The repository tokenization job uses the columns that you configure as index fields to generate the match tokens.
The repository tokenization job performs the tasks of an HDFS tokenization job and a load clustering job. The repository tokenization job reads the input data in HDFS, creates tokenized data in HDFS, and loads the tokenized data into the repository. The tokenized data includes input records and their match tokens. To incrementally update the tokenized data in the repository, use the repository update job.
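The match-token algorithm itself is internal to the product, but conceptually each configured index field is reduced to a compact key so that similar values yield the same token. The following minimal Java sketch is a hypothetical stand-in for the real tokenizer, not the product's algorithm: it keeps the first letter and up to three following consonants of each index field value.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;

    // Hypothetical illustration only; the product's actual match-token
    // algorithm is not public. This sketch builds a crude consonant-skeleton
    // key for each index field value.
    public final class MatchTokenSketch {

        static String token(String value) {
            String letters = value.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
            if (letters.isEmpty()) {
                return "";
            }
            StringBuilder key = new StringBuilder().append(letters.charAt(0));
            for (int i = 1; i < letters.length() && key.length() < 4; i++) {
                char c = letters.charAt(i);
                if ("AEIOUY".indexOf(c) < 0) { // drop vowels after the first character
                    key.append(c);
                }
            }
            return key.toString();
        }

        public static void main(String[] args) {
            List<String> tokens = new ArrayList<>();
            for (String field : new String[] {"Smith", "Smyth"}) {
                tokens.add(token(field));
            }
            System.out.println(tokens); // [SMTH, SMTH]: both spellings share a token
        }
    }

Because both spellings collapse to the same key, records with minor variations in an index field can still be grouped as match candidates.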
When you run the repository tokenization job, the job performs the following tasks (a hedged MapReduce sketch follows the list):
1. Reads the input files in HDFS.
2. Generates match tokens for the input data.
3. Writes the tokenized data to the output files in HDFS.
4. Loads the tokenized data into the repository.
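In Hadoop terms, the read-tokenize-write portion of these tasks maps naturally onto a MapReduce pass. The following sketch is an assumption, not the product's implementation: a mapper that parses each comma-delimited input record, derives a token from two hypothetical index columns using the MatchTokenSketch helper above, and emits (token, record) pairs for the reducers to group and write back to HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: called once per input record from the HDFS split.
    // Records that share a match token reach the same reducer, so the
    // reducers can group them and write the tokenized output files.
    public class RepositoryTokenizeMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] fields = record.toString().split(",", -1);
            if (fields.length < 2) {
                return; // skip malformed records in this sketch
            }
            // Assumption: columns 0 and 1 are the configured index fields.
            String token = MatchTokenSketch.token(fields[0])
                    + "|" + MatchTokenSketch.token(fields[1]);
            context.write(new Text(token), record);
        }
    }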
The number of output files depends on the number of reducers that you run.
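This reducer-to-file relationship is standard MapReduce behavior: each reducer writes one part-r-NNNNN file, so a job configured with eight reducers produces eight output files. The driver sketch below is hypothetical (the class name and path arguments are assumptions, not the product's actual entry point) and shows only how the reducer count sets the number of output files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: configures the reducer count, which determines
    // the number of tokenized output files written to HDFS.
    public class TokenizationDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "repository-tokenization");
            job.setJarByClass(TokenizationDriver.class);
            job.setMapperClass(RepositoryTokenizeMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setNumReduceTasks(8); // 8 reducers -> 8 part-r-NNNNN output files
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }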