The repository update job updates the tokenized data in the repository with the input data and generates match tokens for that input data. During the update process, the job matches the input data with the repository data, deletes the matching records from the repository, and then adds the input data to the repository.
Before you run the repository update job, ensure that the repository contains tokenized data.
The following image shows how the repository update job updates the repository data:
The repository update job performs the following tasks:
1. Reads the input files in HDFS.
2. Matches the input data with the repository data.
3. Deletes the matching records from the repository.
4. Generates match tokens for the input data.
5. Loads the input data with the match tokens to the repository.
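The tasks above can be illustrated with a minimal in-memory sketch. It is not the product's implementation. The names `Record`, `TokenizedRecord`, `make_match_token`, and `update_repository` are hypothetical. The sketch assumes that records are matched on a record ID and that a match token is simply the normalized name. The actual job reads its input from HDFS and uses its own matching and token-generation logic.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    record_id: str  # hypothetical key used here to match input and repository data
    name: str


class TokenizedRecord(NamedTuple):
    token: str      # match token stored alongside the record
    record: Record


def make_match_token(record: Record) -> str:
    # Assumption: a stand-in for real match-token generation.
    # Here the token is the name in upper case with whitespace removed.
    return "".join(record.name.upper().split())


def update_repository(
    repository: List[TokenizedRecord], input_records: List[Record]
) -> List[TokenizedRecord]:
    # Match the input data with the repository data (by record ID in this sketch).
    input_ids = {r.record_id for r in input_records}

    # Delete the matching records from the repository.
    kept = [t for t in repository if t.record.record_id not in input_ids]

    # Generate match tokens for the input data.
    new_entries = [TokenizedRecord(make_match_token(r), r) for r in input_records]

    # Load the input data with the match tokens to the repository.
    return kept + new_entries


# The repository must already contain tokenized data before the update.
repository = [
    TokenizedRecord("JOHNSMITH", Record("1", "John Smith")),
    TokenizedRecord("ANNLEE", Record("2", "Ann Lee")),
]

# Input data: one changed record (ID 2) and one new record (ID 3).
input_records = [Record("2", "Anne Lee"), Record("3", "Raj Patel")]

updated = update_repository(repository, input_records)
for entry in updated:
    print(entry.record.record_id, entry.token)
```

In this sketch, the record with ID 2 is removed from the repository and replaced with the input version and its newly generated token, the record with ID 1 is left unchanged, and the record with ID 3 is added.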