The repository batch search job uses the match tokens to identify the records in the repository that match the input data. The job reads the input data from HDFS and writes the matching records to output files in HDFS.
The repository batch search job requires the repository to contain all the columns along with the match tokens. To include all the columns, set the StoreAllFields parameter to true in the configuration file when you tokenize the input data.
The following image shows how the repository batch search job searches for the matching records in the repository:
When you run the repository batch search job, the job performs the following tasks:
Reads the input files in HDFS.
Compares the input data against the tokenized data in the repository based on the match tokens.
Writes the matching records for the input data to the output files in HDFS.
The number of output files depends on the number of reducers that you run.
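The tasks above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the product's implementation: the function names, the record layout (id, tokens), and the partitioning by input record ID are all assumptions made for the example. It treats a shared match token as enough to report a match, and it assumes the usual Hadoop MapReduce behavior in which each reducer writes one output file.

```python
import zlib
from collections import defaultdict


def build_token_index(repository):
    """Map each match token to the repository records that carry it."""
    index = defaultdict(list)
    for record in repository:
        for token in record["tokens"]:
            index[token].append(record)
    return index


def batch_search(input_records, repository, num_reducers):
    """Return one list of (input_id, repository_id) pairs per reducer.

    Hypothetical stand-in for the job: each inner list plays the role
    of one output file in HDFS.
    """
    index = build_token_index(repository)
    outputs = [[] for _ in range(num_reducers)]
    for rec in input_records:
        # All matches for one input record go to the same "reducer".
        # crc32 is used because it is stable across Python processes.
        part = zlib.crc32(rec["id"].encode("utf-8")) % num_reducers
        seen = set()
        for token in rec["tokens"]:
            for candidate in index.get(token, []):
                if candidate["id"] in seen:
                    continue  # already matched through another token
                seen.add(candidate["id"])
                outputs[part].append((rec["id"], candidate["id"]))
    return outputs
```

In this sketch, calling batch_search with num_reducers set to 3 always returns three lists, which mirrors how the number of output files follows the number of reducers. Some of the lists can be empty when no input record is routed to that reducer.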