Use the region splitter job to analyze the tokenized input data and identify the split points that distribute the data uniformly across all the regions in the repository. A uniform distribution makes better use of the available resources and improves search performance.
A load clustering job uses the output files of a region splitter job to distribute the linked data. Run the region splitter job before you run the load clustering job for the first time.
The following image shows how the region splitter job identifies the split points based on the input data:
The region splitter job performs the following tasks:
Reads the tokenized data in HDFS.
Identifies the split points for the number of regions that you specify.
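The idea behind the tasks above can be illustrated with a minimal sketch. It is not the product's implementation: it assumes that the tokenized keys fit in memory as a list of strings (reading from HDFS is omitted) and that split points are chosen as evenly spaced positions in the sorted keys. The names `find_split_points` and `region_sizes` are hypothetical and exist only for this illustration.

```python
from bisect import bisect_right
from collections import Counter


def find_split_points(keys, num_regions):
    """Return up to num_regions - 1 split points that divide the sorted keys
    into roughly equal parts (assumption: simple quantile-style selection)."""
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    ordered = sorted(keys)
    n = len(ordered)
    splits = []
    for i in range(1, num_regions):
        idx = i * n // num_regions
        if idx < n:
            candidate = ordered[idx]
            # Skip duplicates so that the split points stay strictly increasing.
            if not splits or candidate > splits[-1]:
                splits.append(candidate)
    return splits


def region_sizes(keys, splits):
    """Count how many keys fall into each region. A key that equals a split
    point is counted in the region that starts at that split point."""
    counts = Counter(bisect_right(splits, key) for key in keys)
    return [counts.get(region, 0) for region in range(len(splits) + 1)]


if __name__ == "__main__":
    # Hypothetical sample keys standing in for tokenized data.
    sample_keys = [f"tok{i:03d}" for i in range(100)]
    sample_splits = find_split_points(sample_keys, 4)
    print(sample_splits)
    print(region_sizes(sample_keys, sample_splits))
```

In this sketch each split point is the first key of the next region, so the number of regions is one more than the number of split points. When many keys are identical, fewer split points than requested may be returned, and the regions can no longer be perfectly uniform.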