Before you process the input data, you must create the required tables in the repository by running the appropriate batch jobs.
If you plan to perform the linking process, perform the following tasks (see the example commands after this list):
1. Run the initial clustering job with at least one record.
2. If you want to uniformly distribute the linked data across all the regions in the repository, run the region splitter job. The job analyzes the input linked data and identifies the split points for all the regions in the repository.
3. Run the load clustering job. The job creates the primary key table, the link table, and the index table in the repository.
4. If you want to consolidate the linked data, run the create_preferred_records_table.sh script located in the following directory:
   /usr/local/mdmbdrm-<Version Number>
   The script creates an empty preferred records table in the repository.
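The following commands illustrate one possible way to run this sequence from the command line. Only create_preferred_records_table.sh and the installation directory are taken from this section; the other script names and the --config and --input options are assumptions, so substitute the job names and arguments that your installation actually provides.

# Sketch of the linking sequence. Script names other than
# create_preferred_records_table.sh, and all options, are assumptions.
cd /usr/local/mdmbdrm-<Version Number>

# 1. Initial clustering with at least one record to seed the repository tables.
./run_initial_clustering.sh --config=config_big.xml --input=/user/mdm/seed_record.csv

# 2. Optional: compute split points so the linked data is evenly distributed across regions.
./run_region_splitter.sh --config=config_big.xml

# 3. Load clustering; creates the primary key, link, and index tables.
./run_load_clustering.sh --config=config_big.xml

# 4. Optional: create an empty preferred records table for consolidation.
./create_preferred_records_table.sh --config=config_big.xml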
For more information about the initial clustering, region splitter, load clustering, and consolidation jobs, see the Linking Data and Persisting the Linked Data in a Repository section.
If you plan to perform the tokenization process, perform one of the following tasks (see the example commands after this list):
- Run the repository tokenization job with at least one record. The job creates the required tables in the repository.
- Perform the following tasks:
  1. Run the HDFS tokenization job with at least one record.
  2. If you want to uniformly distribute the tokenized data across all the regions in the repository, run the region splitter job. The job analyzes the input tokenized data and identifies the split points for all the regions in the repository.
  3. Run the load clustering job. The job creates the required tables in the repository.
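The following commands sketch the two tokenization options. All script names and options shown here are assumptions rather than documented syntax; use the job names and arguments that your installation provides.

# Option 1: repository tokenization with at least one record;
# the job creates the required tables directly. Names and options are assumptions.
cd /usr/local/mdmbdrm-<Version Number>
./run_repository_tokenization.sh --config=config_big.xml --input=/user/mdm/seed_record.csv

# Option 2: tokenize to HDFS, then load the tokenized data into the repository.
./run_hdfs_tokenization.sh --config=config_big.xml --input=/user/mdm/seed_record.csv
./run_region_splitter.sh --config=config_big.xml    # optional: even distribution across regions
./run_load_clustering.sh --config=config_big.xml    # creates the required tables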
For more information about the repository tokenization, HDFS tokenization, region splitter, and load clustering jobs, see the Tokenizing Data and Persisting the Tokenized Data in a Repository section.