Linking Data and Persisting the Linked Data in a Repository
You can link the input data based on the matching rules and consolidate the linked data based on the consolidation rules. You can then persist the linked and consolidated data in a repository so that you can perform data analytics or searches on the data.
The following image shows the batch jobs that you can run to link the input data, consolidate the linked data, and persist the linked data in a repository:
To persist the linked data in a repository, perform the following tasks:
1. Run the initial clustering job.
   The job links the input data and creates clusters for the input data in HDFS.
2. If you want to process the output files of the initial clustering job, run the post-clustering job.
   The post-clustering job reads the output files that the initial clustering job creates in HDFS and processes them based on the mode that you set.
3. If you want to uniformly distribute the linked data across all the regions in the repository, run the region splitter job.
   The job analyzes the linked data and identifies the split points for all the regions in the repository.
4. Run the load clustering job.
   The job loads the linked data from HDFS into the repository.
5. If you want to consolidate the linked data, run the consolidation job.
   The consolidation job creates a preferred records table that contains a preferred record for each cluster.
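The sequence of jobs above can be sketched as a shell script. Note that the run_job wrapper and the job names below are illustrative placeholders, not the product's actual commands; in a real deployment each step would submit the corresponding batch job to the cluster.

```shell
#!/bin/sh
# Hypothetical sketch of the persistence pipeline; run_job stands in
# for whatever command submits a batch job in your environment.
set -e

run_job() {
    # A real wrapper would submit the named job to the cluster;
    # here it only prints the step so the ordering is visible.
    echo "running: $1"
}

run_job initial_clustering   # link the input data and create clusters in HDFS
run_job post_clustering      # optional: process the initial clustering output files
run_job region_splitter      # optional: identify split points for the repository regions
run_job load_clustering      # load the linked data from HDFS into the repository
run_job consolidation        # optional: build the preferred records table
```

Running the optional steps before load clustering matters: the region splitter must compute split points before the load, and the post-clustering job consumes the initial clustering output before it is loaded.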
To add incremental data to the repository, run the initial clustering job in the incremental mode to link the incremental data, and then run the load clustering job in the incremental mode to add the linked data to the repository.
If you consolidate the linked data, you can also run the consolidation job in the incremental mode to update the preferred records for the incremental data.
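The incremental flow can be sketched the same way. Again, the run_job wrapper, the job names, and the mode argument are assumptions for illustration only; consult the product's job reference for the actual syntax.

```shell
#!/bin/sh
# Hypothetical sketch of the incremental flow; run_job and the mode
# argument are placeholders, not the product's actual CLI.
set -e

run_job() {
    # Print the job name and mode instead of submitting a real job.
    echo "running: $1 (mode: $2)"
}

run_job initial_clustering incremental   # link only the incremental records
run_job load_clustering incremental      # add the newly linked data to the repository
run_job consolidation incremental        # update preferred records for affected clusters
```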
To delete records from the repository, run the repository data deletion job with the