Linking Data and Persisting the Linked Data in HDFS
You can link the input data based on the matching rules and consolidate the linked data based on the consolidation rules. You can persist the linked and consolidated data in HDFS.
The following image shows the batch jobs that you can run to link data, consolidate the linked data, and persist the data in HDFS:
To persist the linked and consolidated data in HDFS, perform the following tasks:
1. Run the initial clustering job.
   The job links the input data and creates clusters for the input data in HDFS.
2. If you want to process the output files of the initial clustering job, run the post-clustering job.
   The post-clustering job reads the output files that the initial clustering job creates in HDFS and processes them based on the mode that you set.
3. If you want to consolidate the linked data, run the consolidation job.
   The consolidation job creates a preferred record for each cluster.
4. To add incremental data to the linked data and link the incremental data, run the initial clustering job in incremental mode.
5. To delete records from the linked data in HDFS, run the HDFS data deletion job.
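
The order of the tasks above can be sketched as a small planning helper. This is only an illustration of which jobs are required and which are optional: the names `PipelineOptions` and `plan_jobs` are hypothetical, the returned strings are descriptive job labels rather than actual commands or script names, and the sketch assumes that the jobs run in the order in which the tasks are listed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineOptions:
    """Hypothetical switches for the optional tasks described above."""
    post_process: bool = False    # process the output of the initial clustering job
    consolidate: bool = False     # create a preferred record for each cluster
    incremental: bool = False     # link incremental data to the linked data
    delete_records: bool = False  # delete records from the linked data in HDFS


def plan_jobs(options: PipelineOptions) -> List[str]:
    """Return descriptive job labels in the order the tasks are listed."""
    # The initial clustering job is the only job that always runs.
    jobs = ["initial clustering"]
    if options.post_process:
        jobs.append("post-clustering")
    if options.consolidate:
        jobs.append("consolidation")
    if options.incremental:
        jobs.append("initial clustering (incremental mode)")
    if options.delete_records:
        jobs.append("HDFS data deletion")
    return jobs
```

For example, `plan_jobs(PipelineOptions(consolidate=True))` returns the initial clustering job followed by the consolidation job, which matches a run that links the data and creates preferred records without post-processing the clusters.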