User Guide

Post-Clustering Job

Use the post-clustering job to read the output files of an initial clustering job from HDFS and process the input data according to the mode that you configure. The input data can be linked data or poor-quality data.

You can run the post-clustering job in one of the following modes:

Skip: Skips the records in the high-volume clusters that contain more than the specified number of records.
Recluster: Re-links the records in the high-volume clusters that contain more than the specified number of records.
Longtail: Decrypts the poor-quality records that the initial clustering job identifies and restores them to the original input format. You can cleanse the decrypted data and then use it as the input data for the initial clustering job.
Export: Exports the linked data in CSV format.
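
The difference between the skip and recluster modes can be illustrated with a minimal Python sketch. This is not the product's implementation or API. The function names, the `(cluster_id, record)` input layout, and the `max_cluster_size` parameter are all hypothetical and chosen only to show how a record-count threshold separates high-volume clusters from the rest.

```python
from collections import defaultdict


def partition_clusters(records, max_cluster_size):
    """Group records by cluster ID and split the clusters by size.

    records: iterable of (cluster_id, record) pairs (hypothetical layout).
    A cluster is high-volume when it contains MORE than max_cluster_size records.
    """
    clusters = defaultdict(list)
    for cluster_id, record in records:
        clusters[cluster_id].append(record)

    normal = {cid: recs for cid, recs in clusters.items()
              if len(recs) <= max_cluster_size}
    high_volume = {cid: recs for cid, recs in clusters.items()
                   if len(recs) > max_cluster_size}
    return normal, high_volume


def post_cluster(records, mode, max_cluster_size):
    """Illustrate the skip and recluster modes (assumed behavior only)."""
    normal, high_volume = partition_clusters(records, max_cluster_size)
    if mode == "skip":
        # Skip: records in high-volume clusters are left out.
        return {"kept": normal, "to_relink": {}}
    if mode == "recluster":
        # Recluster: records in high-volume clusters are queued for re-linking.
        return {"kept": normal, "to_relink": high_volume}
    raise ValueError(f"unsupported mode in this sketch: {mode}")


sample = [("c1", "a"), ("c1", "b"), ("c1", "c"), ("c2", "d")]
skipped = post_cluster(sample, "skip", max_cluster_size=2)
reclustered = post_cluster(sample, "recluster", max_cluster_size=2)
```

With a threshold of 2, cluster `c1` (three records) counts as high-volume in this sketch, so skip mode omits it and recluster mode marks it for re-linking, while cluster `c2` is kept in both cases.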

The following image shows how the post-clustering job processes the input data in the skip, recluster, and longtail modes: