Use the post-clustering job to read the output files of an initial clustering job in HDFS and process the input data based on the mode that you configure. The input data can be linked data or poor-quality data.
You can run the post-clustering job in one of the following modes:
Skip
Skips the records in the high-volume clusters that contain more than the specified number of records.
Recluster
Re-links the records in the high-volume clusters that contain more than the specified number of records.
Longtail
Decrypts the poor-quality records that the initial clustering job identifies and returns them to the original input format. You can cleanse the decrypted data and use it as the input data for the initial clustering job.
Export
Exports the linked data in CSV format.
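The skip and recluster modes both act on clusters whose record count exceeds a threshold that you specify. The following Python sketch illustrates only that selection rule; it is not the post-clustering job's implementation. The function name `split_by_cluster_size`, the parameter `max_cluster_size`, and the representation of records as `(cluster_id, record)` pairs are assumptions made for illustration.

```python
from collections import defaultdict


def split_by_cluster_size(records, max_cluster_size):
    """Partition clustered records by cluster size (illustrative only).

    records: iterable of (cluster_id, record) pairs (hypothetical layout).
    max_cluster_size: hypothetical name for the specified number of records.
    Returns (kept, high_volume), each a dict of cluster_id -> list of records.
    """
    # Group the records by the cluster they were linked into.
    clusters = defaultdict(list)
    for cluster_id, record in records:
        clusters[cluster_id].append(record)

    kept, high_volume = {}, {}
    for cluster_id, members in clusters.items():
        # A cluster is high-volume when it contains MORE than the threshold.
        if len(members) > max_cluster_size:
            high_volume[cluster_id] = members
        else:
            kept[cluster_id] = members
    return kept, high_volume
```

In this sketch, the skip mode corresponds to discarding the `high_volume` records, and the recluster mode corresponds to passing them through the linking step again.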
The following image shows how the post-clustering job processes the input data in the skip, recluster, and longtail modes:
In the skip, recluster, and longtail modes, the post-clustering job performs the following tasks:
Reads the output files of an initial clustering job in HDFS.
Processes the input data based on the mode that you configure.
Writes the processed data to HDFS.
The following image shows how the post-clustering job processes the input data in the export mode:
In the export mode, the post-clustering job performs the following tasks:
Reads the input and output files of an initial clustering job in HDFS.