User Guide

Running the Post-Clustering Job

The post-clustering job uses the output files of an initial clustering job, so ensure that you run the initial clustering job before you run the post-clustering job.

To run the post-clustering job, run the run_postprocess.sh script located in the following directory:

/usr/local/mdmbdrm-<Version Number>
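Before you launch the post-clustering job, you can verify that the initial clustering output exists in HDFS. The following sketch is a minimal example, not part of the product; the working directory and job ID are placeholders that you must replace with the values from your environment:

#!/bin/bash
# Placeholder values; substitute your own working directory and job ID.
WORKDIR=/usr/hdfs/workingdir
JOB_ID=MDMBDE0063_1602999447744334391

# hdfs dfs -test -d returns 0 if the directory exists in HDFS.
if hdfs dfs -test -d "${WORKDIR}/batch-cluster/${JOB_ID}/output/dir"; then
  echo "Initial clustering output found. You can run run_postprocess.sh."
else
  echo "Initial clustering output not found. Run the initial clustering job first." >&2
  exit 1
fi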

Skip and Recluster Modes

To run the run_postprocess.sh script in the skip or recluster mode, use the following command format:

run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --maxcluster=maximum_number_of_records --mode=SKIP_LARGE_CLUSTER|RECLUSTER_TRANSITIVE --threshold=matching_score_to_recluster [--reducer=number_of_reducer_jobs]
The following list describes the options and arguments that you can specify to run the run_postprocess.sh script:

--config=configuration_file_name
  Absolute path and file name of the configuration file.

--input=input_file_in_HDFS
  Absolute path to the directory that contains the linked data.
  If you run the initial clustering job without the --outputpath parameter, you can find the processed data in the following directory:
  <Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir
  If you run the initial clustering job with the --outputpath parameter, you can find the processed data in the following directory:
  <Output Directory in HDFS>/batch-cluster/output/dir

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file.

--maxcluster=maximum_number_of_records
  Maximum number of records in a cluster. If the number of records in a cluster exceeds this value, the cluster becomes a high-volume cluster, and the post-clustering job processes it.

--mode=SKIP_LARGE_CLUSTER|RECLUSTER_TRANSITIVE
  Mode in which the job processes the input data. Use one of the following modes:
  • SKIP_LARGE_CLUSTER. Skips all the records in the high-volume clusters.
  • RECLUSTER_TRANSITIVE. Re-links all the records in the high-volume clusters.

--threshold=matching_score_to_recluster
  Minimum score required for records to remain in the same cluster when you run the post-clustering job in the recluster mode. The job creates a separate cluster for each record whose score is less than the threshold value. For example, with --threshold=90, a record that matches its cluster at a score of 85 is moved into its own cluster.

--reducer=number_of_reducer_jobs
  Optional. Number of reducer jobs that you want to run. Default is 1.
For example, the following command runs the post-clustering job in the skip mode:
run_postprocess.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/output/dir/pass-join --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --maxcluster=1200 --threshold=90 --mode=SKIP_LARGE_CLUSTER
If you run the post-clustering job without the --outputpath parameter, you can find the processed data in the following directory:

<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output

If you run the post-clustering job with the --outputpath parameter, you can find the processed data in the following directory:

<Output Directory in HDFS>/ClusterPostProcessing/output
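After the job completes, you can list the processed output directly in HDFS. A minimal sketch with placeholder paths; replace <Working Directory in HDFS> and <Job ID> with the values from your run:

# List the post-clustering output (placeholder paths).
hdfs dfs -ls "<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output"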

Longtail Mode

The initial clustering job identifies poor-quality data in the input data and loads it into a directory in an encrypted format. The post-clustering job in the longtail mode decrypts the poor-quality data back to the original input format.

To run the run_postprocess.sh script in the longtail mode, use the following command format:

run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --mode=LONGTAIL_CLUSTERS [--reducer=number_of_reducer_jobs]
The following list describes the options and arguments that you can specify to run the run_postprocess.sh script:

--config=configuration_file_name
  Absolute path and file name of the configuration file.

--input=input_file_in_HDFS
  Absolute path to the directory that contains the poor-quality data.
  If you run the initial clustering job without the --outputpath parameter, you can find the poor-quality data in the following directory:
  <Working Directory in HDFS>/batch-cluster/<Job ID>/poorqualitydata
  If you run the initial clustering job with the --outputpath parameter, you can find the poor-quality data in the following directory:
  <Output Directory in HDFS>/batch-cluster/poorqualitydata

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file.

--mode=LONGTAIL_CLUSTERS
  Indicates that the job processes the poor-quality data.

--reducer=number_of_reducer_jobs
  Optional. Number of reducer jobs that you want to run. Default is 1.
For example, the following command runs the post-clustering job in the longtail mode:
run_postprocess.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/poorqualitydata --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --mode=LONGTAIL_CLUSTERS
If you run the post-clustering job without the --outputpath parameter, you can find the decrypted data in the following directory:

<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output

If you run the post-clustering job with the --outputpath parameter, you can find the decrypted data in the following directory:

<Output Directory in HDFS>/ClusterPostProcessing/output
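To inspect the decrypted records locally, you can copy them out of HDFS. A minimal sketch with placeholder paths; replace the placeholders with the directories from your run:

# Copy the decrypted poor-quality data to the local file system (placeholder paths).
hdfs dfs -get "<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output" ./longtail-output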

Export Mode

To run the run_postprocess.sh script in the export mode, use the following command format:

run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --matchinput=match_output_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --mode=CSV_OUTPUT [--reducer=number_of_reducer_jobs]
The following list describes the options and arguments that you can specify to run the run_postprocess.sh script:

--config=configuration_file_name
  Absolute path and file name of the configuration file.

--input=input_file_in_HDFS
  Absolute path to the directory that contains the input files of the initial clustering job.

--matchinput=match_output_in_HDFS
  Absolute path to the directory that contains the linked data.
  If you run the initial clustering job without the --outputpath parameter, you can find the linked data in the following directory:
  <Working Directory in HDFS>/batch-cluster/<Job ID>/match/dir
  If you run the initial clustering job with the --outputpath parameter, you can find the linked data in the following directory:
  <Output Directory in HDFS>/batch-cluster/match/dir

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file.

--mode=CSV_OUTPUT
  Exports the input data in CSV format.

--reducer=number_of_reducer_jobs
  Optional. Number of reducer jobs that you want to run. Default is 1.
For example, the following command runs the post-clustering job in the export mode:
run_postprocess.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --matchinput=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/match/dir/ --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --mode=CSV_OUTPUT
If you run the post-clustering job without the --outputpath parameter, you can find the CSV files in the following directory:

<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output-match

If you run the post-clustering job with the --outputpath parameter, you can find the CSV files in the following directory:

<Output Directory in HDFS>/ClusterPostProcessing/output-match
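The reducers write the CSV output as part files. To combine them into a single local file for review, you can use hdfs dfs -getmerge. A minimal sketch with placeholder paths; replace the placeholders with the directories from your run:

# Merge the CSV part files into one local file and preview the first rows (placeholder paths).
hdfs dfs -getmerge "<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output-match" ./linked_data.csv
head -n 10 ./linked_data.csv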
