To process high-volume clusters, run the post-clustering job with the following command syntax:

```sh
run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --maxcluster=maximum_number_of_records --mode=SKIP_LARGE_CLUSTER|RECLUSTER_TRANSITIVE --threshold=matching_score_to_recluster [--reducer=number_of_reducer_jobs]
```
| Option | Argument | Description |
| --- | --- | --- |
| --config | configuration_file_name | Absolute path and file name of the configuration file. |
| --input | input_file_in_HDFS | Absolute path to the directory that contains the linked data. If you run the initial clustering job without the --outputpath parameter, you can find the processed data in the following directory: `<Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir`. If you run the initial clustering job with the --outputpath parameter, you can find the processed data in the following directory: `<Output Directory in HDFS>/batch-cluster/output/dir`. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file. |
| --maxcluster | maximum_number_of_records | Maximum number of records in a cluster. If the number of records in a cluster exceeds the specified maximum, the cluster becomes a high-volume cluster, and the post-clustering job processes it. |
| --mode | SKIP_LARGE_CLUSTER\|RECLUSTER_TRANSITIVE | Mode in which the job processes the input data. Use SKIP_LARGE_CLUSTER to skip the high-volume clusters, or RECLUSTER_TRANSITIVE to recluster the records in the high-volume clusters based on the matching rules and the threshold value. |
| --threshold | matching_score_to_recluster | Minimum score required for the records to be part of the same cluster when you run the post-clustering job in the RECLUSTER_TRANSITIVE mode. The job creates a separate cluster for each record whose score is less than the threshold value. |
| --reducer | number_of_reducer_jobs | Optional. Number of reducer jobs that you want to run. Default is 1. |

For example:
```sh
run_postprocess.sh --config=/usr/local/conf/config_big.xml \
    --input=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/output/dir/pass-join \
    --hdfsdir=/usr/hdfs/workingdir \
    --rule=/usr/local/conf/matching_rules.xml \
    --maxcluster=1200 \
    --threshold=90 \
    --mode=SKIP_LARGE_CLUSTER
```
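The same syntax also accepts the RECLUSTER_TRANSITIVE mode. The following sketch is illustrative rather than documented: it reuses the paths and job ID from the example above, and the reducer count is an assumption; substitute values for your environment.

```sh
# Illustrative sketch only: recluster the records in the high-volume clusters
# instead of skipping them. The paths, the job ID, and the reducer count are
# assumptions carried over from the example above, not documented values.
run_postprocess.sh --config=/usr/local/conf/config_big.xml \
    --input=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/output/dir/pass-join \
    --hdfsdir=/usr/hdfs/workingdir \
    --rule=/usr/local/conf/matching_rules.xml \
    --maxcluster=1200 \
    --threshold=90 \
    --mode=RECLUSTER_TRANSITIVE \
    --reducer=4
```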
To process poor quality data, run the post-clustering job with the following command syntax:

```sh
run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --mode=LONGTAIL_CLUSTERS [--reducer=number_of_reducer_jobs]
```
| Option | Argument | Description |
| --- | --- | --- |
| --config | configuration_file_name | Absolute path and file name of the configuration file. |
| --input | input_file_in_HDFS | Absolute path to the directory that contains the poor quality data. If you run the initial clustering job without the --outputpath parameter, you can find the poor quality data in the following directory: `<Working Directory in HDFS>/batch-cluster/<Job ID>/poorqualitydata`. If you run the initial clustering job with the --outputpath parameter, you can find the poor quality data in the following directory: `<Output Directory in HDFS>/batch-cluster/poorqualitydata`. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file. |
| --mode | LONGTAIL_CLUSTERS | Indicates that the job processes the poor quality data. |
| --reducer | number_of_reducer_jobs | Optional. Number of reducer jobs that you want to run. Default is 1. |

For example:
```sh
run_postprocess.sh --config=/usr/local/conf/config_big.xml \
    --input=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/poorqualitydata \
    --hdfsdir=/usr/hdfs/workingdir \
    --rule=/usr/local/conf/matching_rules.xml \
    --mode=LONGTAIL_CLUSTERS
```
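Because the poorqualitydata directory exists only if the initial clustering job produced poor quality records, it can help to verify the input path before launching the job. A minimal pre-flight sketch, assuming the standard `hdfs dfs` client and the example paths above:

```sh
# Minimal pre-flight sketch: run the long-tail job only if the poor quality
# data directory exists. POORQ is an assumed path built from the example above;
# replace the job ID with the one that your initial clustering job reported.
POORQ=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/poorqualitydata
if hdfs dfs -test -e "$POORQ"; then
    run_postprocess.sh --config=/usr/local/conf/config_big.xml \
        --input="$POORQ" \
        --hdfsdir=/usr/hdfs/workingdir \
        --rule=/usr/local/conf/matching_rules.xml \
        --mode=LONGTAIL_CLUSTERS
else
    echo "No poor quality data found at $POORQ; nothing to process." >&2
fi
```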
To export the input data in CSV format, run the post-clustering job with the following command syntax:

```sh
run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --matchinput=match_output_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --mode=CSV_OUTPUT [--reducer=number_of_reducer_jobs]
```
| Option | Argument | Description |
| --- | --- | --- |
| --config | configuration_file_name | Absolute path and file name of the configuration file. |
| --input | input_file_in_HDFS | Absolute path to the directory that contains the input files of the initial clustering job. |
| --matchinput | match_output_in_HDFS | Absolute path to the directory that contains the linked data. If you run the initial clustering job without the --outputpath parameter, you can find the linked data in the following directory: `<Working Directory in HDFS>/batch-cluster/<Job ID>/match/dir`. If you run the initial clustering job with the --outputpath parameter, you can find the linked data in the following directory: `<Output Directory in HDFS>/batch-cluster/match/dir`. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file. |
| --mode | CSV_OUTPUT | Indicates that the job exports the input data in CSV format. |
| --reducer | number_of_reducer_jobs | Optional. Number of reducer jobs that you want to run. Default is 1. |

For example:
```sh
run_postprocess.sh --config=/usr/local/conf/config_big.xml \
    --input=/usr/hdfs/Source10Million \
    --matchinput=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/match/dir/ \
    --hdfsdir=/usr/hdfs/workingdir \
    --rule=/usr/local/conf/matching_rules.xml \
    --mode=CSV_OUTPUT
```
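The table above does not state where the CSV export is written. A generic way to locate it, sketched under the assumption that the job writes into the HDFS working directory, is to list that directory after a successful run:

```sh
# Illustrative sketch: run the CSV export and, on success, inspect the working
# directory to find the generated files. The recursive listing is a discovery
# step, not a documented output path.
if run_postprocess.sh --config=/usr/local/conf/config_big.xml \
    --input=/usr/hdfs/Source10Million \
    --matchinput=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/match/dir/ \
    --hdfsdir=/usr/hdfs/workingdir \
    --rule=/usr/local/conf/matching_rules.xml \
    --mode=CSV_OUTPUT; then
    hdfs dfs -ls -R /usr/hdfs/workingdir | tail -n 50
fi
```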