User Guide

Running the Post-Clustering Job

The post-clustering job uses the output files of an initial clustering job, so ensure that you run the initial clustering job before you run the post-clustering job.

To run the post-clustering job, run the run_postprocess.sh script located in the following directory:

/usr/local/mdmbdrm-<Version Number>
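Before you launch the post-clustering job, you can verify that the initial clustering output exists in HDFS. The following sketch is a minimal example, not part of the product; the working directory and job ID are placeholders that you must replace with the values from your environment:

#!/bin/bash
# Placeholder values; substitute your own working directory and job ID.
WORKDIR=/usr/hdfs/workingdir
JOB_ID=MDMBDE0063_1602999447744334391

# hdfs dfs -test -d returns 0 if the directory exists in HDFS.
if hdfs dfs -test -d "${WORKDIR}/batch-cluster/${JOB_ID}/output/dir"; then
  echo "Initial clustering output found. You can run run_postprocess.sh."
else
  echo "Initial clustering output not found. Run the initial clustering job first." >&2
  exit 1
fi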

Skip and Recluster Modes

To run the run_postprocess.sh script in the skip or recluster mode, use the following command format:

run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --maxcluster=maximum_number_of_records --mode=SKIP_LARGE_CLUSTER|RECLUSTER_TRANSITIVE --threshold=matching_score_to_recluster [--reducer=number_of_reducer_jobs]
The following list describes the options and arguments that you can specify to run the run_postprocess.sh script:

--config=configuration_file_name
  Absolute path and file name of the configuration file.

--input=input_file_in_HDFS
  Absolute path to the directory that contains the linked data.
  If you run the initial clustering job without the --outputpath parameter, you can find the processed data in the following directory:
  <Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir
  If you run the initial clustering job with the --outputpath parameter, you can find the processed data in the following directory:
  <Output Directory in HDFS>/batch-cluster/output/dir

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file.

--maxcluster=maximum_number_of_records
  Maximum number of records in a cluster. If the number of records in a cluster exceeds this value, the cluster becomes a high-volume cluster, and the post-clustering job processes it.

--mode=SKIP_LARGE_CLUSTER|RECLUSTER_TRANSITIVE
  Mode in which the job processes the input data. Use one of the following modes:
  • SKIP_LARGE_CLUSTER. Skips all the records in the high-volume clusters.
  • RECLUSTER_TRANSITIVE. Re-links all the records in the high-volume clusters.

--threshold=matching_score_to_recluster
  Minimum score required for records to remain in the same cluster when you run the post-clustering job in the recluster mode. The job creates a separate cluster for each record whose score is less than the threshold value. For example, with --threshold=90, a record that matches its cluster at a score of 85 is moved into its own cluster.

--reducer=number_of_reducer_jobs
  Optional. Number of reducer jobs that you want to run. Default is 1.
For example, the following command runs the post-clustering job in the skip mode:
run_postprocess.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/output/dir/pass-join --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --maxcluster=1200 --threshold=90 --mode=SKIP_LARGE_CLUSTER
If you run the post-clustering job without the --outputpath parameter, you can find the processed data in the following directory:

<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output

If you run the post-clustering job with the --outputpath parameter, you can find the processed data in the following directory:

<Output Directory in HDFS>/ClusterPostProcessing/output
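After the job completes, you can list the processed output directly in HDFS. A minimal sketch with placeholder paths; replace <Working Directory in HDFS> and <Job ID> with the values from your run:

# List the post-clustering output (placeholder paths).
hdfs dfs -ls "<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output"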

Longtail Mode

The initial clustering job identifies poor-quality data in the input data and loads it into a directory in an encrypted format. The post-clustering job in the longtail mode decrypts the poor-quality data back to the original input format.

To run the run_postprocess.sh script in the longtail mode, use the following command format:

run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --mode=LONGTAIL_CLUSTERS [--reducer=number_of_reducer_jobs]
The following list describes the options and arguments that you can specify to run the run_postprocess.sh script:

--config=configuration_file_name
  Absolute path and file name of the configuration file.

--input=input_file_in_HDFS
  Absolute path to the directory that contains the poor-quality data.
  If you run the initial clustering job without the --outputpath parameter, you can find the poor-quality data in the following directory:
  <Working Directory in HDFS>/batch-cluster/<Job ID>/poorqualitydata
  If you run the initial clustering job with the --outputpath parameter, you can find the poor-quality data in the following directory:
  <Output Directory in HDFS>/batch-cluster/poorqualitydata

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file.

--mode=LONGTAIL_CLUSTERS
  Indicates that the job processes the poor-quality data.

--reducer=number_of_reducer_jobs
  Optional. Number of reducer jobs that you want to run. Default is 1.
For example, the following command runs the post-clustering job in the longtail mode:
run_postprocess.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/poorqualitydata --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --mode=LONGTAIL_CLUSTERS
If you run the post-clustering job without the --outputpath parameter, you can find the decrypted data in the following directory:

<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output

If you run the post-clustering job with the --outputpath parameter, you can find the decrypted data in the following directory:

<Output Directory in HDFS>/ClusterPostProcessing/output
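To inspect the decrypted records locally, you can copy them out of HDFS. A minimal sketch with placeholder paths; replace the placeholders with the directories from your run:

# Copy the decrypted poor-quality data to the local file system (placeholder paths).
hdfs dfs -get "<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output" ./longtail-output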

Export Mode

To run the run_postprocess.sh script in the export mode, use the following command format:

run_postprocess.sh --config=configuration_file_name --input=input_file_in_HDFS --matchinput=match_output_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --mode=CSV_OUTPUT [--reducer=number_of_reducer_jobs]
The following list describes the options and arguments that you can specify to run the run_postprocess.sh script:

--config=configuration_file_name
  Absolute path and file name of the configuration file.

--input=input_file_in_HDFS
  Absolute path to the directory that contains the input files of the initial clustering job.

--matchinput=match_output_in_HDFS
  Absolute path to the directory that contains the linked data.
  If you run the initial clustering job without the --outputpath parameter, you can find the linked data in the following directory:
  <Working Directory in HDFS>/batch-cluster/<Job ID>/match/dir
  If you run the initial clustering job with the --outputpath parameter, you can find the linked data in the following directory:
  <Output Directory in HDFS>/batch-cluster/match/dir

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The post-clustering job uses the working directory to store the library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file.

--mode=CSV_OUTPUT
  Exports the input data in CSV format.

--reducer=number_of_reducer_jobs
  Optional. Number of reducer jobs that you want to run. Default is 1.
For example, the following command runs the post-clustering job in the export mode:
run_postprocess.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --matchinput=/usr/hdfs/workingdir/batch-cluster/MDMBDE0063_1602999447744334391/match/dir/ --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --mode=CSV_OUTPUT
If you run the post-clustering job without the --outputpath parameter, you can find the CSV files in the following directory:

<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output-match

If you run the post-clustering job with the --outputpath parameter, you can find the CSV files in the following directory:

<Output Directory in HDFS>/ClusterPostProcessing/output-match
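The reducers write the CSV output as part files. To combine them into a single local file for review, you can use hdfs dfs -getmerge. A minimal sketch with placeholder paths; replace the placeholders with the directories from your run:

# Merge the CSV part files into one local file and preview the first rows (placeholder paths).
hdfs dfs -getmerge "<Working Directory in HDFS>/ClusterPostProcessing/<Job ID>/output-match" ./linked_data.csv
head -n 10 ./linked_data.csv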
