Table of Contents

  1. Preface
  2. Introduction to Informatica MDM - Relate 360
  3. Linking Batch Data
  4. Tokenizing Batch Data
  5. Processing Streaming Data
  6. Creating Relationship Graph
  7. Loading Linked and Consolidated Data into Hive
  8. Searching Data
  9. Monitoring the Batch Jobs
  10. Troubleshooting
  11. Glossary

User Guide

Running the Load Clustering Job

The load clustering job loads the tokenized data from HDFS into the repository.
To run the load clustering job, run the run_clusterload.sh script located in the following directory:
/usr/local/mdmbdrm-<Version Number>
Use the following command to run the run_clusterload.sh script:
run_clusterload.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--reducer=number_of_reducer_jobs] [--hbaseregionsplitpath=output_directory_of_region_splitter_job] [--outputpath=directory_for_output_files]
The following list describes the options and arguments that you can specify to run the run_clusterload.sh script:

--config configuration_file_name
Absolute path and file name of the configuration file that you create.

--input input_file_in_HDFS
Absolute path to the directory that contains the tokenized data. To confirm that the directory exists before you start the job, see the sketch that follows this list.
If you run the HDFS tokenization job without the --outputpath parameter, you can find the tokenized data in the following directory:
<Working Directory in HDFS>/batch-tokenize/<Job ID>/tokenize
If you run the HDFS tokenization job with the --outputpath parameter, you can find the tokenized data in the following directory:
<Output Directory in HDFS>/batch-tokenize/tokenize

--reducer number_of_reducer_jobs
Optional. Number of reducer jobs that you want to run to perform load clustering. Default is 1.

--hdfsdir working_directory_in_HDFS
Absolute path to a working directory in HDFS. The load clustering process uses the working directory to store the library files.

--rule matching_rules_file_name
Absolute path and file name of the matching rules file that you create. The values in the matching rules file override the values in the configuration file.

--hbaseregionsplitpath output_directory_of_region_splitter_job
Optional. Absolute path to the output files of the region splitter job.
If you run the region splitter job without the --outputpath parameter, you can find the output files in the following directory:
<Working Directory in HDFS>/MDMBDRMRegionAnalysis/<Job ID>
If you run the region splitter job with the --outputpath parameter, you can find the output files in the following directory:
<Output Directory in HDFS>/MDMBDRMRegionAnalysis
By default, the load clustering job randomly distributes the tokenized data, which might result in an inconsistent distribution of data across the regions.

--outputpath directory_for_output_files
Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job.
By default, the batch job loads the output files to the working directory in HDFS.
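
Because the tokenized data can sit in either of the two directories described for the --input option, it can help to confirm the path before you start the job. The following is a minimal sketch, assuming the Hadoop client (hdfs command) is available on the machine and using the tokenized-data directory from the example below; substitute the directory that your tokenization job produced.

#!/bin/bash
# Sketch: confirm that the tokenized data exists in HDFS before you run
# run_clusterload.sh. The path below comes from the example in this topic;
# replace it with the output directory of your HDFS tokenization job.
TOKENIZED_DIR=/usr/hdfs/workingdir/batch-tokenize/MDMBDE0063_1602999447744334391/tokenize

if hdfs dfs -test -d "$TOKENIZED_DIR"; then
    hdfs dfs -ls "$TOKENIZED_DIR"
else
    echo "No tokenized data found at $TOKENIZED_DIR. Run the tokenization job first." >&2
    exit 1
fi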
For example, the following command runs the load clustering job:
run_clusterload.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/workingdir/batch-tokenize/MDMBDE0063_1602999447744334391/tokenize --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --hbaseregionsplitpath=/usr/hdfs/workingdir/MDMBDRMRegionAnalysis/MDMBDRM0063_8862443019807752334
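If you run the load clustering job as part of a larger pipeline, you can check the exit status of run_clusterload.sh before you continue. The following is a minimal sketch that reuses the arguments from the example above; it assumes that the script reports failure through a nonzero exit status, so verify that behavior in your environment.

#!/bin/bash
# Sketch: run the load clustering job and stop the pipeline if it fails.
# Run this from the installation directory, /usr/local/mdmbdrm-<Version Number>.
# The arguments are copied from the example above; adjust them for your environment.
./run_clusterload.sh \
  --config=/usr/local/conf/config_big.xml \
  --input=/usr/hdfs/workingdir/batch-tokenize/MDMBDE0063_1602999447744334391/tokenize \
  --reducer=16 \
  --hdfsdir=/usr/hdfs/workingdir \
  --rule=/usr/local/conf/matching_rules.xml \
  --hbaseregionsplitpath=/usr/hdfs/workingdir/MDMBDRMRegionAnalysis/MDMBDRM0063_8862443019807752334
status=$?

if [ "$status" -ne 0 ]; then
    echo "Load clustering job failed with exit status $status." >&2
    exit "$status"
fi
echo "Load clustering job completed."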
The following sample output of the load clustering job shows the index ID, fuzzy keys, and the values from the column family:
ROW                              COLUMN+CELL
 00KCKSHX$$ SALESFORCE0000000066 column=aml_link_columns:CLUSTERNUMBER, timestamp=1454327424272, value=5fb1e9b8-1b51-47a3-bd45-d885bdc6bbcf
 00KCKSHX$$ SALESFORCE0000000066 column=aml_link_columns:LMT_MATCHED_PK, timestamp=1454327424272, value=
 00KCKSHX$$ SALESFORCE0000000066 column=aml_link_columns:LMT_MATCHED_RECORD_SOURCE, timestamp=1454327424272, value=
 00KCKSHX$$ SALESFORCE0000000066 column=aml_link_columns:LMT_MATCHED_SCORE, timestamp=1454327424272, value=0
 00KCKSHX$$ SALESFORCE0000000066 column=aml_link_columns:LMT_SOURCE_NAME, timestamp=1454327424272, value= SALESFORCE
 00KCKSHX$$ SALESFORCE0000000066 column=aml_link_columns:NAME, timestamp=1454327424272, value=Abbott Laboratories
 00KCKSHX$$ SALESFORCE0000000066 column=aml_link_columns:ROWID, timestamp=1454327424272, value=0000000066
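
This output has the format that the HBase shell prints for a scan. If you want to spot-check the loaded data yourself, a scan limited to a few rows is usually enough. The following is a sketch only: the table name is a placeholder because the actual name depends on your configuration, while the aml_link_columns column family is the one shown in the sample output above.

# Sketch: spot-check the loaded clustering data from the HBase shell.
# Replace <link_table_name> with the HBase table that your configuration defines.
echo "scan '<link_table_name>', {COLUMNS => 'aml_link_columns', LIMIT => 5}" | hbase shell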
