Relate 360
- Relate 360 10.1
```
run_clusterload.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--reducer=number_of_reducer_jobs] [--hbaseregionsplitpath=output_directory_of_region_splitter_job] [--outputpath=directory_for_output_files]
```
| Option | Argument | Description |
|---|---|---|
| --config | configuration_file_name | Absolute path and file name of the configuration file that you create. |
| --input | input_file_in_HDFS | Absolute path to the directory that contains the tokenized data. If you run the HDFS tokenization job without the --outputpath parameter, you can find the tokenized data in `<Working Directory in HDFS>/batch-tokenize/<Job ID>/tokenize`. If you run the HDFS tokenization job with the --outputpath parameter, you can find the tokenized data in `<Output Directory in HDFS>/batch-tokenize/tokenize`. |
| --reducer | number_of_reducer_jobs | Optional. Number of reducer jobs that you want to run to perform load clustering. Default is 1. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The load clustering process uses the working directory to store the library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file that you create. The values in the matching rules file override the values in the configuration file. |
| --hbaseregionsplitpath | output_directory_of_region_splitter_job | Optional. Absolute path to the output files of the region splitter job. If you run the region splitter job without the --outputpath parameter, you can find the output files in `<Working Directory in HDFS>/MDMBDRMRegionAnalysis/<Job ID>`. If you run the region splitter job with the --outputpath parameter, you can find the output files in `<Output Directory in HDFS>/MDMBDRMRegionAnalysis`. By default, the load clustering job randomly distributes the tokenized data, which might result in inconsistent distribution of data across the regions. |
| --outputpath | directory_for_output_files | Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and then rerun the job. By default, the batch job loads the output files to the working directory in HDFS. |
The following sample command runs the load clustering job with 16 reducer jobs:

```
run_clusterload.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/workingdir/batch-tokenize/MDMBDE0063_1602999447744334391/tokenize --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --hbaseregionsplitpath=/usr/hdfs/workingdir/MDMBDRMRegionAnalysis/MDMBDRM0063_8862443019807752334
```
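Because the --input and --hbaseregionsplitpath values embed job IDs under the working directory, it can help to derive them from variables instead of typing the full paths each time. The following is a minimal sketch, not part of the product; the paths and job IDs are illustrative placeholders that mirror the sample command, and the default output layouts are the ones described in the option table:

```shell
#!/bin/sh
# Sketch: assemble the load clustering command from variables.
# All values below are placeholders, not real environment settings.
CONF=/usr/local/conf/config_big.xml
RULES=/usr/local/conf/matching_rules.xml
WORKDIR=/usr/hdfs/workingdir
TOKENIZE_JOB_ID=MDMBDE0063_1602999447744334391      # HDFS tokenization job ID (placeholder)
SPLIT_JOB_ID=MDMBDRM0063_8862443019807752334        # region splitter job ID (placeholder)

# Default tokenize output layout: <working dir>/batch-tokenize/<job ID>/tokenize
INPUT=$WORKDIR/batch-tokenize/$TOKENIZE_JOB_ID/tokenize
# Default region splitter output layout: <working dir>/MDMBDRMRegionAnalysis/<job ID>
SPLITS=$WORKDIR/MDMBDRMRegionAnalysis/$SPLIT_JOB_ID

CMD="run_clusterload.sh --config=$CONF --input=$INPUT --reducer=16 --hdfsdir=$WORKDIR --rule=$RULES --hbaseregionsplitpath=$SPLITS"
echo "$CMD"
```

Echoing the assembled command before running it makes it easy to confirm the derived paths against the directories that actually exist in HDFS.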
```
ROW                               COLUMN+CELL
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:CLUSTERNUMBER, timestamp=1454327424272, value=5fb1e9b8-1b51-47a3-bd45-d885bdc6bbcf
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:LMT_MATCHED_PK, timestamp=1454327424272, value=
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:LMT_MATCHED_RECORD_SOURCE, timestamp=1454327424272, value=
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:LMT_MATCHED_SCORE, timestamp=1454327424272, value=0
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:LMT_SOURCE_NAME, timestamp=1454327424272, value=SALESFORCE
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:NAME, timestamp=1454327424272, value=Abbott Laboratories
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:ROWID, timestamp=1454327424272, value=0000000066
```
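In the scan output, every cell for one clustered record shares the same row key, and the qualifiers under the aml_link_columns family carry the cluster assignment. The following sketch (not part of the product) folds cells like the ones above into one record per row key using awk, so the CLUSTERNUMBER for a record can be read back; the input lines here are copied from the sample output:

```shell
#!/bin/sh
# Sketch: group HBase scan cells by row key and extract one qualifier's value.
# The cell lines mirror the sample scan output; the parsing is illustrative.
CELLS=' 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:CLUSTERNUMBER, timestamp=1454327424272, value=5fb1e9b8-1b51-47a3-bd45-d885bdc6bbcf
 00KCKSHX$$ SALESFORCE0000000066  column=aml_link_columns:NAME, timestamp=1454327424272, value=Abbott Laboratories'

# For each cell line: the text after "value=" is the cell value, and the
# qualifier is the part of "column=family:qualifier" after the colon.
CLUSTER=$(printf '%s\n' "$CELLS" | awk -F'value=' '/CLUSTERNUMBER/ { print $2 }')
NAME=$(printf '%s\n' "$CELLS" | awk -F'value=' '/:NAME,/ { print $2 }')
echo "cluster=$CLUSTER name=$NAME"
```

Records that belong to the same cluster share the same CLUSTERNUMBER value, so extracting that qualifier is the usual way to check which cluster a row was assigned to.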