Relate 360
- Relate 360 10.1
run_genclusters.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--outputpath=directory_for_output_files] [--reducer=number_of_reducers] [--keeptemp=true|false] [--compression=true|false] [--probableMatchTopic=probable_match_topic_name] [--zookeeper=zookeeper_connection_string]
| Option | Argument | Description |
|---|---|---|
| --config | configuration_file_name | Absolute path and file name of the configuration file that you create. |
| --input | input_file_in_HDFS | Absolute path to the input files in HDFS. |
| --reducer | number_of_reducers | Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file that you create. |
| --outputpath | directory_for_output_files | Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job. By default, the batch job loads the output files to the working directory in HDFS. |
| --keeptemp | true\|false | Optional. Indicates whether to retain the intermediate output tables that the initial clustering job creates. You can use the intermediate output tables for troubleshooting purposes. Set to true to retain the intermediate output tables, and set to false to remove them after the job runs successfully. Default is false. |
| --compression | true\|false | Optional. Indicates whether to compress the output files that the initial clustering job creates. You can compress the output files to reduce the storage that they use. Set to true to compress the output files, and set to false to retain the original size of the output files. Default is false. |
| --probableMatchTopic | probable_match_topic_name | Optional. Name of the topic in Kafka to which you want to publish the probable output matches. Applicable only when the Kafka parameters are configured in the configuration file. |
| --zookeeper | zookeeper_connection_string | Required if you use the --probableMatchTopic option. Connection string to access the ZooKeeper server. Use the format `<Host Name>:<Port>` for the connection string. For example, the connection string server1.domain.com:2182 uses the default ZooKeeper root directory. If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas. |
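
The dependency between --probableMatchTopic and --zookeeper, and the comma-separated form for a ZooKeeper ensemble, can be checked before you submit the job. The following is a minimal sketch, not part of Relate 360: the helper names `zk_connect_string` and `check_probable_match_opts` are hypothetical, and the check only inspects the arguments that you pass to it.

```shell
# Hypothetical helpers (not shipped with Relate 360).

# Join one or more <Host Name>:<Port> pairs with commas,
# which is the form described above for a ZooKeeper ensemble.
zk_connect_string() {
  local IFS=,
  printf '%s\n' "$*"
}

# Return 0 when the options are consistent and 1 when
# --probableMatchTopic is present without --zookeeper.
check_probable_match_opts() {
  local has_topic=0 has_zk=0 arg
  for arg in "$@"; do
    case "$arg" in
      --probableMatchTopic=?*) has_topic=1 ;;
      --zookeeper=?*)          has_zk=1 ;;
    esac
  done
  if [ "$has_topic" -eq 1 ] && [ "$has_zk" -eq 0 ]; then
    echo "error: --probableMatchTopic requires --zookeeper" >&2
    return 1
  fi
  return 0
}
```

For example, `zk_connect_string server1.domain.com:2182 server2.domain.com:2182` prints a value that you could pass to --zookeeper, where server2.domain.com is an assumed second host.
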
run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --reducer=16 --outputpath=/usr/hdfs/outputfolder --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --probableMatchTopic=doubtfulrecords --zookeeper=server1.domain.com:2182
683e6e41-9174-4a22-b08e-d5d4adc9b2ee 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 07_UK_Rule1 00000001ZZB>$$$$01000004NAH-C$$$QVM$*K$-N?H-C$$-NAH$$$$-
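
Because the --outputpath description asks for a different directory on each rerun, one way to follow it is to derive a new directory name per run. The following is only a sketch under assumptions: `next_outputpath` is a hypothetical helper, the `run_<label>` naming is an invented convention, and this page does not state whether the job creates a missing directory, so you might need to create it first.

```shell
# Hypothetical helper: build a distinct --outputpath value for each run.
# $1 = base directory in HDFS
# $2 = optional run label (defaults to the current UTC timestamp)
next_outputpath() {
  local base="${1%/}"   # drop one trailing slash, if present
  local label="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf '%s/run_%s\n' "$base" "$label"
}

# Usage (not executed here):
#   run_genclusters.sh ... --outputpath="$(next_outputpath /usr/hdfs/outputfolder)"
```
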
run_genclusters.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --incremental --clustereddirs=indexed_linked_data_directory [--reducer=number_of_reducers] [--outputpath=directory_for_output_files] [--consolidate] [--keeptemp=true|false] [--compression=true|false] [--probableMatchTopic=probable_match_topic_name] [--zookeeper=zookeeper_connection_string]
| Option | Argument | Description |
|---|---|---|
| --config | configuration_file_name | Absolute path and file name of the configuration file that you create. |
| --input | input_file_in_HDFS | Absolute path to the input files in HDFS. |
| --reducer | number_of_reducers | Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file that you create. |
| --incremental | | Runs the initial clustering job in the incremental mode. If you want to incrementally update the indexed and linked data in HDFS, run the job in the incremental mode. By default, the initial clustering job runs in the initial mode. |
| --clustereddirs | indexed_linked_data_directory | Absolute path to the directory that contains linked data. If you ran the initial clustering job without the --outputpath parameter, the linked data is in the following directory: `<Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/pass-join`. If you ran the initial clustering job with the --outputpath parameter, the linked data is in the following directory: `<Output Directory in HDFS>/batch-cluster/output/dir/pass-join`. |
| --consolidate | | Consolidates the incremental data with the existing indexed and linked data in HDFS. By default, the initial clustering job indexes and links only the incremental data. When you set null_ind=2 and run the initial clustering job in the incremental mode, Informatica recommends that you specify the --consolidate option. The --consolidate option ensures that the initially linked data is updated with the changes from the incremental data. |
| --outputpath | directory_for_output_files | Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job. By default, the batch job loads the output files to the working directory in HDFS. |
| --keeptemp | true\|false | Optional. Indicates whether to retain the intermediate output tables that the initial clustering job creates. You can use the intermediate output tables for troubleshooting purposes. Set to true to retain the intermediate output tables, and set to false to remove them after the job runs successfully. Default is false. |
| --compression | true\|false | Optional. Indicates whether to compress the output files that the initial clustering job creates. You can compress the output files to reduce the storage that they use. Set to true to compress the output files, and set to false to retain the original size of the output files. Default is false. |
| --probableMatchTopic | probable_match_topic_name | Optional. Name of the topic in Kafka to which you want to publish the probable output matches. Applicable only when the Kafka parameters are configured in the configuration file. |
| --zookeeper | zookeeper_connection_string | Required if you use the --probableMatchTopic option. Connection string to access the ZooKeeper server. Use the format `<Host Name>:<Port>` for the connection string. For example, the connection string server1.domain.com:2182 uses the default ZooKeeper root directory. If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas. |
run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --reducer=16 --hdfsdir=/usr/hdfs/workingdir --outputpath=/usr/hdfs/outputfolder --rule=/usr/local/conf/matching_rules.xml --clustereddirs=/usr/hdfs/workingdir/batch-cluster/MDMBDRM_931211654144593570/output/dir/pass-join --probableMatchTopic=doubtfulrecords --zookeeper=server1.domain.com:2182 --incremental
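
The value for --clustereddirs follows one of the two directory patterns given in the --clustereddirs description, depending on whether the earlier run used --outputpath. The sketch below only assembles those paths as strings. The function names are hypothetical, and you still need to look up the job ID of the earlier run yourself.

```shell
# Hypothetical helpers that build the linked data directory for --clustereddirs.

# Earlier run WITHOUT --outputpath:
#   <Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/pass-join
# $1 = working directory in HDFS, $2 = job ID of the earlier run
linked_dir_from_workdir() {
  printf '%s/batch-cluster/%s/output/dir/pass-join\n' "${1%/}" "$2"
}

# Earlier run WITH --outputpath:
#   <Output Directory in HDFS>/batch-cluster/output/dir/pass-join
# $1 = output directory in HDFS
linked_dir_from_outputpath() {
  printf '%s/batch-cluster/output/dir/pass-join\n' "${1%/}"
}
```

With the working directory and job ID from the sample command above, `linked_dir_from_workdir /usr/hdfs/workingdir MDMBDRM_931211654144593570` reproduces the --clustereddirs value used there.
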