User Guide

Running the Initial Clustering Job
Use the initial clustering job to read data from the input files and then index and link the input data in HDFS. You can also use the initial clustering job to incrementally update the indexed and linked data with additional data.
To run the initial clustering job, run the run_genclusters.sh script located in the following directory:
/usr/local/mdmbdrm-<Version Number>
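As a sketch, you can wrap the script invocation in a small helper that assembles the command line from variables, which keeps the required options in one place. All paths below are hypothetical, and you must substitute your installed version number in the script directory:

```shell
# Sketch: assemble the initial-mode command line for run_genclusters.sh
# from variables. Paths and the version number are hypothetical.
build_cmd() {
  mdm_home=$1; config=$2; input=$3; hdfsdir=$4; rule=$5
  printf '%s --config=%s --input=%s --hdfsdir=%s --rule=%s\n' \
    "$mdm_home/run_genclusters.sh" "$config" "$input" "$hdfsdir" "$rule"
}

# Substitute your installed version for <Version Number>.
build_cmd '/usr/local/mdmbdrm-<Version Number>' /usr/local/conf/config_big.xml \
  /usr/hdfs/Source10Million /usr/hdfs/workingdir /usr/local/conf/matching_rules.xml
```

This only prints the command; piping the result to `sh` or executing the script directly is equivalent.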

Initial Mode

Use the following command to run the run_genclusters.sh script in the initial mode:
run_genclusters.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--outputpath=directory_for_output_files] [--reducer=number_of_reducers] [--keeptemp=true|false] [--compression=true|false] [--probableMatchTopic=probable_match_topic_name] [--zookeeper=zookeeper_connection_string]
The following table describes the options and the arguments that you can specify to run the run_genclusters.sh script in the initial mode:
Option
Argument
Description
--config
configuration_file_name
Absolute path and file name of the configuration file that you create.
--input
input_file_in_HDFS
Absolute path to the input files in HDFS.
--reducer
number_of_reducers
Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1.
--hdfsdir
working_directory_in_HDFS
Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files.
--rule
matching_rules_file_name
Absolute path and file name of the matching rules file that you create.
--outputpath
directory_for_output_files
Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job.
By default, the batch job loads the output files to the working directory in HDFS.
--keeptemp
true|false
Optional. Indicates whether to retain the intermediate output tables that the initial clustering job creates. You can use the intermediate output tables for troubleshooting purposes.
Set to true to retain the intermediate output tables, and set to false to remove the intermediate output tables after the successful run of the job.
Default is false.
--compression
true|false
Optional. Indicates whether to compress the output files that the initial clustering job creates. You can compress the output files to avoid any storage issues.
Set to true to compress the output files, and set to false to retain the original size of the output files.
Default is false.
--probableMatchTopic
probable_match_topic_name
Optional. Name of the topic in Kafka to which you want to publish the probable output matches.
Applicable only when the Kafka parameters are configured in the configuration file.
--zookeeper
zookeeper_connection_string
Required if you use the --probableMatchTopic option. Connection string to access the ZooKeeper server.
Use the following format for the connection string:
<Host Name>:<Port>
The connection string uses the following parameters:
  • Host Name. Host name of the ZooKeeper server.
  • Port. Port on which the ZooKeeper server listens.
The following example connection string uses the default ZooKeeper root directory:
server1.domain.com:2182
If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas.
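For instance, an ensemble connection string is a comma-separated list of host:port pairs. The host names below are hypothetical:

```shell
# Sketch: a --zookeeper value for a hypothetical three-node ZooKeeper ensemble.
ZK_HOSTS="server1.domain.com:2182,server2.domain.com:2182,server3.domain.com:2182"
echo "--zookeeper=$ZK_HOSTS"
```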
For example, the following command runs the initial clustering job in the initial mode:
run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --reducer=16 --outputpath=/usr/hdfs/outputfolder --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --probableMatchTopic=doubtfulrecords --zookeeper=server1.domain.com:2182
If you run the initial clustering job without the --outputpath parameter, you can find the processed data in the following directory:
<Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir
Each job generates a unique ID, and you can identify the job ID based on the time stamp of the <Job ID> folder.
If you run the initial clustering job with the --outputpath parameter, you can find the processed data in the following directory:
<Output Directory in HDFS>/batch-cluster/output/dir
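The two locations above can be summarized in a small helper function. This is a sketch using the directory names from this section; the job ID is whatever the run generated:

```shell
# Sketch: resolve where the initial clustering job wrote its output,
# depending on whether --outputpath was supplied (empty third argument
# means the option was omitted).
resolve_output_dir() {
  workdir=$1; jobid=$2; outputpath=$3
  if [ -n "$outputpath" ]; then
    echo "$outputpath/batch-cluster/output/dir"
  else
    echo "$workdir/batch-cluster/$jobid/output/dir"
  fi
}

resolve_output_dir /usr/hdfs/workingdir MDMBDRM_931211654144593570 ""
resolve_output_dir /usr/hdfs/workingdir MDMBDRM_931211654144593570 /usr/hdfs/outputfolder
```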
Based on the parameters in the configuration file, the initial clustering job creates the following folders for the processed data:
  • pass-join. Contains the linked data.
  • probable-match-pairs. Contains the probable matching records. You get this folder only when you configure the lower threshold value in the configuration file.
The following sample output of an initial clustering job shows the cluster ID, the field values, the match rule name, and the metadata related to the cluster:
683e6e41-9174-4a22-b08e-d5d4adc9b2ee 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 07_UK_Rule1 00000001ZZB>$$$$01000004NAH-C$$$QVM$*K$-N?H-C$$-NAH$$$$-
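Because the cluster ID is the first whitespace-separated field of each output record, you can pull it out with a one-line awk filter. This is a sketch using a truncated copy of the sample line above:

```shell
# Sketch: extract the cluster ID (first field) from a clustering output record.
# The record below is a truncated copy of the sample output line.
record='683e6e41-9174-4a22-b08e-d5d4adc9b2ee 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 07_UK_Rule1'
cluster_id=$(printf '%s\n' "$record" | awk '{print $1}')
echo "$cluster_id"
```

Run over a whole output file, the same awk program yields one cluster ID per record.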

Incremental Mode

Use the following command to run the run_genclusters.sh script to incrementally update the indexed and linked data with additional data:
run_genclusters.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --incremental --clustereddirs=indexed_linked_data_directory [--reducer=number_of_reducers] [--outputpath=directory_for_output_files] [--consolidate] [--keeptemp=true|false] [--compression=true|false] [--probableMatchTopic=probable_match_topic_name] [--zookeeper=zookeeper_connection_string]
The following table describes the options and the arguments that you can specify to run the run_genclusters.sh script:
Option
Argument
Description
--config
configuration_file_name
Absolute path and file name of the configuration file that you create.
--input
input_file_in_HDFS
Absolute path to the input files in HDFS.
--reducer
number_of_reducers
Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1.
--hdfsdir
working_directory_in_HDFS
Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files.
--rule
matching_rules_file_name
Absolute path and file name of the matching rules file that you create.
--incremental
Runs the initial clustering job in the incremental mode.
If you want to incrementally update the indexed and linked data in HDFS, run the job in the incremental mode.
By default, the initial clustering job runs in the initial mode.
--clustereddirs
indexed_linked_data_directory
Absolute path to the directory that contains linked data.
If you run the initial clustering job without the --outputpath parameter, you can find the linked data in the following directory:
<Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/pass-join
If you run the initial clustering job with the --outputpath parameter, you can find the linked data in the following directory:
<Output Directory in HDFS>/batch-cluster/output/dir/pass-join
--consolidate
Consolidates the incremental data with the existing indexed and linked data in HDFS.
By default, the initial clustering job indexes and links only the incremental data.
When you set null_ind=2 and run the initial clustering job in the incremental mode, Informatica recommends that you specify the --consolidate option. The --consolidate option ensures that the initially linked data is updated with the changes from the incremental data.
--outputpath
directory_for_output_files
Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job.
By default, the batch job loads the output files to the working directory in HDFS.
--keeptemp
true|false
Optional. Indicates whether to retain the intermediate output tables that the initial clustering job creates. You can use the intermediate output tables for troubleshooting purposes.
Set to true to retain the intermediate output tables, and set to false to remove the intermediate output tables after the successful run of the job.
Default is false.
--compression
true|false
Optional. Indicates whether to compress the output files that the initial clustering job creates. You can compress the output files to avoid any storage issues.
Set to true to compress the output files, and set to false to retain the original size of the output files.
Default is false.
--probableMatchTopic
probable_match_topic_name
Optional. Name of the topic in Kafka to which you want to publish the probable output matches.
Applicable only when the Kafka parameters are configured in the configuration file.
--zookeeper
zookeeper_connection_string
Required if you use the --probableMatchTopic option. Connection string to access the ZooKeeper server.
Use the following format for the connection string:
<Host Name>:<Port>
The connection string uses the following parameters:
  • Host Name. Host name of the ZooKeeper server.
  • Port. Port on which the ZooKeeper server listens.
The following example connection string uses the default ZooKeeper root directory:
server1.domain.com:2182
If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas.
For example, the following command runs the initial clustering job in the incremental mode:
run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --reducer=16 --hdfsdir=/usr/hdfs/workingdir --outputpath=/usr/hdfs/outputfolder --rule=/usr/local/conf/matching_rules.xml --clustereddirs=/usr/hdfs/workingdir/batch-cluster/MDMBDRM_931211654144593570/output/dir/pass-join --probableMatchTopic=doubtfulrecords --zookeeper=server1.domain.com:2182 --incremental
If you run the initial clustering job in the incremental mode with the --consolidate option, you get the output files in the following directories:
  • <Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/pass-join. Contains the consolidated linked data. When you run the initial clustering job again in the incremental mode, use this directory path as the value for the --clustereddirs option.
  • <Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/incr-only-pass-join. Contains only the incremental linked data.
