User Guide

Running the Initial Clustering Job
Use the initial clustering job to read data from the input files and then index and link the input data in HDFS. You can also use the initial clustering job to incrementally update the indexed and linked data with additional data.
To run the initial clustering job, run the run_genclusters.sh script located in the following directory:
/usr/local/mdmbdrm-<Version Number>
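As a sketch, you can wrap the script invocation in a small helper that assembles the command line from variables, which keeps the required options in one place. All paths below are hypothetical, and you must substitute your installed version number in the script directory:

```shell
# Sketch: assemble the initial-mode command line for run_genclusters.sh
# from variables. Paths and the version number are hypothetical.
build_cmd() {
  mdm_home=$1; config=$2; input=$3; hdfsdir=$4; rule=$5
  printf '%s --config=%s --input=%s --hdfsdir=%s --rule=%s\n' \
    "$mdm_home/run_genclusters.sh" "$config" "$input" "$hdfsdir" "$rule"
}

# Substitute your installed version for <Version Number>.
build_cmd '/usr/local/mdmbdrm-<Version Number>' /usr/local/conf/config_big.xml \
  /usr/hdfs/Source10Million /usr/hdfs/workingdir /usr/local/conf/matching_rules.xml
```

This only prints the command; piping the result to `sh` or executing the script directly is equivalent.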

Initial Mode

Use the following command to run the run_genclusters.sh script in the initial mode:
run_genclusters.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--outputpath=directory_for_output_files] [--reducer=number_of_reducers] [--keeptemp=true|false] [--compression=true|false] [--probableMatchTopic=probable_match_topic_name] [--zookeeper=zookeeper_connection_string]
The following table describes the options and the arguments that you can specify to run the run_genclusters.sh script in the initial mode:
Option
Argument
Description
--config
configuration_file_name
Absolute path and file name of the configuration file that you create.
--input
input_file_in_HDFS
Absolute path to the input files in HDFS.
--reducer
number_of_reducers
Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1.
--hdfsdir
working_directory_in_HDFS
Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files.
--rule
matching_rules_file_name
Absolute path and file name of the matching rules file that you create.
--outputpath
directory_for_output_files
Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job.
By default, the batch job loads the output files to the working directory in HDFS.
--keeptemp
true|false
Optional. Indicates whether to retain the intermediate output tables that the initial clustering job creates. You can use the intermediate output tables for troubleshooting purposes.
Set to true to retain the intermediate output tables, and set to false to remove the intermediate output tables after the successful run of the job.
Default is false.
--compression
true|false
Optional. Indicates whether to compress the output files that the initial clustering job creates. You can compress the output files to avoid any storage issues.
Set to true to compress the output files, and set to false to retain the original size of the output files.
Default is false.
--probableMatchTopic
probable_match_topic_name
Optional. Name of the topic in Kafka to which you want to publish the probable output matches.
Applicable only when the Kafka parameters are configured in the configuration file.
--zookeeper
zookeeper_connection_string
Required if you use the --probableMatchTopic option. Connection string to access the ZooKeeper server.
Use the following format for the connection string:
<Host Name>:<Port>
The connection string uses the following parameters:
  • Host Name. Host name of the ZooKeeper server.
  • Port. Port on which the ZooKeeper server listens.
The following example connection string uses the default ZooKeeper root directory:
server1.domain.com:2182
If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas.
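For instance, an ensemble connection string is a comma-separated list of host:port pairs. The host names below are hypothetical:

```shell
# Sketch: a --zookeeper value for a hypothetical three-node ZooKeeper ensemble.
ZK_HOSTS="server1.domain.com:2182,server2.domain.com:2182,server3.domain.com:2182"
echo "--zookeeper=$ZK_HOSTS"
```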
For example, the following command runs the initial clustering job in the initial mode:
run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --reducer=16 --outputpath=/usr/hdfs/outputfolder --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --probableMatchTopic=doubtfulrecords --zookeeper=server1.domain.com:2182
If you run the initial clustering job without the --outputpath parameter, you can find the processed data in the following directory:
<Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir
Each job generates a unique ID, and you can identify the job ID based on the time stamp of the <Job ID> folder.
If you run the initial clustering job with the --outputpath parameter, you can find the processed data in the following directory:
<Output Directory in HDFS>/batch-cluster/output/dir
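The two locations above can be summarized in a small helper function. This is a sketch using the directory names from this section; the job ID is whatever the run generated:

```shell
# Sketch: resolve where the initial clustering job wrote its output,
# depending on whether --outputpath was supplied (empty third argument
# means the option was omitted).
resolve_output_dir() {
  workdir=$1; jobid=$2; outputpath=$3
  if [ -n "$outputpath" ]; then
    echo "$outputpath/batch-cluster/output/dir"
  else
    echo "$workdir/batch-cluster/$jobid/output/dir"
  fi
}

resolve_output_dir /usr/hdfs/workingdir MDMBDRM_931211654144593570 ""
resolve_output_dir /usr/hdfs/workingdir MDMBDRM_931211654144593570 /usr/hdfs/outputfolder
```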
Based on the parameters in the configuration file, the initial clustering job creates the following folders for the processed data:
  • pass-join. Contains the linked data.
  • probable-match-pairs. Contains the probable matching records. You get this folder only when you configure the lower threshold value in the configuration file.
The following sample output of an initial clustering job shows the cluster ID, the field values, the match rule name, and the metadata related to the cluster:
683e6e41-9174-4a22-b08e-d5d4adc9b2ee 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 07_UK_Rule1 00000001ZZB>$$$$01000004NAH-C$$$QVM$*K$-N?H-C$$-NAH$$$$-
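Because the cluster ID is the first whitespace-separated field of each output record, you can pull it out with a one-line awk filter. This is a sketch using a truncated copy of the sample line above:

```shell
# Sketch: extract the cluster ID (first field) from a clustering output record.
# The record below is a truncated copy of the sample output line.
record='683e6e41-9174-4a22-b08e-d5d4adc9b2ee 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 07_UK_Rule1'
cluster_id=$(printf '%s\n' "$record" | awk '{print $1}')
echo "$cluster_id"
```

Run over a whole output file, the same awk program yields one cluster ID per record.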

Incremental Mode

Use the following command to run the run_genclusters.sh script to incrementally update the indexed and linked data with additional data:
run_genclusters.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --incremental --clustereddirs=indexed_linked_data_directory [--reducer=number_of_reducers] [--outputpath=directory_for_output_files] [--consolidate] [--keeptemp=true|false] [--compression=true|false] [--probableMatchTopic=probable_match_topic_name] [--zookeeper=zookeeper_connection_string]
The following table describes the options and the arguments that you can specify to run the run_genclusters.sh script:
Option
Argument
Description
--config
configuration_file_name
Absolute path and file name of the configuration file that you create.
--input
input_file_in_HDFS
Absolute path to the input files in HDFS.
--reducer
number_of_reducers
Optional. Number of reducer jobs that you want to run to perform initial clustering. Default is 1.
--hdfsdir
working_directory_in_HDFS
Absolute path to a working directory in HDFS. The initial clustering job uses the working directory to store the output and library files.
--rule
matching_rules_file_name
Absolute path and file name of the matching rules file that you create.
--incremental
Runs the initial clustering job in the incremental mode.
If you want to incrementally update the indexed and linked data in HDFS, run the job in the incremental mode.
By default, the initial clustering job runs in the initial mode.
--clustereddirs
indexed_linked_data_directory
Absolute path to the directory that contains linked data.
If you run the initial clustering job without the --outputpath parameter, you can find the linked data in the following directory:
<Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/pass-join
If you run the initial clustering job with the --outputpath parameter, you can find the linked data in the following directory:
<Output Directory in HDFS>/batch-cluster/output/dir/pass-join
--consolidate
Consolidates the incremental data with the existing indexed and linked data in HDFS.
By default, the initial clustering job indexes and links only the incremental data.
When you set null_ind=2 and run the initial clustering job in the incremental mode, Informatica recommends that you specify the --consolidate option. The --consolidate option ensures that the initially linked data is updated with the changes from the incremental data.
--outputpath
directory_for_output_files
Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job.
By default, the batch job loads the output files to the working directory in HDFS.
--keeptemp
true|false
Optional. Indicates whether to retain the intermediate output tables that the initial clustering job creates. You can use the intermediate output tables for troubleshooting purposes.
Set to true to retain the intermediate output tables, and set to false to remove the intermediate output tables after the successful run of the job.
Default is false.
--compression
true|false
Optional. Indicates whether to compress the output files that the initial clustering job creates. You can compress the output files to avoid any storage issues.
Set to true to compress the output files, and set to false to retain the original size of the output files.
Default is false.
--probableMatchTopic
probable_match_topic_name
Optional. Name of the topic in Kafka to which you want to publish the probable output matches.
Applicable only when the Kafka parameters are configured in the configuration file.
--zookeeper
zookeeper_connection_string
Required if you use the --probableMatchTopic option. Connection string to access the ZooKeeper server.
Use the following format for the connection string:
<Host Name>:<Port>
The connection string uses the following parameters:
  • Host Name. Host name of the ZooKeeper server.
  • Port. Port on which the ZooKeeper server listens.
The following example connection string uses the default ZooKeeper root directory:
server1.domain.com:2182
If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas.
For example, the following command runs the initial clustering job in the incremental mode:
run_genclusters.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/Source10Million --reducer=16 --hdfsdir=/usr/hdfs/workingdir --outputpath=/usr/hdfs/outputfolder --rule=/usr/local/conf/matching_rules.xml --clustereddirs=/usr/hdfs/workingdir/batch-cluster/MDMBDRM_931211654144593570/output/dir/pass-join --probableMatchTopic=doubtfulrecords --zookeeper=server1.domain.com:2182 --incremental
If you run the initial clustering job in the incremental mode with the --consolidate option, you get the output files in the following directories:
  • <Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/pass-join. Contains the consolidated linked data. When you run the initial clustering job again in the incremental mode, use this directory path as the value for the --clustereddirs option.
  • <Working Directory in HDFS>/batch-cluster/<Job ID>/output/dir/incr-only-pass-join. Contains only the incremental linked data.
