Table of Contents

  1. Preface
  2. Introduction to Informatica MDM - Relate 360
  3. Linking Batch Data
  4. Tokenizing Batch Data
  5. Processing Streaming Data
  6. Creating Relationship Graph
  7. Loading Linked and Consolidated Data into Hive
  8. Searching Data
  9. Monitoring the Batch Jobs
  10. Troubleshooting
  11. Glossary

User Guide

Running the HDFS Tokenization Job

Use the HDFS tokenization job to read data from the input files in HDFS, create match tokens for the input data, and write the tokenized data to the output files in HDFS.
To run the HDFS tokenization job, run the run_tokenizer.sh script located in the following directory:
/usr/local/mdmbdrm-<Version Number>
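For example, if version 10.4 is installed (a hypothetical version number; substitute the version in your environment), change to the installation directory and run the script from there:

cd /usr/local/mdmbdrm-10.4
./run_tokenizer.sh <options>

The options depend on whether you run the job in the initial mode or in the incremental mode, as described in the following sections.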

Initial Mode

Use the following command to run the run_tokenizer.sh script in the initial mode:
run_tokenizer.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--outputpath=directory_for_output_files] [--reducer=number_of_reducers]
The following list describes the options and the arguments that you can specify when you run the run_tokenizer.sh script in the initial mode:

--config=configuration_file_name
  Absolute path and file name of the configuration file that you create. In the configuration file, if you set the StoreAllFields parameter to false, the output files of the job do not include all the columns but include only the columns that you use to index the input data. If you want to include all the columns in the output files, ensure that you set the StoreAllFields parameter to true in the configuration file before you run the job.

--input=input_file_in_HDFS
  Absolute path to the input files in HDFS.

--reducer=number_of_reducers
  Optional. Number of reducer jobs that you want to run. Default is 1.

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The HDFS tokenization job uses the working directory to store the output and library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file that you create.

--outputpath=directory_for_output_files
  Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job, as shown in the sketch after this list. By default, the batch job loads the output files to the working directory in HDFS.
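For example, the following sketch clears a previously used output directory before you rerun the job. The commands are standard HDFS shell commands, and the directory path is hypothetical:

hdfs dfs -rm -r -f /usr/hdfs/tokenized
hdfs dfs -mkdir -p /usr/hdfs/tokenized

You can then rerun the job with --outputpath=/usr/hdfs/tokenized.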
For example, the following command runs the HDFS tokenization job in the initial mode:
run_tokenizer.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml
If you run the HDFS tokenization job without the --outputpath parameter, you can find the tokenized data in the following directory:
<Working Directory in HDFS>/batch-tokenize/<Job ID>/tokenize
Each job generates a unique ID, and you can identify the job ID based on the time stamp of the <Job ID> folder.
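For example, the following command sorts the job folders by their modification date and time, which appear in the sixth and seventh columns of the hdfs dfs -ls output, so that the most recent <Job ID> folder is listed last. The working directory is taken from the example above:

hdfs dfs -ls /usr/hdfs/workingdir/batch-tokenize | sort -k6,7 | tail -1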
If you run the HDFS tokenization job with the --outputpath parameter, you can find the tokenized data in the following directory:
<Output Directory in HDFS>/batch-tokenize/tokenize
The following sample output of the HDFS tokenization job shows the cluster ID, the field values, and the token for an input record:
a3ebff6d-c578-4ace-b6d1-2805788d78a6 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 00000001ZZB>$$$$01000004N?H-C$$-NAH$$$$-NAH-C$$$QVM$*K$-
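To preview a few tokenized records, you can concatenate the output files. This sketch assumes that the job writes standard MapReduce part-* files and uses an illustrative <Job ID>:

hdfs dfs -cat /usr/hdfs/workingdir/batch-tokenize/MDMBDRM_931211654144593570/tokenize/part-* | head -3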

Incremental Mode

Use the following command to run the run_tokenizer.sh script in the incremental mode:
run_tokenizer.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --incremental --clustereddirs=tokenize_output_data_directory [--outputpath=directory_for_output_files] [--reducer=number_of_reducers]
The following list describes the options and the arguments that you can specify when you run the run_tokenizer.sh script in the incremental mode:
--config=configuration_file_name
  Absolute path and file name of the configuration file that you create.

--input=input_file_in_HDFS
  Absolute path to the input files in HDFS.

--reducer=number_of_reducers
  Optional. Number of reducer jobs that you want to run. Default is 1.

--hdfsdir=working_directory_in_HDFS
  Absolute path to a working directory in HDFS. The HDFS tokenization job uses the working directory to store the output and library files.

--rule=matching_rules_file_name
  Absolute path and file name of the matching rules file that you create.

--incremental
  Runs the HDFS tokenization job in the incremental mode. If you want to incrementally update the output files of an HDFS tokenization job, run the job in the incremental mode. By default, the HDFS tokenization job runs in the initial mode.

--clustereddirs=tokenize_output_data_directory
  Absolute path to the directory that contains the tokenized data. If you ran the HDFS tokenization job without the --outputpath parameter, you can find the tokenized data in the following directory: <Working Directory in HDFS>/batch-tokenize/<Job ID>/tokenize. If you ran the HDFS tokenization job with the --outputpath parameter, you can find the tokenized data in the following directory: <Output Directory in HDFS>/batch-tokenize/tokenize

--outputpath=directory_for_output_files
  Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory and rerun the job. By default, the batch job loads the output files to the working directory in HDFS.
For example, the following command runs the HDFS tokenization job in the incremental mode:
run_tokenizer.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --clustereddirs=/usr/hdfs/workingdir/batch-tokenize/MDMBDRM_931211654144593570/tokenize --incremental
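The initial and incremental runs are typically chained: the initial run tokenizes the full data set, and later incremental runs update the tokens with new records. The following sketch automates the sequence with the paths from the examples above; the version number, the delta input path, and the job-ID lookup are assumptions:

#!/bin/bash
set -e

BDRM_HOME=/usr/local/mdmbdrm-10.4    # hypothetical version number
CONF=/usr/local/conf/config_big.xml
RULES=/usr/local/conf/matching_rules.xml
WORKDIR=/usr/hdfs/workingdir

# Initial run: tokenize the full input data set.
"$BDRM_HOME"/run_tokenizer.sh --config=$CONF --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=$WORKDIR --rule=$RULES

# Find the tokenize directory of the most recent job (the newest <Job ID> folder).
LATEST=$(hdfs dfs -ls $WORKDIR/batch-tokenize | sort -k6,7 | tail -1 | awk '{print $NF}')

# Incremental run: update the existing tokens with new records.
# /usr/hdfs/GenerateTokens_delta is a hypothetical path for the new records.
"$BDRM_HOME"/run_tokenizer.sh --config=$CONF --input=/usr/hdfs/GenerateTokens_delta --reducer=16 --hdfsdir=$WORKDIR --rule=$RULES --incremental --clustereddirs="$LATEST"/tokenize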
