To run the HDFS tokenization job in the initial mode, use the following command syntax:

```sh
run_tokenizer.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--outputpath=directory_for_output_files] [--reducer=number_of_reducers]
```
| Option | Argument | Description |
|---|---|---|
| --config | configuration_file_name | Absolute path and file name of the configuration file that you create. If you set the StoreAllFields parameter to false in the configuration file, the output files of the job do not include all the columns; they include only the columns that you use to index the input data. To include all the columns in the output files, set the StoreAllFields parameter to true before you run the job (see the sketch after the sample output below). |
| --input | input_file_in_HDFS | Absolute path to the input files in HDFS. |
| --reducer | number_of_reducers | Optional. Number of reducer jobs that you want to run. Default is 1. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The HDFS tokenization job uses the working directory to store the output and library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file that you create. |
| --outputpath | directory_for_output_files | Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory before you rerun the job. By default, the batch job loads the output files to the working directory in HDFS. |
For example, the following command runs the HDFS tokenization job in the initial mode:

```sh
run_tokenizer.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml
```

The output files contain tokenized records similar to the following example:

```
a3ebff6d-c578-4ace-b6d1-2805788d78a6 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 00000001ZZB>$$$$01000004N?H-C$$-NAH$$$$-NAH-C$$$QVM$*K$-
```
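Before a long run, it can help to confirm how StoreAllFields is set and to send the output to a dedicated directory. The following is a minimal sketch, not part of the product documentation: the output directory /usr/hdfs/tokenout is an assumed example, and the grep check only surfaces the parameter for inspection.

```sh
# Minimal sketch: surface the StoreAllFields setting, then run the job with a
# dedicated --outputpath so reruns do not mix with the working directory.
CONFIG=/usr/local/conf/config_big.xml   # configuration file from the example above
OUTDIR=/usr/hdfs/tokenout               # assumed output directory; any empty HDFS path works

# The output files keep all input columns only when StoreAllFields is true.
grep -i "StoreAllFields" "$CONFIG"

# Per the table above, a reused output directory must be emptied first:
#   hdfs dfs -rm -r "$OUTDIR"/*

run_tokenizer.sh --config="$CONFIG" \
  --input=/usr/hdfs/GenerateTokens \
  --hdfsdir=/usr/hdfs/workingdir \
  --rule=/usr/local/conf/matching_rules.xml \
  --outputpath="$OUTDIR" \
  --reducer=16
```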
To incrementally update the output files of a previous HDFS tokenization job, run the job in the incremental mode with the following command syntax:

```sh
run_tokenizer.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --incremental --clustereddirs=tokenize_output_data_directory [--outputpath=directory_for_output_files] [--reducer=number_of_reducers]
```
| Option | Argument | Description |
|---|---|---|
| --config | configuration_file_name | Absolute path and file name of the configuration file that you create. |
| --input | input_file_in_HDFS | Absolute path to the input files in HDFS. |
| --reducer | number_of_reducers | Optional. Number of reducer jobs that you want to run. Default is 1. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The HDFS tokenization job uses the working directory to store the output and library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file that you create. |
| --incremental | | Runs the HDFS tokenization job in the incremental mode. If you want to incrementally update the output files of an HDFS tokenization job, run the job in the incremental mode. By default, the HDFS tokenization job runs in the initial mode. |
| --clustereddirs | tokenize_output_data_directory | Absolute path to the directory that contains tokenized data. If you ran the HDFS tokenization job without the --outputpath parameter, you can find the tokenized data in the following directory: <Working Directory in HDFS>/batch-tokenize/<Job ID>/tokenize. If you ran the HDFS tokenization job with the --outputpath parameter, you can find the tokenized data in the following directory: <Output Directory in HDFS>/batch-tokenize/tokenize. A sketch after the example below shows one way to locate this directory. |
| --outputpath | directory_for_output_files | Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory before you rerun the job. By default, the batch job loads the output files to the working directory in HDFS. |
For example, the following command runs the HDFS tokenization job in the incremental mode:

```sh
run_tokenizer.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --clustereddirs=/usr/hdfs/workingdir/batch-tokenize/MDMBDRM_931211654144593570/tokenize --incremental
```
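The <Job ID> segment in the --clustereddirs path (MDMBDRM_931211654144593570 in the example) is generated by the initial run. The following is a minimal sketch for chaining the two modes, assuming the initial run used the default output location under the working directory; the delta input path /usr/hdfs/DeltaTokens is hypothetical, and selecting the last listed job directory is a simplification that only holds for a working directory with a single past job.

```sh
# Sketch: find the tokenized output of the previous initial run under
# <hdfsdir>/batch-tokenize/<Job ID>/tokenize, then run the incremental job.
WORKDIR=/usr/hdfs/workingdir

# List job directories and keep the last one; with several past jobs you
# would select the job ID explicitly instead.
JOBDIR=$(hdfs dfs -ls "$WORKDIR/batch-tokenize" | grep '^d' | awk '{print $NF}' | tail -1)

# /usr/hdfs/DeltaTokens is a hypothetical directory holding the new records.
run_tokenizer.sh --config=/usr/local/conf/config_big.xml \
  --input=/usr/hdfs/DeltaTokens \
  --hdfsdir="$WORKDIR" \
  --rule=/usr/local/conf/matching_rules.xml \
  --incremental \
  --clustereddirs="$JOBDIR/tokenize"
```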