Table of Contents

  1. Preface
  2. Introduction to Informatica MDM - Relate 360
  3. Linking Batch Data
  4. Tokenizing Batch Data
  5. Processing Streaming Data
  6. Creating Relationship Graph
  7. Loading Linked and Consolidated Data into Hive
  8. Searching Data
  9. Monitoring the Batch Jobs
  10. Troubleshooting
  11. Glossary

User Guide

Running the Repository Tokenization Job
The repository tokenization job reads the input data from HDFS, creates match tokens for the data in HDFS, and loads the tokenized data into the repository.

To run the repository tokenization job, run the run_tokenloader.sh script located in the following directory:

/usr/local/mdmbdrm-<Version Number>

Use the following command to run the run_tokenloader.sh script:
run_tokenloader.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --tmpdir=temporary_working_directory [--outputpath=directory_for_output_files] [--reducer=number_of_reducer_jobs]
The following table describes the options and arguments that you can specify when you run the run_tokenloader.sh script:
--config configuration_file_name
    Absolute path and file name of the configuration file that you create.
    In the configuration file, if you set the StoreAllFields parameter to false, the repository persists only the columns that you use to index the input data. To persist all the columns in the repository, set the StoreAllFields parameter to true in the configuration file before you tokenize the input data. A pre-run check sketch follows this table.

--input input_file_in_HDFS
    Absolute path to the input files in HDFS.

--reducer number_of_reducer_jobs
    Optional. Number of reducer jobs that you want to run. Default is 1.

--hdfsdir working_directory_in_HDFS
    Absolute path to a working directory in HDFS. The repository tokenization process uses the working directory to store the library files.

--rule matching_rules_file_name
    Absolute path and file name of the matching rules file that you create.
    The values in the matching rules file override the values in the configuration file.

--tmpdir temporary_working_directory
    Absolute path to a temporary directory in the local file system to which you have write permission. The repository tokenization job uses the directory to store the intermediate files.

--outputpath directory_for_output_files
    Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to reuse the same directory, delete all the files in the directory before you rerun the job, as shown in the sketch after the example command. By default, the batch job loads the output files to the working directory in HDFS.
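Because the StoreAllFields setting must be correct before you tokenize the input data, you might want to verify it first. The following shell sketch is illustrative only, not part of the product: it assumes the parameter name appears literally in the configuration file, and it uses the configuration file path from the example command below.

#!/bin/sh
# Illustrative pre-run check: print the StoreAllFields setting so you can
# confirm it before you tokenize the input data.
CONFIG=/usr/local/conf/config_big.xml   # path from the example below; adjust as needed

grep -i "StoreAllFields" "$CONFIG" || echo "StoreAllFields not found in $CONFIG"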
For example, the following command runs the repository tokenization job:
run_tokenloader.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --tmpdir=/tmp
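If you rerun the job with the same output directory, the directory must be empty. The following sketch is one possible wrapper, not part of the product: it reuses the paths from the example command, assumes the Hadoop client is on your PATH, and uses a hypothetical output directory.

#!/bin/sh
# Illustrative rerun wrapper: clear the previous output files, then start the job.
OUTPUT=/usr/hdfs/tokenoutput   # hypothetical output directory; adjust as needed

# Delete earlier output files so the rerun starts with an empty directory.
hdfs dfs -rm -r -f "$OUTPUT/*"

run_tokenloader.sh --config=/usr/local/conf/config_big.xml \
--input=/usr/hdfs/GenerateTokens \
--reducer=16 \
--hdfsdir=/usr/hdfs/workingdir \
--rule=/usr/local/conf/matching_rules.xml \
--tmpdir=/tmp \
--outputpath="$OUTPUT"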
The following sample output of the repository tokenization job shows the index ID, the token, and the values from the column family:
ROW                                 COLUMN+CELL
 00KCKSHX$$ SALESFORCE0000000066    column=aml_link_columns:CLUSTERNUMBER, timestamp=1454406691384, value=f9febe82-8f55-4b7e-98de-a11290ae2807
 00KCKSHX$$ SALESFORCE0000000066    column=aml_link_columns:LMT_MATCHED_PK, timestamp=1454406691384, value=
 00KCKSHX$$ SALESFORCE0000000066    column=aml_link_columns:LMT_MATCHED_RECORD_SOURCE, timestamp=1454406691384, value=
 00KCKSHX$$ SALESFORCE0000000066    column=aml_link_columns:LMT_MATCHED_SCORE, timestamp=1454406691384, value=0
 00KCKSHX$$ SALESFORCE0000000066    column=aml_link_columns:LMT_SOURCE_NAME, timestamp=1454406691384, value=
 00KCKSHX$$ SALESFORCE0000000066    column=aml_link_columns:NAME, timestamp=1454406691384, value=Abbott Laboratories
 00KCKSHX$$ SALESFORCE0000000066    column=aml_link_columns:ROWID, timestamp=1454406691384, value=0000000066
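The sample resembles the output of an HBase shell scan. If your repository table is stored in HBase, you can inspect the loaded tokens with a command like the following sketch; the table name mdm_repository_table is a placeholder, and only the aml_link_columns column family comes from the sample above.

hbase shell <<'EOF'
# Placeholder table name; substitute the repository table that the job loads.
scan 'mdm_repository_table', {COLUMNS => 'aml_link_columns', LIMIT => 10}
EOF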
