To run the HDFS tokenization job in the initial mode, use the following command syntax:

```sh
run_tokenizer.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name [--outputpath=directory_for_output_files] [--reducer=number_of_reducers]
```
| Option | Argument | Description |
|---|---|---|
| --config | configuration_file_name | Absolute path and file name of the configuration file that you create. If you set the StoreAllFields parameter to false in the configuration file, the output files of the job do not include all the columns; they include only the columns that you use to index the input data. To include all the columns in the output files, set the StoreAllFields parameter to true before you run the job (see the sketch after the sample output below). |
| --input | input_file_in_HDFS | Absolute path to the input files in HDFS. |
| --reducer | number_of_reducers | Optional. Number of reducer jobs that you want to run. Default is 1. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The HDFS tokenization job uses the working directory to store the output and library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file that you create. |
| --outputpath | directory_for_output_files | Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory before you rerun the job. By default, the batch job loads the output files to the working directory in HDFS. |
For example, the following command runs the HDFS tokenization job in the initial mode:

```sh
run_tokenizer.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml
```

The output files contain tokenized records similar to the following example:

```
a3ebff6d-c578-4ace-b6d1-2805788d78a6 0000000007ERP 3M 3M Center St. Paul USA ARC SOFT 00000001ZZB>$$$$01000004N?H-C$$-NAH$$$$-NAH-C$$$QVM$*K$-
```
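Before a long run, it can help to confirm how StoreAllFields is set and to send the output to a dedicated directory. The following is a minimal sketch, not part of the product documentation: the output directory /usr/hdfs/tokenout is an assumed example, and the grep check only surfaces the parameter for inspection.

```sh
# Minimal sketch: surface the StoreAllFields setting, then run the job with a
# dedicated --outputpath so reruns do not mix with the working directory.
CONFIG=/usr/local/conf/config_big.xml   # configuration file from the example above
OUTDIR=/usr/hdfs/tokenout               # assumed output directory; any empty HDFS path works

# The output files keep all input columns only when StoreAllFields is true.
grep -i "StoreAllFields" "$CONFIG"

# Per the table above, a reused output directory must be emptied first:
#   hdfs dfs -rm -r "$OUTDIR"/*

run_tokenizer.sh --config="$CONFIG" \
  --input=/usr/hdfs/GenerateTokens \
  --hdfsdir=/usr/hdfs/workingdir \
  --rule=/usr/local/conf/matching_rules.xml \
  --outputpath="$OUTDIR" \
  --reducer=16
```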
To incrementally update the output files of a previous HDFS tokenization job, run the job in the incremental mode with the following command syntax:

```sh
run_tokenizer.sh --config=configuration_file_name --input=input_file_in_HDFS --hdfsdir=working_directory_in_HDFS --rule=matching_rules_file_name --incremental --clustereddirs=tokenize_output_data_directory [--outputpath=directory_for_output_files] [--reducer=number_of_reducers]
```
| Option | Argument | Description |
|---|---|---|
| --config | configuration_file_name | Absolute path and file name of the configuration file that you create. |
| --input | input_file_in_HDFS | Absolute path to the input files in HDFS. |
| --reducer | number_of_reducers | Optional. Number of reducer jobs that you want to run. Default is 1. |
| --hdfsdir | working_directory_in_HDFS | Absolute path to a working directory in HDFS. The HDFS tokenization job uses the working directory to store the output and library files. |
| --rule | matching_rules_file_name | Absolute path and file name of the matching rules file that you create. |
| --incremental | | Runs the HDFS tokenization job in the incremental mode. If you want to incrementally update the output files of an HDFS tokenization job, run the job in the incremental mode. By default, the HDFS tokenization job runs in the initial mode. |
| --clustereddirs | tokenize_output_data_directory | Absolute path to the directory that contains tokenized data. If you ran the HDFS tokenization job without the --outputpath parameter, you can find the tokenized data in the following directory: <Working Directory in HDFS>/batch-tokenize/<Job ID>/tokenize. If you ran the HDFS tokenization job with the --outputpath parameter, you can find the tokenized data in the following directory: <Output Directory in HDFS>/batch-tokenize/tokenize. A sketch after the example below shows one way to locate this directory. |
| --outputpath | directory_for_output_files | Optional. Absolute path to a directory in HDFS to which the batch job loads the output files. Use a different directory when you rerun the batch job. If you want to use the same directory, delete all the files in the directory before you rerun the job. By default, the batch job loads the output files to the working directory in HDFS. |
For example, the following command runs the HDFS tokenization job in the incremental mode:

```sh
run_tokenizer.sh --config=/usr/local/conf/config_big.xml --input=/usr/hdfs/GenerateTokens --reducer=16 --hdfsdir=/usr/hdfs/workingdir --rule=/usr/local/conf/matching_rules.xml --clustereddirs=/usr/hdfs/workingdir/batch-tokenize/MDMBDRM_931211654144593570/tokenize --incremental
```
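The <Job ID> segment in the --clustereddirs path (MDMBDRM_931211654144593570 in the example) is generated by the initial run. The following is a minimal sketch for chaining the two modes, assuming the initial run used the default output location under the working directory; the delta input path /usr/hdfs/DeltaTokens is hypothetical, and selecting the last listed job directory is a simplification that only holds for a working directory with a single past job.

```sh
# Sketch: find the tokenized output of the previous initial run under
# <hdfsdir>/batch-tokenize/<Job ID>/tokenize, then run the incremental job.
WORKDIR=/usr/hdfs/workingdir

# List job directories and keep the last one; with several past jobs you
# would select the job ID explicitly instead.
JOBDIR=$(hdfs dfs -ls "$WORKDIR/batch-tokenize" | grep '^d' | awk '{print $NF}' | tail -1)

# /usr/hdfs/DeltaTokens is a hypothetical directory holding the new records.
run_tokenizer.sh --config=/usr/local/conf/config_big.xml \
  --input=/usr/hdfs/DeltaTokens \
  --hdfsdir="$WORKDIR" \
  --rule=/usr/local/conf/matching_rules.xml \
  --incremental \
  --clustereddirs="$JOBDIR/tokenize"
```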