MDM - Relate 360
```
setup_realtime.sh --config=configuration_file_name --rule=matching_rules_file_name [--consolidate=consolidation_rules_file_name] [--instanceName=instance_name] [--zookeeper=zookeeper_connection_string] [--skipCreateTopic] [--sparkMaster=deployment_mode] [--sparkMicroBatchDuration=batch_processing_time] [--checkpointDirectory=directory_to_store_checkpoint] [--partitions=number_of_partitions] [--replica=number_of_replicas] [--outputTopic=output_topic_name]
```
| Option | Argument | Description |
|---|---|---|
| `--config` | configuration_file_name | Absolute path and file name of the configuration file. Ensure that the configuration file is present in the same directory path on all the Spark nodes. |
| `--rule` | matching_rules_file_name | Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file. Ensure that the matching rules file is present in the same directory path on all the Spark nodes. |
| `--consolidate` | consolidation_rules_file_name | Optional. Absolute path and file name of the consolidation rules file. Use the consolidation rules file only when you want to consolidate the linked data and create preferred records for all the clusters. |
| `--instanceName` | instance_name | Optional. Name of the Spark instance that processes the input data. Default is `BDRMRTIngestSpark`. |
| `--zookeeper` | zookeeper_connection_string | Optional. Connection string to access the ZooKeeper server, in the format `<Host Name>:<Port>[/<chroot>]`. For example, `server1.domain.com:2182` uses the default ZooKeeper root directory, and `server1.domain.com:2182/kafkaroot` uses a user-defined ZooKeeper root directory. If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas. |
| `--skipCreateTopic` | | Required if the topic that you specify in the configuration file already exists in Kafka. Indicates to skip creating the topic. By default, the script creates the topic. |
| `--partitions` | number_of_partitions | Optional. Number of partitions for the topic. Use partitions to split the data in the topic across multiple brokers. Ensure that the number of partitions equals the number of node managers in the cluster. Default is 1. |
| `--replica` | number_of_replicas | Optional. Number of replicas that you want to create for the topic. Use replicas for high availability. Default is 1. |
| `--sparkMaster` | deployment_mode | Indicates whether Spark runs in standalone or cluster mode. Default is `local[*]`. |
| `--sparkMicroBatchDuration` | batch_processing_time | Number of seconds to wait before polling the next batch of data. Default is 2 seconds. |
| `--checkpointDirectory` | directory_to_store_checkpoint | Absolute path to an HDFS directory or a shared NFS directory that stores the checkpoint-related information. Spark uses the checkpoint-related information when a node recovers from a failure. For example, the directory path `hdfs:///user/spark/checkpoint` stores the checkpoint-related information in HDFS. |
| `--outputTopic` | output_topic_name | Optional. Name of the Kafka topic to which you want to publish the output messages. By default, the output messages are not published. The script does not create the output topic, so ensure that you create the output topic before you publish output messages to it. |
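The required and optional options in the table above can be assembled into a command line programmatically, which is useful when the script is launched from an orchestration job. The following sketch is illustrative only: `build_command` is a hypothetical helper, not part of the product, and the file paths are examples.

```python
# Sketch: assemble a setup_realtime.sh command line from the options in the
# table above. build_command is a hypothetical helper; the paths are examples.

def build_command(config, rule, **optional):
    """--config and --rule are required; every other option is appended
    only if supplied. --skipCreateTopic is a flag that takes no argument."""
    args = ["setup_realtime.sh", f"--config={config}", f"--rule={rule}"]
    if optional.pop("skipCreateTopic", False):
        args.append("--skipCreateTopic")
    for name, value in optional.items():
        args.append(f"--{name}={value}")
    return " ".join(args)

cmd = build_command(
    "/usr/local/conf/config_big.xml",
    "/usr/local/conf/matching_rules.xml",
    skipCreateTopic=True,
    partitions=3,
    replica=2,
)
print(cmd)
```

Keeping the two required options as positional parameters and everything else as keyword arguments mirrors the required/optional split in the table.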
```
setup_realtime.sh --config=/usr/local/conf/config_big.xml --rule=/usr/local/conf/matching_rules.xml --consolidate=/usr/local/conf/consolidationfile.xml --instanceName=Prospects --zookeeper=10.28.10.345 --skipCreateTopic --partitions=3 --replica=2 --sparkMaster=yarn --sparkMicroBatchDuration=5 --checkpointDirectory=hdfs:///user/spark/checkpoint --outputTopic=InsuranceOutput
```
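The documented format for the `--zookeeper` value is `<Host Name>:<Port>[/<chroot>]`, with multiple comma-separated entries for a ZooKeeper ensemble. A malformed connection string can be caught before the script runs; the following is a minimal sketch of such a check, assuming a hypothetical `is_valid_connection_string` helper (not part of the product):

```python
import re

# Matches one <Host Name>:<Port> entry with an optional /<chroot> suffix,
# per the --zookeeper format described above. Hypothetical helper.
_ENTRY = re.compile(r"^[A-Za-z0-9.\-]+:\d+(/[\w/.\-]+)?$")

def is_valid_connection_string(value):
    """Return True if every comma-separated entry matches host:port[/chroot]."""
    return all(_ENTRY.match(entry) for entry in value.split(","))
```

For example, `is_valid_connection_string("server1.domain.com:2182/kafkaroot")` returns `True`, while a value with no port fails the check.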
Updated June 27, 2019