Table of Contents

  1. Preface
  2. Installing Informatica MDM - Relate 360
  3. Configuring Relate 360
  4. Configuring Security
  5. Setting Up the Environment to Process Streaming Data
  6. Configuring Distributed Search
  7. Packaging and Deploying the RESTful Web Services

Installation and Configuration Guide

Step 3. Deploy Relate 360 on Spark

You must deploy Relate 360 on Spark to link, consolidate, or tokenize the input data. Run the setup_realtime.sh script located in the following directory to deploy Relate 360 on Spark: /usr/local/mdmbdrm-<Version Number>
Use the following command to run the setup_realtime.sh script:
setup_realtime.sh
--config=configuration_file_name
--rule=matching_rules_file_name
--resumeFrom=checkpoint_value
[--consolidate=consolidation_rules_file_name]
[--instanceName=instance_name]
[--zookeeper=zookeeper_connection_string]
[--skipCreateTopic]
[--sparkMaster=deployment_mode]
[--sparkMicroBatchDuration=batch_duration]
[--checkpointDirectory=directory_to_store_checkpoint]
[--partitions=number_of_partitions]
[--replica=number_of_replicas]
[--outputTopic=output_topic_name]
[--driverMemory=driver_memory]
[--executorMemory=executor_memory]
[--sparkNumExecutors=number_of_executors]
[--sparkNumCoresPerExecutor=number_of_cores]
[--sparkAppJars=list_of_application_jars]
[--sparkDriverJar=driver_jar]
[--maxInputRatePerPartition=number_of_records_per_partition]
[--enableBackPressure]
The following list describes the options and arguments that you can specify to run the setup_realtime.sh script:
--config configuration_file_name
Absolute path and file name of the configuration file.
Ensure that the configuration file is present in the same directory path on all the Spark nodes.
--rule matching_rules_file_name
Absolute path and file name of the matching rules file.
The values in the matching rules file override the values in the configuration file.
Ensure that the matching rules file is present in the same directory path on all the Spark nodes.
--resumeFrom checkpoint_value
Required when you deploy Relate 360 on Spark for the first time. Indicates the checkpoint from which the Spark instance must recover after a failure. The value of the --resumeFrom parameter overrides the checkpoint information that the ZooKeeper directory stores.
Configure one of the following values for the --resumeFrom parameter:
  • smallest. Resets the checkpoint to the smallest offset value of the topic. The Spark instance processes all the records in the Kafka input topic.
  • largest. Resets the checkpoint to the largest offset value of the topic. The Spark instance processes the records in the Kafka input topic from the point of failure.
  • <Instance Name>. Resets the checkpoint to the checkpoint of another Spark instance that you specify. The Spark instance processes the records based on the checkpoint of the specified Spark instance. Ensure that both the Spark instances use the same Kafka input topic.
Use the --resumeFrom parameter with caution because the Spark instance might reprocess the input records based on the value that you specify.
--consolidate consolidation_rules_file_name
Optional. Absolute path and file name of the consolidation rules file.
Use the consolidation rules file only when you want to consolidate the linked data and create preferred records for all the clusters.
--instanceName instance_name
Optional. Name for the Spark instance that processes the input data.
Default is BDRMRTIngestSpark.
--zookeeper zookeeper_connection_string
Optional. Connection string to access the ZooKeeper server.
Use the following format for the connection string:
<Host Name>:<Port>[/<chroot>]
The connection string uses the following parameters:
  • Host Name. Host name of the ZooKeeper server.
  • Port. Port on which the ZooKeeper server listens.
  • chroot. Optional. ZooKeeper root directory that you configure in Kafka. Default is /.
The following example connection string uses the default ZooKeeper root directory: server1.domain.com:2182
The following example connection string uses the user-defined ZooKeeper root directory: server1.domain.com:2182/kafkaroot
If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas.
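For example, a connection string for a three-server ensemble might look like the following (the host names are illustrative): server1.domain.com:2182,server2.domain.com:2182,server3.domain.com:2182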
--skipCreateTopic
Required if the topic that you specify in the configuration file already exists in Kafka. Indicates to skip creating the topic.
By default, the script creates the topic.
--partitions number_of_partitions
Optional. Number of partitions for the topic. Use partitions to split the data in the topic across multiple brokers. Default is 1.
Ensure that the number of partitions is equal to the number of node managers in the cluster.
--replica number_of_replicas
Optional. Number of replicas that you want to create for the topic. Use replicas for high availability purposes.
Default is 1.
--sparkMaster deployment_mode
Indicates whether Spark runs in standalone mode or cluster mode.
Use one of the following values:
  • local[*]. Indicates to run Spark locally with as many worker threads as the number of cores on your machine.
  • local[N]. Indicates to run Spark locally with N worker threads.
  • yarn. Indicates to run Spark in cluster mode on YARN.
Default is local[*].
--sparkMicroBatchDuration batch_duration
Optional. Number of seconds that the Spark instance waits before packaging the input records into a batch and polling the next batch of data. Default is 2 seconds.
When you kill a batch in the Spark web UI, the Spark instance skips the unprocessed records in the batch.
--checkpointDirectory directory_to_store_checkpoint
Absolute path to an HDFS directory or a shared NFS directory that stores the checkpoint-related information. Spark uses the checkpoint-related information when a node recovers from a failure.
For example, the following sample directory path stores the checkpoint-related information in HDFS: hdfs:///user/spark/checkpoint
--outputTopic output_topic_name
Optional. Name of the topic in Kafka to which you want to publish the output messages. By default, the output messages are not published.
The script does not create the output topic, so ensure that you create the output topic before you publish the output messages to it (see the sample topic creation command after this list).
--driverMemory driver_memory
Optional. Amount of memory in gigabytes that you want to allocate to the driver process of the Spark instance. Default is 1g.
--executorMemory executor_memory
Optional. Amount of memory in gigabytes that you want to allocate to each executor process of the Spark instance. Default is 1g.
--sparkNumExecutors number_of_executors
Optional. Number of executor processes that you want to use for the Spark instance. By default, the number of executor processes depends on the data size and the number of node managers in the cluster.
Applicable only when you run the Spark instance on YARN.
--sparkNumCoresPerExecutor number_of_cores
Optional. Number of cores for each executor process to use. Default is 1.
--sparkAppJars list_of_application_jars
Optional. Comma-separated list of library JAR files and their paths that you want to include in the driver and executor class paths.
You can specify the following JAR files:
  • /usr/local/mdmbdrm-<Version Number>/bin/BDRMRTProcessor.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/fastutil-7.0.2.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/guava-12.0.1.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/htrace-core.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/ssan3.jar
--sparkDriverJar driver_jar
Optional. Name and path of the bdrm-rt-ingest-spark-10.0.HF5.jar file to include in the driver and executor class paths.
You can find the bdrm-rt-ingest-spark-10.0.HF5.jar file in the following directory:
/usr/local/mdmbdrm-<Version Number>/bin
--maxInputRatePerPartition number_of_records_per_partition
Optional. Maximum number of records that the Spark instance can read from each Kafka partition. By default, the Spark instance reads all the records.
--enableBackPressure
Optional. Indicates to enable the internal backpressure mechanism of Spark. The mechanism controls the receiving rate of the streaming data. By default, the internal backpressure mechanism is disabled.
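If you specify the --outputTopic option, the script does not create that topic. Depending on your Kafka version and installation, you might create the output topic with the Kafka command-line tools; the following command is a sketch in which the topic name, ZooKeeper connection string, partition count, and replica count are illustrative:
kafka-topics.sh --create --zookeeper server1.domain.com:2182 --topic InsuranceOutput --partitions 3 --replication-factor 2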
For example, the following command runs the script that deploys Relate 360 on Spark:
setup_realtime.sh --config=/usr/local/conf/config_big.xml --rule=/usr/local/conf/matching_rules.xml --resumeFrom=smallest --instanceName=Prospects --zookeeper=10.28.10.345 --partitions=3 --replica=2 --sparkMaster=yarn --sparkMicroBatchDuration=5 --checkpointDirectory=hdfs:///user/spark/checkpoint --outputTopic=InsuranceOutput --driverMemory=2g --executorMemory=2g --sparkNumExecutors=3 --sparkNumCoresPerExecutor=2 --sparkAppJars=$sparkDriverLibraryPath/ssan3.jar,$sparkDriverLibraryPath/BDRMRTProcessor.jar,$sparkDriverLibraryPath/fastutil-7.0.2.jar,$sparkDriverLibraryPath/htrace-core.jar,$sparkDriverLibraryPath/guava-12.0.1.jar --sparkDriverJar=$sparkDriverLibraryPath/bdrm-rt-ingest-spark-10.0.HF5.jar --maxInputRatePerPartition=40 --enableBackPressure
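In this example, $sparkDriverLibraryPath is assumed to be a shell variable that points to the directory that contains the Relate 360 JAR files, for example:
export sparkDriverLibraryPath=/usr/local/mdmbdrm-<Version Number>/bin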
When you run the setup_realtime.sh script after an unexpected shutdown of the node, the script recovers Spark from the failure point and ignores the other parameters that you specify in the command. If you do not want to recover Spark from the failure point, delete the checkpoint directory before you run the setup_realtime.sh script.
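For example, if the checkpoint directory is in HDFS as in the preceding sample command, a command similar to the following might delete it (the path is illustrative and must match the value of the --checkpointDirectory option):
hdfs dfs -rm -r hdfs:///user/spark/checkpoint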

