Installation and Configuration Guide


Step 3. Deploy Relate 360 on Spark


Deploy Relate 360 on Spark to link, consolidate, or tokenize the input data. To deploy Relate 360 on Spark, run the setup_realtime.sh script located in the following directory:
/usr/local/mdmbdrm-<Version Number>
If you use Cloudera CDH, ensure that you set the spark2-submit command as the default command to run the Spark applications.
Use the following command to run the setup_realtime.sh script:
setup_realtime.sh --config=configuration_file_name --rule=matching_rules_file_name [--checkpointDirectory=directory_to_store_checkpoint] [--consolidate=consolidation_rules_file_name] [--driverMemory=driver_memory] [--enableBackPressure] [--executorMemory=executor_memory] [--instanceName=instance_name] [--keytab=keytab_file_name] [--maxInputRatePerPartition=number_of_records_per_partition] [--outputTopic=output_topic_name] [--probableMatchTopic=probable_match_topic_name] [--partitions=number_of_partitions] [--principal=kerberos_principal_name] [--replica=number_of_replicas] [--resumeFrom=checkpoint_value] [--skipCreateTopic] [--sparkAppJars=list_of_application_jars] [--sparkDriverJar=driver_jar] [--sparkMaster=deployment_mode] [--sparkMicroBatchDuration=batch_duration] [--sparkNumCoresPerExecutor=number_of_cores] [--sparkNumExecutors=number_of_executors] [--zookeeper=zookeeper_connection_string]
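For instance, a minimal invocation might supply only the required --config and --rule options plus the ZooKeeper connection string; the file paths and the ZooKeeper host in this sketch are placeholders, not values from your environment:

```shell
# Minimal deployment with placeholder paths and host.
setup_realtime.sh \
  --config=/usr/local/conf/config_big.xml \
  --rule=/usr/local/conf/matching_rules.xml \
  --zookeeper=server1.domain.com:2182
```

All other options fall back to their defaults, so this deploys an instance named BDRMRTIngestSpark in local[*] mode.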
The setup_realtime.sh script uses the following options and arguments:

--config=configuration_file_name
Absolute path and file name of the configuration file. Ensure that the configuration file is present in the same directory path on all the Spark nodes.

--rule=matching_rules_file_name
Absolute path and file name of the matching rules file. The values in the matching rules file override the values in the configuration file. Ensure that the matching rules file is present in the same directory path on all the Spark nodes.

--checkpointDirectory=directory_to_store_checkpoint
Optional. Absolute path to an HDFS directory or a shared NFS directory that stores the checkpoint-related information. The checkpoint-related information is useful when you redeploy an existing Spark instance after a failure. For example, the following directory path stores the checkpoint-related information in HDFS: hdfs:///user/spark/checkpoint
When you redeploy an existing Spark instance from the failure point, the setup_realtime.sh script ignores the other options that you specify. If you do not want to recover the Spark instance from the failure point, delete the checkpoint directory before you run the setup_realtime.sh script, or use a unique name for the Spark instance.

--consolidate=consolidation_rules_file_name
Optional. Absolute path and file name of the consolidation rules file. Use the consolidation rules file only when you want to consolidate the linked data and create preferred records for all the clusters.

--driverMemory=driver_memory
Optional. Amount of memory in gigabytes that you want to allocate to the driver process of the Spark instance. Default is 1g.

--enableBackPressure
Optional. Enables the internal backpressure mechanism of Spark, which controls the receiving rate of the streaming data. By default, the internal backpressure mechanism is disabled.

--executorMemory=executor_memory
Optional. Amount of memory in gigabytes that you want to allocate to each executor process of the Spark instance. Default is 1g.

--instanceName=instance_name
Optional. Name for the Spark instance that processes the input data. Default is BDRMRTIngestSpark. If you use an existing Spark instance name and specify the checkpoint directory, the setup_realtime.sh script ignores the other options that you specify. If you want the script to use all the options that you specify, use a unique name for the Spark instance.

--keytab=keytab_file_name
Required if you use Kerberos for authentication. Absolute path and file name of the keytab file. The keytab file must contain the Kerberos principal name that you specify in the --principal option. The directory that contains the keytab file must not be SELinux enabled. To remove the SELinux permissions from a directory, use the setfattr command.

--maxInputRatePerPartition=number_of_records_per_partition
Optional. Maximum number of records that the Spark instance can read from each Kafka partition. By default, the Spark instance reads all the records.

--outputTopic=output_topic_name
Optional. Name of the Kafka topic to which you want to publish the matching records. By default, the matching records are not published. The script does not create the output topic, so ensure that you create the output topic before you publish the matching records to it.

--probableMatchTopic=probable_match_topic_name
Required if you specify a lower threshold score in the configuration file. Name of the Kafka topic to which you want to publish the probable matching records. The script does not create the probable match topic, so ensure that you create the probable match topic before you publish the probable matching records to it.

--partitions=number_of_partitions
Optional. Number of partitions for the topic. Use partitions to split the data in the topic across multiple brokers. Default is 1. Ensure that the number of partitions is equal to the number of node managers in the cluster.

--principal=kerberos_principal_name
Required if you use Kerberos for authentication. Kerberos principal name that has access to submit a Spark job.

--replica=number_of_replicas
Optional. Number of replicas that you want to create for the topic. Use replicas for high availability. Default is 1.

--resumeFrom=checkpoint_value
Optional. Offset position of the Kafka input topic from which the Spark instance must process the records. Applicable only if you use a unique name in the --instanceName option. Configure one of the following values:
  • smallest. Processes all the records from the beginning of the Kafka input topic.
  • largest. Processes the records from the latest position of the Kafka input topic.
  • <Instance Name>. Processes the records based on the offset position of the specified Spark instance that uses the same Kafka input topic.
When you redeploy an existing Spark instance after a failure, the Spark instance processes the records in the Kafka topic from the point of failure by default. If you want to process the records from the beginning of the topic or from the current position, reset the offset position for the Spark instance before you redeploy it. For more information about resetting the offset position for a Spark instance, see Resetting the Offset Position for a Spark Instance.

--skipCreateTopic
Required if the topic that you specify in the configuration file already exists in Kafka. Indicates that the script must skip creating the topic. By default, the script creates the topic.

--sparkAppJars=list_of_application_jars
Optional. Comma-separated list of library JAR files and their paths that you want to include in the driver and executor class paths. You can specify the following JAR files:
  • /usr/local/mdmbdrm-<Version Number>/bin/BDRMRTProcessor.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/fastutil-7.0.2.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/guava-12.0.1.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/htrace-core.jar
  • /usr/local/mdmbdrm-<Version Number>/bin/ssan3.jar

--sparkDriverJar=driver_jar
Optional. Name and path of the bdrm-rt-ingest-spark-10.0.HF5.jar file to include in the driver and executor class paths. You can find the bdrm-rt-ingest-spark-10.0.HF5.jar file in the following directory: /usr/local/mdmbdrm-<Version Number>/bin

--sparkMaster=deployment_mode
Indicates whether Spark runs in the local or cluster mode. Use one of the following values:
  • local[*]. Runs Spark locally with as many worker threads as the number of cores on your machine.
  • local[N]. Runs Spark locally with N worker threads.
  • yarn. Runs Spark in the cluster mode.
Default is local[*].

--sparkMicroBatchDuration=batch_duration
Optional. Number of seconds that the Spark instance waits before packaging the input records into a batch. Default is 2. When you kill a batch in the Spark web UI, the Spark instance skips the unprocessed records in the batch.

--sparkNumCoresPerExecutor=number_of_cores
Optional. Number of cores for each executor process to use. Default is 1.

--sparkNumExecutors=number_of_executors
Optional. Number of executor processes that you want to use for the Spark instance. By default, the number of executor processes depends on the data size and the number of node managers in the cluster. Applicable only when you run the Spark instance on YARN.

--zookeeper=zookeeper_connection_string
Connection string to access the ZooKeeper server, in the following format: <Host Name>:<Port>[/<chroot>]
The connection string uses the following parameters:
  • Host Name. Host name of the ZooKeeper server.
  • Port. Port on which the ZooKeeper server listens.
  • chroot. Optional. ZooKeeper root directory that you configure in Kafka. Default is /.
The following example connection string uses the default ZooKeeper root directory: server1.domain.com:2182
The following example connection string uses a user-defined ZooKeeper root directory: server1.domain.com:2182/kafkaroot
If you use an ensemble of ZooKeeper servers, you can specify multiple ZooKeeper servers separated by commas, for example: server1.domain.com:2182,server2.domain.com:2182/kafkaroot
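As noted for the --outputTopic and --probableMatchTopic options, the script does not create those topics for you. One way to create such a topic ahead of time is the standard Kafka command-line tool; this sketch assumes a ZooKeeper-based Kafka deployment, and the topic name, host, partition count, and replication factor are example values:

```shell
# Create the output topic before you run setup_realtime.sh.
# Topic name, host, partitions, and replication factor are examples.
kafka-topics.sh --create \
  --zookeeper server1.domain.com:2182 \
  --topic InsuranceOutput \
  --partitions 3 \
  --replication-factor 2
```

Create the probable match topic the same way if you configure a lower threshold score.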
For example, the following command runs the script that deploys Relate 360 on Spark:
setup_realtime.sh --config=/usr/local/conf/config_big.xml --rule=/usr/local/conf/matching_rules.xml --resumeFrom=smallest --instanceName=Prospects --zookeeper=10.28.10.345 --partitions=3 --replica=2 --sparkMaster=yarn --sparkMicroBatchDuration=5 --checkpointDirectory=hdfs:///user/spark/checkpoint --outputTopic=InsuranceOutput --driverMemory=2g --executorMemory=2g --sparkNumExecutors=3 --sparkNumCoresPerExecutor=2 --sparkAppJars=$sparkDriverLibraryPath/ssan3.jar,$sparkDriverLibraryPath/BDRMRTProcessor.jar,$sparkDriverLibraryPath/fastutil-7.0.2.jar,$sparkDriverLibraryPath/htrace-core.jar,$sparkDriverLibraryPath/guava-12.0.1.jar --sparkDriverJar=$sparkDriverLibraryPath/bdrm-rt-ingest-spark-10.0.HF5.jar --maxInputRatePerPartition=40 --enableBackPressure --principal=kafka/kafka1.hostname.com@EXAMPLE.COM --keytab=/etc/security/keytabs/kafka_server.keytab --probableMatchTopic=probableOutput
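The --keytab option requires that the directory containing the keytab file is not SELinux enabled and mentions the setfattr command for removing the SELinux permissions. A sketch of that step, assuming the example keytab directory from the command above and sufficient privileges:

```shell
# Remove the SELinux extended attribute from the keytab directory.
# The path is an example; adjust it to your keytab location.
setfattr -x security.selinux /etc/security/keytabs
```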


Updated June 27, 2019