Mapping Configurations

To configure a mapping, configure the connection and run-time properties for the mapping.
When you configure the mapping, configure the following properties:
Validation Environment
The environment in which the mapping is validated. Select Hadoop as the validation environment and select the Spark engine. The Data Integration Service pushes the mapping logic to the Spark engine.
Execution Environment
The environment in which the mappings are executed. Select Hadoop as the execution environment.
Specify the following properties for the Spark engine:
  • Connection. Select the connection to the Spark engine used for pushdown of processing. Select
    and browse for a connection or select a connection parameter.
  • Runtime Properties. An optional list of configuration parameters to apply to the Spark engine. You can change the default values of Spark configuration properties.
    Use the following format: <property1>=<value>
    • <property1> is a Spark configuration property.
    • <value> is the value of the property.
    To enter multiple properties, separate each name-value pair with the following text: &:
    To use a JMS source or Amazon Kinesis Streams source in the mapping, configure two or more executors for the mapping. For example, use the following configuration:
    spark.executor.instances=2 &: spark.executor.cores=2 &: spark.driver.cores=1
    To use an AWS credential profile, configure the following properties for the mapping:
    • spark.yarn.appMasterEnv.AWS_CREDENTIAL_PROFILES_FILE=<absolute path to the credentials file>/credentials
    • spark.executorEnv.AWS_CREDENTIAL_PROFILES_FILE=<absolute path to the credentials file>/credentials
    • spark.driverEnv.AWS_CREDENTIAL_PROFILES_FILE=<absolute path to the credentials file>/credentials
    To maintain global sort order in a streaming mapping that contains a Sorter transformation, you must set the
    property to true and ensure that the Maintain Row Order property is disabled on the target.
    When you use an Amazon S3 target in the mapping to write a large amount of data and specify a higher gigabyte value for the
    property, ensure that you add the
    property and set the value of the property to
    or higher.
    In the case of a mapping failure, to enable the mapping to start reading data from the time of failure, configure the
    property. For example: <directory>
    . The directory you specify is created within the directory you specify in the
    State Store property.
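The name-value format described above can be sketched in Python. This is a minimal illustration, not part of any Informatica API: the helper names `format_runtime_properties`, `parse_runtime_properties`, and `aws_profile_properties` are hypothetical, and the `/opt/aws` path is a placeholder for your own absolute path.

```python
# Hypothetical helpers for building a Runtime Properties string.
# Only the property names and the "&:" separator come from the text above.
SEPARATOR = " &: "


def format_runtime_properties(props):
    """Join name=value pairs with the '&:' separator."""
    return SEPARATOR.join(f"{name}={value}" for name, value in props.items())


def parse_runtime_properties(text):
    """Split a Runtime Properties string back into a dict of strings."""
    pairs = (item.strip() for item in text.split("&:"))
    return dict(pair.split("=", 1) for pair in pairs if pair)


def aws_profile_properties(credentials_dir):
    """Return the three AWS credential profile properties for one directory."""
    path = f"{credentials_dir}/credentials"
    return {
        "spark.yarn.appMasterEnv.AWS_CREDENTIAL_PROFILES_FILE": path,
        "spark.executorEnv.AWS_CREDENTIAL_PROFILES_FILE": path,
        "spark.driverEnv.AWS_CREDENTIAL_PROFILES_FILE": path,
    }


# Two executors, as required for a JMS or Amazon Kinesis Streams source.
executors = format_runtime_properties({
    "spark.executor.instances": 2,
    "spark.executor.cores": 2,
    "spark.driver.cores": 1,
})
print(executors)

# "/opt/aws" is a placeholder path, not a documented default.
print(format_runtime_properties(aws_profile_properties("/opt/aws")))
```

Generating the string from a dictionary keeps each property on its own line in source control and avoids separator typos when the list grows.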
Source Configuration
Specify the following properties to configure how the data is processed:
  • Maximum Rows Read. Specify the maximum number of rows that are read before the mapping stops running. Default is Read All Rows.
  • Maximum Runtime Interval. Specify the maximum time to run the mapping before it stops. If you set values for this property and the Maximum Rows Read property, the mapping stops running after one of the criteria is met. Default is Run Indefinitely. A value of Run Indefinitely enables the mapping to run without stopping.
  • State Store. Specify the HDFS location on the cluster to store information about the state of the Spark Job. Default is <Home Directory>/stateStore.
    You can configure the state store as part of the configuration of execution options for the Data Integration Service.
You can use these properties to test the mapping.
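The interaction between Maximum Rows Read and Maximum Runtime Interval can be sketched as follows. This is an illustrative Python model of "stop when either criterion is met", not the Data Integration Service's implementation; the function name `read_until_limit` and its parameters are hypothetical.

```python
import time


def read_until_limit(rows, max_rows=None, max_runtime_seconds=None,
                     clock=time.monotonic):
    """Yield rows until the row limit or the time limit is reached.

    A limit of None means no limit, which models Read All Rows
    and Run Indefinitely.
    """
    start = clock()
    for count, row in enumerate(rows):
        # Stop as soon as either criterion is met.
        if max_rows is not None and count >= max_rows:
            return
        if (max_runtime_seconds is not None
                and clock() - start >= max_runtime_seconds):
            return
        yield row


print(list(read_until_limit(range(10), max_rows=3)))  # prints [0, 1, 2]
```

Setting a small row limit or a short runtime interval in this way gives a bounded run, which is why these properties are convenient for testing a mapping that would otherwise run indefinitely.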
Streaming Properties
Specify the following streaming properties:
  • Batch interval. The Spark engine processes the streaming data from sources and publishes the data in batches. The batch interval is the number of seconds after which a batch is submitted for processing.
  • Cache refresh interval. You can cache a large lookup source or small lookup tables. When you cache the lookup source, the Data Integration Service queries the lookup cache instead of querying the lookup source for each input row. You can configure the interval for refreshing the cache used in a relational Lookup transformation.
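The idea behind a cache refresh interval can be sketched in Python: lookups are answered from an in-memory copy, and the copy is reloaded from the source once the interval has elapsed. This is a simplified model under assumed names (`RefreshingLookupCache`, `load_source`), not how the Data Integration Service implements the relational Lookup cache.

```python
import time


class RefreshingLookupCache:
    """Serve lookups from a cached copy that is reloaded periodically."""

    def __init__(self, load_source, refresh_interval_seconds,
                 clock=time.monotonic):
        self._load_source = load_source      # callable returning a dict
        self._interval = refresh_interval_seconds
        self._clock = clock
        self._cache = load_source()          # initial load of the lookup source
        self._loaded_at = clock()

    def lookup(self, key, default=None):
        # Reload the cache when the refresh interval has elapsed.
        if self._clock() - self._loaded_at >= self._interval:
            self._cache = self._load_source()
            self._loaded_at = self._clock()
        # Answer from the cache instead of querying the source per row.
        return self._cache.get(key, default)


# Hypothetical lookup source: a function that returns the current table.
def load_countries():
    return {"US": "United States", "DE": "Germany"}


cache = RefreshingLookupCache(load_countries, refresh_interval_seconds=60)
print(cache.lookup("DE"))  # prints Germany
```

A shorter interval keeps lookup results closer to the source at the cost of more frequent reloads; a longer interval does the opposite.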
Run Configurations
The Developer tool applies configuration properties when you run streaming mappings. Set configuration properties for streaming mappings in the Run Configurations dialog box.
Configure the following source properties:
  • Read all rows. Reads all rows from the source.
  • Read up to how many rows. The maximum number of rows to read from the source if you do not read all rows.
  • Maximum runtime interval. The maximum time to run the mapping before it stops. If you set values for this property and the Maximum Rows Read property, the mapping stops running after one of the criteria is met.
When you run the mapping, the Data Integration Service converts the mapping to a Scala program, packages it in a JAR file, and sends it to the Hadoop cluster. You can view the details in the Spark execution plan in the Developer tool or Administrator tool.
The following image shows the connection and run-time properties:
The Run-time properties show the Validation Environment and the Execution Environment properties.

