To use an AWS credential profile, configure the following properties for the mapping:
spark.yarn.appMasterEnv.AWS_CREDENTIAL_PROFILES_FILE=<absolute path to the credentials file>/credentials
spark.executorEnv.AWS_CREDENTIAL_PROFILES_FILE=<absolute path to the credentials file>/credentials
spark.driverEnv.AWS_CREDENTIAL_PROFILES_FILE=<absolute path to the credentials file>/credentials
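As a minimal sketch, the three properties above can be generated from a single directory so that they all point to the same credentials file. The helper name `aws_profile_properties` and the example directory are hypothetical. Only the property keys come from the list above.

```python
import posixpath

# Property keys copied from the list above; the helper itself is hypothetical.
PROPERTY_KEYS = (
    "spark.yarn.appMasterEnv.AWS_CREDENTIAL_PROFILES_FILE",
    "spark.executorEnv.AWS_CREDENTIAL_PROFILES_FILE",
    "spark.driverEnv.AWS_CREDENTIAL_PROFILES_FILE",
)


def aws_profile_properties(credentials_dir):
    """Return key=value lines that point each property at <credentials_dir>/credentials."""
    if not posixpath.isabs(credentials_dir):
        raise ValueError("the credentials path must be absolute")
    credentials_file = posixpath.join(credentials_dir, "credentials")
    return [f"{key}={credentials_file}" for key in PROPERTY_KEYS]


if __name__ == "__main__":
    # "/home/etl/.aws" is an assumed example directory, not a documented default.
    for line in aws_profile_properties("/home/etl/.aws"):
        print(line)
```

Generating the lines from one value avoids the three paths drifting apart when the credentials file is moved.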
To maintain global sort order in a streaming mapping that contains a Sorter transformation, you must set the
property to true and ensure that the Maintain Row Order property is disabled on the target.
When you use an Amazon S3 target in the mapping to write a large amount of data and specify a higher gigabyte value for the
property, ensure that you add the
property and set the value of the property to
When you use a Google Cloud Storage target in a streaming mapping, you can configure size-based rollover and time-based rollover while creating the mapping.
If you enable high precision in a streaming mapping, the Spark engine runs the mapping in low-precision mode.
Specify the following properties to configure how the data is processed:
Maximum Rows Read. Specify the maximum number of rows that are read before the mapping stops running. Default is Read All Rows.
Maximum Runtime Interval. Specify the maximum time to run the mapping before it stops. If you set values for this property and the Maximum Rows Read property, the mapping stops running after one of the criteria is met. Default is
. A value of
enables the mapping to run without stopping.
State Store. Specify the HDFS location on the cluster to store information about the state of the Spark Job. Default is
You can configure the state store as part of the configuration of execution options for the Data Integration Service.
You can use these properties to test the mapping.
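The interaction between the two limits can be illustrated with a small sketch: processing stops as soon as either the row limit or the runtime limit is reached. This is only an illustration of the "whichever comes first" behavior described above, not how the Spark engine enforces it. The function name `run_until_limit`, its parameters, and the use of seconds as the time unit are assumptions.

```python
import time


def run_until_limit(rows, max_rows=None, max_runtime_seconds=None, clock=time.monotonic):
    """Consume rows until either limit is reached. A limit of None is treated as not set."""
    start = clock()
    processed = []
    for row in rows:
        # Stop when the row limit has been reached.
        if max_rows is not None and len(processed) >= max_rows:
            break
        # Stop when the runtime limit has been reached.
        if max_runtime_seconds is not None and clock() - start >= max_runtime_seconds:
            break
        processed.append(row)
    return processed


if __name__ == "__main__":
    # With both limits set, the row limit is reached long before ten seconds elapse.
    print(run_until_limit(range(1000), max_rows=5, max_runtime_seconds=10))
```

The injectable `clock` parameter exists only so that the time-based limit can be exercised in a test without waiting.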
Specify the following streaming properties:
Batch interval. The Spark engine processes the streaming data from sources and publishes the data in batches. The batch interval is the number of seconds after which a batch is submitted for processing.
Cache refresh interval. You can cache a large lookup source or small lookup tables. When you cache the lookup source, the Data Integration Service queries the lookup cache instead of querying the lookup source for each input row. You can configure the interval for refreshing the cache used in a relational Lookup transformation.
When you run a streaming mapping, if the Hive lookup table is in ORC or Parquet format, the minimum value for the cache refresh interval property is 1.
State Store Connection. You can select an external storage connection for the state store. The default external storage connection is HDFS. You can browse the state store connection property to select Amazon S3, Microsoft Azure Data Lake Storage Gen1, or Microsoft Azure Data Lake Storage Gen2 as the external storage. You can also use a parameterized connection.
Checkpoint Directory. You can specify a checkpoint directory so that a mapping can resume reading data from the point of failure when the mapping fails, or from the point at which a cluster is deleted. The checkpoint directory is created within the directory that you specify in the State Store property.
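The cache refresh interval described in the list above can be pictured with a small sketch: lookups are answered from an in-memory copy of the lookup source, and the copy is reloaded once the interval has elapsed. This is a conceptual illustration only, not the Data Integration Service implementation. The class name `RefreshingLookupCache`, the `load_source` callable, and the time unit of the interval are assumptions.

```python
import time


class RefreshingLookupCache:
    """Serve lookups from a cached copy and reload it after the refresh interval."""

    def __init__(self, load_source, refresh_interval, clock=time.monotonic):
        self._load_source = load_source          # callable that returns a dict of key -> row
        self._refresh_interval = refresh_interval
        self._clock = clock
        self._cache = load_source()              # initial load of the lookup source
        self._loaded_at = clock()

    def lookup(self, key):
        now = self._clock()
        # Reload the whole cache when the refresh interval has elapsed.
        if now - self._loaded_at >= self._refresh_interval:
            self._cache = self._load_source()
            self._loaded_at = now
        # Answer from the cache instead of querying the source for each input row.
        return self._cache.get(key)


if __name__ == "__main__":
    # A hypothetical lookup source with a single row.
    cache = RefreshingLookupCache(lambda: {"US": "United States"}, refresh_interval=60)
    print(cache.lookup("US"))
```

A shorter interval keeps lookup results fresher at the cost of reloading the source more often.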
The Developer tool applies configuration properties when you run streaming mappings. Set configuration properties for streaming mappings in the
Configure the following source properties:
Read all rows. Reads all rows from the source.
Read up to how many rows. The maximum number of rows to read from the source if you do not read all rows.
Maximum runtime interval. The maximum time to run the mapping before it stops. If you set values for this property and the Maximum Rows Read property, the mapping stops running after one of the criteria is met.
When you run the mapping, the Data Integration Service converts the mapping to a Scala program, packages it in a JAR file, and sends it to the Hadoop cluster. You can view the details in the Spark execution plan in the Developer tool or Administrator tool.