To configure a streaming mapping for Databricks, configure the run-time environments and the connection for the mapping.
When you configure the mapping, configure the following properties:
Validation environment
The environment that validates the mapping. Select Databricks as the validation environment and select the Databricks engine. The Data Integration Service pushes the mapping logic to the Databricks engine.
Execution environment
The environment that runs the mapping. Select Databricks as the execution environment.
Databricks
Connection: The connection to the Databricks Spark engine used for pushdown of processing. Select Connection and browse for a connection, or select a connection parameter.
If the mapping fails, to enable the mapping to resume reading data from the time of the failure, configure the infaspark.checkpoint.directory property. For example: infaspark.checkpoint.directory <directory>. The directory that you specify is created within the directory that you specify in the State Store property.
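The checkpoint directory is what allows a restarted job to resume from its last committed position rather than reprocessing all of the data. As a rough illustration of the underlying mechanism, not the Informatica configuration itself, the following PySpark sketch shows how Spark Structured Streaming uses a checkpoint location to recover on restart. The rate source and the paths are hypothetical placeholders.

    # Minimal PySpark sketch of checkpoint-based recovery in Spark Structured Streaming.
    # Assumptions: the paths are placeholders and the "rate" source stands in for a
    # real streaming source. This illustrates the concept behind the checkpoint
    # directory, not the Informatica property itself.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-sketch").getOrCreate()

    # Built-in test source that emits rows continuously.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "/tmp/streaming-demo/output")             # hypothetical sink path
        .option("checkpointLocation", "/tmp/streaming-demo/ckpt")  # offsets and state land here
        .start()
    )

    # If the job fails and is restarted with the same checkpointLocation,
    # it resumes from the last committed batch.
    query.awaitTermination()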
Source configuration
Specify the following properties to configure how the data is processed:
Maximum Rows Read. Specify the maximum number of rows that are read before the mapping stops running. Default is Read All Rows.
Maximum Runtime Interval. Specify the maximum time to run the mapping before it stops. If you set values for both this property and the Maximum Rows Read property, the mapping stops running after one of the criteria is met. Default is Run Indefinitely. A value of Run Indefinitely enables the mapping to run without stopping.
State Store. Specify the DBFS location on the cluster to store information about the state of the Databricks Job. Default is <Home Directory>/stateStore. You can configure the state store as part of the execution options for the Data Integration Service.
You can use these properties to test the mapping.
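As a point of reference for how a time-bounded test run behaves, the following PySpark sketch shows one way a streaming query can be stopped after a fixed wall-clock limit in plain Spark Structured Streaming. The 60-second limit and the console sink are illustrative assumptions, not Informatica settings; the Maximum Runtime Interval property is configured in the mapping, not in code.

    # Sketch of a streaming run bounded by a wall-clock limit, analogous in spirit
    # to the Maximum Runtime Interval property. Values here are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bounded-test-run").getOrCreate()

    # Test source that emits a few rows per second.
    events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    # Print each micro-batch so the test output can be inspected.
    query = events.writeStream.format("console").start()

    # Wait up to 60 seconds. awaitTermination returns False if the query is
    # still running when the timeout expires, so stop it explicitly.
    if not query.awaitTermination(timeout=60):
        query.stop()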
Streaming properties
Specify the batch interval streaming properties. The batch interval is the number of seconds after which a batch is submitted for processing. Based on the batch interval, the Spark engine processes the streaming data from sources and publishes the data in batches.
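For context on what the batch interval controls, the following PySpark sketch shows the equivalent idea in Spark Structured Streaming, where a processing-time trigger submits a micro-batch at a fixed interval. The 10-second interval and the console sink are illustrative assumptions, not values taken from the mapping properties.

    # Sketch of a fixed micro-batch interval in Spark Structured Streaming,
    # the same idea as the batch interval described above. Values are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("batch-interval-sketch").getOrCreate()

    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # A processing-time trigger submits a micro-batch every 10 seconds.
    query = (
        events.writeStream
        .format("console")
        .trigger(processingTime="10 seconds")
        .start()
    )

    query.awaitTermination()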