Guidelines for Databricks targets

Consider the following guidelines when you use Databricks targets:
  • When you use a Databricks target for the first time, perform the following steps before you configure an application ingestion and replication task for the target:
    1. In the Databricks connection properties, set the JDBC Driver Class Name property to com.databricks.client.jdbc.Driver.
    2. On Windows, install Visual C++ Redistributable Packages for Visual Studio 2013 on the computer where the Secure Agent runs.
  • For incremental load jobs, you must enable Change Data Capture (CDC) for all source fields.
  • You can access Databricks tables created on top of the following storage types:
    • Microsoft Azure Data Lake Storage (ADLS) Gen2
    • Amazon Web Services (AWS) S3
    The Databricks connection uses a JDBC URL to connect to the Databricks cluster. When you configure the target, specify the JDBC URL and credentials to use for connecting to the cluster. Also define the connection information that the target uses to connect to the staging location in Amazon S3 or ADLS Gen2.
  • Before writing data to Databricks target tables, application ingestion and replication jobs stage the data in an Amazon S3 bucket or ADLS directory. You must specify the directory for the data when you configure the application ingestion and replication task. Application Ingestion and Replication does not use the ADLS Staging Filesystem Name and S3 Staging Bucket properties in the Databricks connection properties to determine the directory.
  • Application Ingestion and Replication uses jobs that run once to load data from staging files on AWS S3 or ADLS Gen2 to external tables. By default, Application Ingestion and Replication runs jobs on the cluster that is specified in the Databricks connection properties. If you want to run the jobs on another cluster, set the dbDeltaUseExistingCluster custom property to false on the Target page in the application ingestion and replication task wizard.
  • If the cluster specified in the Databricks connection properties is not up and running, the application ingestion and replication job waits for the cluster to start. By default, the job waits for 10 minutes. If the cluster does not start within 10 minutes, the connection times out and deployment of the job fails.
    If you want to increase the timeout value for the connection, set the dbClusterStartWaitingTime custom property to the maximum time in seconds for which the ingestion and replication job must wait for the cluster to be up and running. You can set the custom property on the Target page in the application ingestion and replication task wizard.
  • By default, Application Ingestion and Replication uses the Databricks COPY INTO feature to load data from the staging files to Databricks target tables. You can disable it for all load types by setting the writerDatabricksUseSqlLoad custom property to false on the Target page in the application ingestion and replication task wizard. For an illustration of a COPY INTO statement, see the COPY INTO example after this list.
  • If you use an AWS cluster, you must specify the S3 Service Regional Endpoint value in the Databricks connection properties. For example:
    s3.us-east-2.amazonaws.com
    Before you can test a Databricks connection, you must specify the JDBC URL in the SQL Endpoint JDBC URL field in the Databricks connection properties. After you test the connection, remove the SQL Endpoint JDBC URL value. Otherwise, when you define an application ingestion and replication task that uses the connection, a design-time error occurs because Application Ingestion and Replication tries to use the JDBC URL as well as the required Databricks Host, Cluster ID, Organization ID, and Databricks Token values to connect to the target, resulting in login failures.
  • Processing of Rename Field operations on Databricks target tables, without the need to rewrite the underlying Parquet files, requires the Databricks Column Mapping feature with Databricks Runtime 10.2 or later. If you set the Rename Field option to Replicate on the Schedule and Runtime Options page in the task wizard, you must alter the generated target table to set the following Databricks table properties after task deployment and before you run the job:
    ALTER TABLE <target_table> SET TBLPROPERTIES ( 'delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5')
    These properties enable the Databricks Column Mapping feature with the required reader and writer versions. If you do not set these properties, the application ingestion and replication job will fail. To confirm that the properties are set, see the SHOW TBLPROPERTIES example after this list.
  • Application ingestion and replication tasks that have Databricks targets can get schema information for generating the target tables from Databricks Unity Catalog. To enable access to information in a Unity Catalog, specify the catalog name in the Catalog Name field in the Databricks connection properties. The catalog name is appended to the SQL Warehouse JDBC URL value for a data warehouse. Catalog use is optional for SQL warehouses and does not apply to job clusters. To check that the catalog is visible to the connection user, see the catalog example after this list.
    If you use Unity Catalog, a personal storage location is automatically provisioned. To use the personal staging location, in the Staging Environment field of the connection properties, select Personal Staging Location. The Parquet data files for ingestion jobs can then be staged to the personal staging location, which has a data retention period of 7 days. By default, the staging location is stage://tmp/<user_name> in the root AWS or Azure location. The <user_name> is taken from the Databricks Token connection property. This user must have read and write access to the personal staging location.
  • You can stage data in a volume in Databricks. Volumes are Unity Catalog objects used to manage and secure non-tabular datasets such as files and directories. Volumes can be of the following types:
    • Managed volumes
      • The data is stored in a location fully controlled and managed by Databricks. If you use volumes as the staging environment, you can specify the volume path where staging files are generated in the Volume Path field of the connection properties. Files are written only up to the dataDirectory level.
      • If you do not provide a path to the files within a volume, a new volume is generated automatically in the format /Volumes/<catalogName>/<databaseName>/<volume_identifier>.
      • The generated volume is deleted when an initial load job is successfully completed. Only a volume that is empty is automatically deleted. The volume is not deleted when the initial load phase of a combined load job is completed.
    • Unmanaged (external) volumes
      • The data is stored in user-managed storage locations in Amazon S3 or Microsoft Azure Data Lake Storage, which you mount or access from Databricks. For a sketch of creating an external volume, see the external volume example after this list.
      • You must provide the volume path where staging files are generated in the Volume Path field of the connection properties. Files are written only up to the dataDirectory level.
      • The volume is not deleted when an initial load job or the initial load phase of a combined load job is completed.
  • If you generated Databricks unmanaged tables as the target tables and no longer need a target unmanaged table, use a SQL DROP TABLE statement to delete the table from the target database, as shown in the DROP TABLE example after this list. You must not manually delete the external directory in Amazon S3 or Azure Data Lake Storage for the unmanaged table. If you do so and try to deploy another job that uses that table, the deployment fails with a Metadata Handler error.
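
COPY INTO example. Application Ingestion and Replication generates the actual load statements internally, so the following is only an illustrative sketch of the general shape of a Databricks COPY INTO command that loads Parquet staging files into a target table. The schema, table, and bucket names are placeholders, not values that the product uses.

  COPY INTO target_schema.orders
  FROM 's3://example-staging-bucket/cdi/orders/'
  FILEFORMAT = PARQUET;

If you set the writerDatabricksUseSqlLoad custom property to false, the jobs no longer use this SQL-based load path.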
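
SHOW TBLPROPERTIES example. After you run the ALTER TABLE statement that the Rename Field guideline requires, you can optionally confirm the table properties with the standard Databricks SQL SHOW TBLPROPERTIES command. <target_table> is the same placeholder that the ALTER TABLE statement uses.

  SHOW TBLPROPERTIES <target_table>;

The output should list delta.columnMapping.mode with the value name, delta.minReaderVersion with the value 2, and delta.minWriterVersion with the value 5.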
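
Catalog example. To check the catalog name that you plan to enter in the Catalog Name field, you can run the standard Unity Catalog commands SHOW CATALOGS and DESCRIBE CATALOG from a Databricks SQL session as the user whose token the connection uses. The catalog name my_unity_catalog is a placeholder.

  SHOW CATALOGS;
  DESCRIBE CATALOG my_unity_catalog;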
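
External volume example. For the unmanaged (external) volume guideline, the volume that you reference in the Volume Path field is typically created in Unity Catalog beforehand. The following sketch assumes that a Unity Catalog external location already covers the S3 path; the catalog, schema, volume, and bucket names are placeholders.

  CREATE EXTERNAL VOLUME my_catalog.my_schema.staging_volume
  LOCATION 's3://example-staging-bucket/ingestion/';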
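
DROP TABLE example. As noted in the guideline on unmanaged target tables, delete a target unmanaged table that you no longer need with a SQL DROP TABLE statement rather than by removing its external directory. The three-part table name below is a placeholder.

  DROP TABLE my_catalog.my_schema.orders_target;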
