Guidelines for Databricks targets

Consider the following guidelines when you use Databricks targets:
  • When you use a Databricks target for the first time, perform the following steps before you configure an application ingestion and replication task for the target:
    1. In the Databricks connection properties, set the JDBC Driver Class Name property to com.databricks.client.jdbc.Driver.
    2. On Windows, install Visual C++ Redistributable Packages for Visual Studio 2013 on the computer where the Secure Agent runs.
  • For incremental load jobs, you must enable Change Data Capture (CDC) for all source fields.
  • You can access Databricks tables created on top of the following storage types:
    • Microsoft Azure Data Lake Storage (ADLS) Gen2
    • Amazon Web Services (AWS) S3
    The Databricks connection uses a JDBC URL to connect to the Databricks cluster. When you configure the target, specify the JDBC URL and credentials to use for connecting to the cluster. Also define the connection information that the target uses to connect to the staging location in Amazon S3 or ADLS Gen2.
  • Before writing data to Databricks target tables, application ingestion and replication jobs stage the data in an Amazon S3 bucket or ADLS directory. You must specify the directory for the data when you configure the application ingestion and replication task. Application Ingestion and Replication does not use the ADLS Staging Filesystem Name and S3 Staging Bucket properties in the Databricks connection properties to determine the directory.
  • Application Ingestion and Replication uses jobs that run once to load data from staging files on AWS S3 or ADLS Gen2 to external tables. By default, Application Ingestion and Replication runs jobs on the cluster that is specified in the Databricks connection properties. If you want to run the jobs on another cluster, set the dbDeltaUseExistingCluster custom property to false on the Target page in the application ingestion and replication task wizard.
  • If the cluster specified in the Databricks connection properties is not up and running, the application ingestion and replication job waits for the cluster to start. By default, the job waits for 10 minutes. If the cluster does not start within 10 minutes, the connection times out and deployment of the job fails.
    If you want to increase the timeout value for the connection, set the dbClusterStartWaitingTime custom property to the maximum time in seconds for which the ingestion and replication job must wait for the cluster to be up and running. You can set the custom property on the Target page in the application ingestion and replication task wizard.
  • By default, Application Ingestion and Replication uses the Databricks COPY INTO feature to load data from the staging files to Databricks target tables. You can disable it for all load types by setting the writerDatabricksUseSqlLoad custom property to false on the Target page in the application ingestion and replication task wizard. For an illustration of a COPY INTO statement, see the COPY INTO example after this list.
  • If you use an AWS cluster, you must specify the S3 Service Regional Endpoint value in the Databricks connection properties. For example:
    s3.us-east-2.amazonaws.com
    Before you can test a Databricks connection, you must specify the JDBC URL in the SQL Endpoint JDBC URL field in the Databricks connection properties. After you test the connection, remove the SQL Endpoint JDBC URL value. Otherwise, when you define an application ingestion and replication task that uses the connection, a design-time error occurs because Application Ingestion and Replication tries to use the JDBC URL as well as the required Databricks Host, Cluster ID, Organization ID, and Databricks Token values to connect to the target, resulting in login failures.
  • Processing of Rename Field operations on Databricks target tables, without the need to rewrite the underlying Parquet files, requires the Databricks Column Mapping feature with Databricks Runtime 10.2 or later. If you set the Rename Field option to Replicate on the Schedule and Runtime Options page in the task wizard, you must alter the generated target table to set the following Databricks table properties after task deployment and before you run the job:
    ALTER TABLE <target_table> SET TBLPROPERTIES ( 'delta.columnMapping.mode' = 'name', 'delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5')
    These properties enable the Databricks Column Mapping feature with the required reader and writer versions. If you do not set these properties, the application ingestion and replication job will fail. To confirm that the properties are set, see the SHOW TBLPROPERTIES example after this list.
  • Application ingestion and replication tasks that have Databricks targets can get schema information for generating the target tables from Databricks Unity Catalog. To enable access to information in a Unity Catalog, specify the catalog name in the Catalog Name field in the Databricks connection properties. The catalog name is appended to the SQL Warehouse JDBC URL value for a data warehouse. Catalog use is optional for SQL warehouses and does not apply to job clusters. To check that the catalog is visible to the connection user, see the catalog example after this list.
    If you use Unity Catalog, a personal storage location is automatically provisioned. To use the personal staging location, in the Staging Environment field of the connection properties, select Personal Staging Location. The Parquet data files for ingestion jobs can then be staged to the personal staging location, which has a data retention period of 7 days. By default, the staging location is stage://tmp/<user_name> in the root AWS or Azure location. The <user_name> is taken from the Databricks Token connection property. This user must have read and write access to the personal staging location.
  • You can stage data in a volume in Databricks. Volumes are Unity Catalog objects used to manage and secure non-tabular datasets such as files and directories. Volumes can be of the following types:
    • Managed volumes
      • The data is stored in a location fully controlled and managed by Databricks. If you use volumes as the staging environment, you can specify the volume path where staging files are generated in the Volume Path field of the connection properties. Files are written only up to the dataDirectory level.
      • If you do not provide a path to the files within a volume, a new volume is generated automatically in the format /Volumes/<catalogName>/<databaseName>/<volume_identifier>.
      • The generated volume is deleted when an initial load job is successfully completed. Only a volume that is empty is automatically deleted. The volume is not deleted when the initial load phase of a combined load job is completed.
    • Unmanaged (external) volumes
      • The data is stored in user-managed storage locations in Amazon S3 or Microsoft Azure Data Lake Storage, which you mount or access from Databricks. For a sketch of creating an external volume, see the external volume example after this list.
      • You must provide the volume path where staging files are generated in the Volume Path field of the connection properties. Files are written only up to the dataDirectory level.
      • The volume is not deleted when an initial load job or the initial load phase of a combined load job is completed.
  • If you generated Databricks unmanaged tables as the target tables and no longer need a target unmanaged table, use a SQL DROP TABLE statement to delete the table from the target database, as shown in the DROP TABLE example after this list. You must not manually delete the external directory in Amazon S3 or Azure Data Lake Storage for the unmanaged table. If you do so and try to deploy another job that uses that table, the deployment fails with a Metadata Handler error.
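
COPY INTO example. Application Ingestion and Replication generates the actual load statements internally, so the following is only an illustrative sketch of the general shape of a Databricks COPY INTO command that loads Parquet staging files into a target table. The schema, table, and bucket names are placeholders, not values that the product uses.

  COPY INTO target_schema.orders
  FROM 's3://example-staging-bucket/cdi/orders/'
  FILEFORMAT = PARQUET;

If you set the writerDatabricksUseSqlLoad custom property to false, the jobs no longer use this SQL-based load path.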
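
SHOW TBLPROPERTIES example. After you run the ALTER TABLE statement that the Rename Field guideline requires, you can optionally confirm the table properties with the standard Databricks SQL SHOW TBLPROPERTIES command. <target_table> is the same placeholder that the ALTER TABLE statement uses.

  SHOW TBLPROPERTIES <target_table>;

The output should list delta.columnMapping.mode with the value name, delta.minReaderVersion with the value 2, and delta.minWriterVersion with the value 5.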
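
Catalog example. To check the catalog name that you plan to enter in the Catalog Name field, you can run the standard Unity Catalog commands SHOW CATALOGS and DESCRIBE CATALOG from a Databricks SQL session as the user whose token the connection uses. The catalog name my_unity_catalog is a placeholder.

  SHOW CATALOGS;
  DESCRIBE CATALOG my_unity_catalog;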
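
External volume example. For the unmanaged (external) volume guideline, the volume that you reference in the Volume Path field is typically created in Unity Catalog beforehand. The following sketch assumes that a Unity Catalog external location already covers the S3 path; the catalog, schema, volume, and bucket names are placeholders.

  CREATE EXTERNAL VOLUME my_catalog.my_schema.staging_volume
  LOCATION 's3://example-staging-bucket/ingestion/';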
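
DROP TABLE example. As noted in the guideline on unmanaged target tables, delete a target unmanaged table that you no longer need with a SQL DROP TABLE statement rather than by removing its external directory. The three-part table name below is a placeholder.

  DROP TABLE my_catalog.my_schema.orders_target;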
