Table of Contents

Search

  1. Preface
  2. Data Profiling
  3. Profiles
  4. Profile results
  5. Tuning data profiling task performance
  6. Troubleshooting

Data Profiling

Data Profiling

Databricks Delta

Databricks Delta

To access a Databricks Delta source object, you must create a Databricks Delta connection to the source object. You can run a profile on Databricks Delta Lake and Databricks Unity Catalog source systems. You can run profiles on Databricks Delta tables created using all-purpose clusters.
You can run a profiling task on a Databricks Delta source on a Data Integration Server and on an advanced cluster. The profiling task runs on a Data Integration Server by default. The Data Integration Server uses the Databricks Delta connector to read Databricks Delta source objects.
You can run profiling tasks on tables and views on a Data Integration Server. Use an advanced cluster to profile complex data types such as maps, structures, and arrays in a Databricks Delta source.
When you run a profiling task on a Data Integration Server, the task flattens the complex data types to Nstrings, and the complex data type columns are returned as string data types in the profiling results.
Configure the following Databricks Delta connection properties to create and run a data profiling task on a Databricks Delta source object:
Property
Description
Connection Name
Name of the connection.
Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -,
Maximum length is 255 characters.
Description
Description of the connection. Maximum length is 4000 characters.
Type
Databricks Delta
Runtime Environment
The name of the runtime environment where you want to run tasks.
SQL Warehouse JDBC URL
Databricks SQL Warehouse JDBC connection URL.
Required to connect to a Databricks SQL warehouse. Doesn't apply to Databricks clusters.
Databricks SQL Serverless is the recommended Databricks cluster type.
Databricks Token
Personal access token to access Databricks.
Required for SQL warehouse and Databricks cluster.
Catalog Name
If you use Unity Catalog, the name of an existing catalog in the metastore.
Optional for SQL warehouse. Doesn't apply to Databricks cluster.
You can also specify the catalog name in the end of the SQL warehouse JDBC URL.
The catalog name cannot contain special characters.
For more information about Unity Catalog, see the Databricks Delta documentation.

Advanced settings

The following table describes the advanced connection properties:
Property
Description
Database
The database name that you want to connect to in Databricks Delta.
Optional for SQL warehouse and Databricks cluster.
For Data Integration, if you do not provide a database name, all databases available in the workspace are listed. The value you provide here overrides the database name provided in the
SQL Warehouse JDBC URL
connection property.
By default, all databases available in the workspace are listed.
JDBC Driver Class Name
The name of the JDBC driver class.
Optional for SQL warehouse and Databricks cluster.
For JDBC URL versions 2.6.22 or earlier, specify the driver class name as
com.simba.spark.jdbc.Driver
.
For JDBC URL versions 2.6.25 or later, specify the driver class name as
com.databricks.client.jdbc.Driver
.
Specify the driver class name as
com.simba.spark.jdbc.Driver
for the data loader task.
For application ingestion and database ingestion tasks, specify the driver class name as:
com.databricks.client.jdbc.Driver
Staging Environment
The cloud provider where the Databricks cluster is deployed.
Required for SQL warehouse and Databricks cluster.
Select one of the following options:
  • AWS
  • Azure
  • Personal Staging Location. Does not apply to Data Profiling.
Default is Personal Staging Location.
You can select the Personal Staging Location as the staging environment instead of Azure or AWS staging environments to stage data locally for mappings and tasks.
Personal staging location doesn't apply to Databricks cluster.
You cannot switch between clusters once you establish a connection.
Databricks Host
The host name of the endpoint the Databricks account belongs to.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
You can get the Databicks Host from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks Delta all-purpose cluster.
The following example shows the Databicks Host in JDBC URL:
jdbc:spark://
<Databricks Host>
:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Org Id>/<Cluster ID>; AuthMech=3; UID=token; PWD=<personal-access-token>
The value of PWD in Databricks Host, Organization Id, and Cluster ID is always
<personal-access-token>
.
Doesn't apply to a data loader task.
Cluster ID
The ID of the cluster.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
You can get the cluster ID from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks Delta all-purpose cluster
The following example shows the Cluster ID in JDBC URL:
jdbc:spark://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Org Id>/
<Cluster ID>
; AuthMech=3;UID=token; PWD=<personal-access-token>
Doesn't apply to a data loader task.
Organization ID
The unique organization ID for the workspace in Databricks.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
You can get the Organization ID from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks Delta all-purpose cluster
The following example shows the Organization ID in JDBC URL:
jdbc:spark://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/
<Organization ID>
/ <Cluster ID>;AuthMech=3;UID=token; PWD=<personal-access-token>
Doesn't apply to a data loader task.
Min Workers
1
The minimum number of worker nodes to be used for the Spark job. Minimum value is 1.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
Doesn't apply to a data loader task.
Max Workers
1
The maximum number of worker nodes to be used for the Spark job. If you don't want to autoscale, set Max Workers = Min Workers or don't set Max Workers.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
Doesn't apply to a data loader task.
DB Runtime Version
1
The version of Databricks cluster to spawn when you connect to Databricks cluster to process mappings.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
Select the runtime version 9.1 LTS.
Doesn't apply to a data loader task.
Worker Node Type
1
The worker node instance type that is used to run the Spark job.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
For example, the worker node type for AWS can be i3.2xlarge. The worker node type for Azure can be Standard_DS3_v2.
Doesn't apply to a data loader task.
Driver Node Type
1
The driver node instance type that is used to collect data from the Spark workers.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
For example, the driver node type for AWS can be i3.2xlarge. The driver node type for Azure can be Standard_DS3_v2.
If you don't specify the driver node type, Databricks uses the value you specify in the worker node type field.
Doesn't apply to a data loader task.
Instance Pool ID
1
The instance pool ID used for the Spark cluster.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
If you specify the Instance Pool ID
to run mappings
, the following connection properties are ignored:
  • Driver Node Type
  • EBS Volume Count
  • EBS Volume Type
  • EBS Volume Size
  • Enable Elastic Disk
  • Worker Node Type
  • Zone ID
Doesn't apply to a data loader task.
Elastic Disk
1
Enables the cluster to get additional disk space.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
Enable this option if the Spark workers are running low on disk space.
Doesn't apply to a data loader task.
Spark Configuration
1
The Spark configuration to use in the Databricks cluster.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
The configuration must be in the following format:
"key1"="value1";"key2"="value2";...
For example,
"spark.executor.userClassPathFirst"="False"
Doesn't apply to a data loader task or to Mass Ingestion tasks.
Spark Environment Variables
1
The environment variables to export before launching the Spark driver and workers.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
The variables must be in the following format:
"key1"="value1";"key2"="value2";...
For example,
"MY_ENVIRONMENT_VARIABLE"="true"
Doesn't apply to a data loader task or to Mass Ingestion tasks.
1
Doesn't apply to mappings in advanced mode.

Azure Staging Environment

The following table describes the properties for the Azure staging environment:
Property
Description
ADLS Storage Account Name
The name of the Microsoft Azure Data Lake Storage account.
ADLS Client ID
The ID of your application to complete the OAuth Authentication in the Active Directory.
ADLS Client Secret
The client secret key to complete the OAuth Authentication in the Active Directory.
ADLS Tenant ID
The ID of the Microsoft Azure Data Lake Storage directory that you use to write data.
ADLS Endpoint
The OAuth 2.0 token endpoint from where authentication based on the client ID and client secret is completed.
ADLS Filesystem Name
The name of an existing file system to store the Databricks Delta data.
ADLS Staging Filesystem Name
The name of an existing file system to store the staging data.

AWS Staging Environment

The following table describes the properties for the AWS staging environment:
Property
Description
S3 Access Key
The key to access the Amazon S3 bucket.
S3 Secret Key
The secret key to access the Amazon S3 bucket.
S3 Data Bucket
The existing bucket to store the Databricks Delta data.
S3 Staging Bucket
The existing bucket to store the staging files.
S3 Authentication Mode
The authentication mode to access Amazon S3.
Select one of the following authentication modes:
  • Permanent IAM credentials. Uses the S3 access key and S3 secret key to connect to Databricks Delta.
  • IAM Assume Role. Uses the AssumeRole for IAM authentication to connect to Databricks Delta. Doesn't apply to Databricks cluster.
IAM Role ARN
The Amazon Resource Number (ARN) of the IAM role assumed by the user to use the dynamically generated temporary security credentials.
Set the value of this property if you want to use the temporary security credentials to access the Amazon S3 staging bucket.
For more information about how to get the ARN of the IAM role, see the
AWS documentation
.
Use EC2 Role to Assume Role
Optional. Select the check box to enable the EC2 role to assume another IAM role specified in the IAM Role ARN option.
The EC2 role must have a policy attached with a permission to assume an IAM role from the same or different AWS account.
S3 Region Name
The AWS cluster region in which the bucket you want to access resides.
Select a cluster region if you choose to provide a custom JDBC URL that does not contain a cluster region name in the JDBC URL connection property.
S3 Service Regional Endpoint
The S3 regional endpoint when the S3 data bucket and the S3 staging bucket need to be accessed through a region-specific S3 regional endpoint.
Doesn't apply to Databricks cluster.
Default is
s3.amazonaws.com
.
Zone ID
The zone ID for the Databricks job cluster.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
Applies only if you want to create a Databricks job cluster in a particular zone at runtime.
For example,
us-west-2a
.
The zone must be in the same region where your Databricks account resides.
EBS Volume Type
The type of EBS volumes launched with the cluster.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
EBS Volume Count
The number of EBS volumes launched for each instance. You can choose up to 10 volumes.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
In a Databricks Delta connection, specify at least one EBS volume for node types with no instance store. Otherwise, cluster creation fails.
EBS Volume Size
The size of a single EBS volume in GiB launched for an instance.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
For more information about the Databricks Delta connection properties, see the help for the Databricks Delta connector.

Execution methods for Databricks Delta

You can run a profiling task on a Databricks Delta source on a Data Integration Server and on an advanced cluster.
The following table lists the functionalities available for each of the execution methods:
Functionality
Advanced Mode
Data Integration Server
Profiling on a Managed Databricks Delta table
YES
YES
Profiling on an External Databricks Delta table
YES
YES
Profiling on the view table
NO
YES
Data Preview
NO
YES
Simple Filter
NO
YES
Custom Query
NO
YES
SQL Override in the Advanced Options
NO
YES
FIRST N Rows
NO
YES
Drilldown
NO
YES
Query on profile results
NO
YES
Database Name or Table Name override in the Advanced Options
NO
YES
Ability to create a Databricks Delta source mapplet profile
NO
NO
Support of Data Quality Insight
NO
YES
Ability to create scorecard metrics
YES
YES
Compare Column
YES
YES
Compare Profile runs
YES
YES
Export/Import of Databricks profiles
YES
YES
Ability to link columns of files to Data Quality assets
YES
YES

0 COMMENTS

We’d like to hear from you!