Data Profiling
| Property | Description |
|---|---|
| Connection Name | Name of the connection.<br>Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -<br>Maximum length is 255 characters. |
| Description | Description of the connection. Maximum length is 4000 characters. |
| Use Secret Vault | Stores sensitive credentials for this connection in the secrets manager that is configured for your organization.<br>This property appears only if a secrets manager is set up for your organization.<br>When you enable the secret vault in the connection, you can select which credentials the Secure Agent retrieves from the secrets manager. If you don't enable this option, the credentials are stored in the repository or on a local Secure Agent, depending on how your organization is configured.<br>If you're using this connection to apply data access policies through pushdown or proxy services, you cannot use the Secret Vault configuration option.<br>For information about how to configure and use a secrets manager, see "Secrets manager configuration" in the Administrator help. |
| Runtime Environment | The name of the runtime environment where you want to run tasks.<br>Select a Secure Agent, Hosted Agent, serverless, or elastic runtime environment. Hosted Agent is not applicable for mappings in advanced mode.<br>For more information about how to configure and use runtime environments, see Runtime Environments in the Administrator help. |
| SQL Warehouse JDBC URL | Databricks SQL Warehouse JDBC connection URL.<br>Required only for a Databricks SQL warehouse. Doesn't apply to all-purpose clusters and job clusters.<br>To get the SQL Warehouse JDBC URL, go to the Databricks console and select the JDBC driver version from the JDBC URL menu.<br>Specify the JDBC URL for Databricks JDBC driver version 2.6.25 or later in the following format:<br>`jdbc:databricks://<Databricks Host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/<SQL endpoint cluster ID>;`<br>Ensure that you set the required environment variables in the Secure Agent. Also specify the correct JDBC Driver Class Name under the advanced connection settings.<br>Specify the database name in the Database connection property. If you specify the database name in the JDBC URL, it is ignored. |
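As an illustration of the URL format above, the following sketch assembles a SQL warehouse JDBC URL from its parts. The host and SQL endpoint cluster ID values are hypothetical placeholders, not values from this document:

```python
# Sketch: assemble a Databricks SQL Warehouse JDBC URL (driver version 2.6.25
# or later). The host and endpoint cluster ID below are hypothetical examples.
def build_sql_warehouse_url(host: str, endpoint_id: str) -> str:
    return (
        f"jdbc:databricks://{host}:443/default;"
        "transportMode=http;ssl=1;AuthMech=3;"
        f"httpPath=/sql/1.0/endpoints/{endpoint_id};"
    )

url = build_sql_warehouse_url("dbc-a1b2c3d4-e5f6.cloud.databricks.com",
                              "1234567890abcdef")
print(url)
```

Note that `AuthMech=3` corresponds to token-based authentication, which is why the URL itself carries no user name.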
| Property | Description |
|---|---|
| Database | The database name that you want to connect to in Databricks Delta.<br>Optional for SQL warehouse and Databricks cluster.<br>For Data Integration, if you do not provide a database name, all databases available in the workspace are listed. The value you provide here overrides the database name provided in the SQL Warehouse JDBC URL connection property. |
| JDBC Driver Class Name | The name of the JDBC driver class.<br>Optional for SQL warehouse and Databricks cluster.<br>For JDBC URL version 2.6.22 or earlier, specify the driver class name as `com.simba.spark.jdbc.Driver`.<br>For JDBC URL version 2.6.25 or later, specify the driver class name as `com.databricks.client.jdbc.Driver`.<br>For a data loader task, specify the driver class name as `com.simba.spark.jdbc.Driver`.<br>For application ingestion and database ingestion tasks, specify the driver class name as `com.databricks.client.jdbc.Driver`. |
| Staging Environment | The cloud provider where the Databricks cluster is deployed.<br>Required for SQL warehouse and Databricks cluster.<br>Select one of the following options: AWS, Azure, or Personal Staging Location. Default is Personal Staging Location.<br>You can select Personal Staging Location instead of the Azure or AWS staging environments to stage data locally for mappings and tasks. Personal Staging Location doesn't apply to Databricks cluster.<br>You cannot switch between clusters after you establish a connection. |
| Databricks Host | The host name of the endpoint that the Databricks account belongs to.<br>Required for Databricks cluster. Doesn't apply to SQL warehouse.<br>You can get the Databricks host from the JDBC URL, which is available under the JDBC/ODBC advanced options of the Databricks Delta all-purpose cluster.<br>The following example shows the Databricks host in the JDBC URL:<br>`jdbc:spark://<Databricks Host>:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<Org Id>/<Cluster ID>;AuthMech=3;UID=token;PWD=<personal-access-token>`<br>The value of PWD in the Databricks Host, Organization ID, and Cluster ID is always `<personal-access-token>`.<br>Doesn't apply to a data loader task. |
| Cluster ID | The ID of the cluster.<br>Required for Databricks cluster. Doesn't apply to SQL warehouse.<br>You can get the cluster ID from the JDBC URL, which is available under the JDBC/ODBC advanced options of the Databricks Delta all-purpose cluster.<br>The following example shows the cluster ID in the JDBC URL:<br>`jdbc:spark://<Databricks Host>:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<Org Id>/<Cluster ID>;AuthMech=3;UID=token;PWD=<personal-access-token>`<br>Doesn't apply to a data loader task. |
| Organization ID | The unique organization ID for the workspace in Databricks.<br>Required for Databricks cluster. Doesn't apply to SQL warehouse.<br>You can get the organization ID from the JDBC URL, which is available under the JDBC/ODBC advanced options of the Databricks Delta all-purpose cluster.<br>The following example shows the organization ID in the JDBC URL:<br>`jdbc:spark://<Databricks Host>:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<Organization ID>/<Cluster ID>;AuthMech=3;UID=token;PWD=<personal-access-token>`<br>Doesn't apply to a data loader task. |
| Min Workers ¹ | The minimum number of worker nodes to use for the Spark job. Minimum value is 1.<br>Required for Databricks cluster. Doesn't apply to SQL warehouse.<br>Doesn't apply to a data loader task. |
| Max Workers ¹ | The maximum number of worker nodes to use for the Spark job. If you don't want to autoscale, set Max Workers equal to Min Workers or leave Max Workers unset.<br>Optional for Databricks cluster. Doesn't apply to SQL warehouse.<br>Doesn't apply to a data loader task. |
| DB Runtime Version ¹ | The Databricks runtime version of the cluster to spawn when you connect to a Databricks cluster to process mappings.<br>Required for Databricks cluster. Doesn't apply to SQL warehouse.<br>Select runtime version 9.1 LTS.<br>Doesn't apply to a data loader task. |
| Worker Node Type ¹ | The worker node instance type that is used to run the Spark job.<br>Required for Databricks cluster. Doesn't apply to SQL warehouse.<br>For example, the worker node type for AWS can be `i3.2xlarge`. The worker node type for Azure can be `Standard_DS3_v2`.<br>Doesn't apply to a data loader task. |
| Driver Node Type ¹ | The driver node instance type that is used to collect data from the Spark workers.<br>Optional for Databricks cluster. Doesn't apply to SQL warehouse.<br>For example, the driver node type for AWS can be `i3.2xlarge`. The driver node type for Azure can be `Standard_DS3_v2`.<br>If you don't specify the driver node type, Databricks uses the value you specify in the worker node type field.<br>Doesn't apply to a data loader task. |
| Instance Pool ID ¹ | The instance pool ID used for the Spark cluster.<br>Optional for Databricks cluster. Doesn't apply to SQL warehouse.<br>If you specify the instance pool ID to run mappings, the following connection properties are ignored:<br>Doesn't apply to a data loader task. |
| Elastic Disk ¹ | Enables the cluster to get additional disk space.<br>Optional for Databricks cluster. Doesn't apply to SQL warehouse.<br>Enable this option if the Spark workers are running low on disk space.<br>Doesn't apply to a data loader task. |
| Spark Configuration ¹ | The Spark configuration to use in the Databricks cluster.<br>Optional for Databricks cluster. Doesn't apply to SQL warehouse.<br>The configuration must be in the following format: `"key1"="value1";"key2"="value2";...`<br>For example: `"spark.executor.userClassPathFirst"="False"`<br>Doesn't apply to a data loader task or to Mass Ingestion tasks. |
| Spark Environment Variables ¹ | The environment variables to export before launching the Spark driver and workers.<br>Optional for Databricks cluster. Doesn't apply to SQL warehouse.<br>The variables must be in the following format: `"key1"="value1";"key2"="value2";...`<br>For example: `"MY_ENVIRONMENT_VARIABLE"="true"`<br>Doesn't apply to a data loader task or to Mass Ingestion tasks. |

¹ Doesn't apply to mappings in advanced mode.
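The `"key1"="value1";"key2"="value2"` format used by the Spark Configuration and Spark Environment Variables properties above can be sketched as a small parser. This is an illustrative sketch of the documented format, not part of the connector:

```python
# Sketch: parse the "key1"="value1";"key2"="value2" format used by the
# Spark Configuration and Spark Environment Variables connection properties.
def parse_kv_string(raw: str) -> dict:
    pairs = {}
    for entry in raw.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        key, _, value = entry.partition("=")
        pairs[key.strip().strip('"')] = value.strip().strip('"')
    return pairs

conf = parse_kv_string('"spark.executor.userClassPathFirst"="False";'
                       '"MY_ENVIRONMENT_VARIABLE"="true"')
print(conf)
# {'spark.executor.userClassPathFirst': 'False', 'MY_ENVIRONMENT_VARIABLE': 'true'}
```

Both properties use the same delimiter and quoting rules, so a single format applies to either field.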
| Property | Description |
|---|---|
| ADLS Storage Account Name | The name of the Microsoft Azure Data Lake Storage account. |
| ADLS Client ID | The ID of your application to complete the OAuth authentication in the Active Directory. |
| ADLS Client Secret | The client secret key to complete the OAuth authentication in the Active Directory. |
| ADLS Tenant ID | The ID of the Microsoft Azure Data Lake Storage directory that you use to write data. |
| ADLS Endpoint | The OAuth 2.0 token endpoint where authentication based on the client ID and client secret is completed. |
| ADLS Filesystem Name | The name of an existing file system to store the Databricks data. |
| ADLS Staging Filesystem Name | The name of an existing file system to store the staging data. |
| Property | Description |
|---|---|
| S3 Authentication Mode | The authentication mode to connect to Amazon S3.<br>Select one of the following authentication modes: |
| S3 Access Key | The key to access the Amazon S3 bucket. |
| S3 Secret Key | The secret key to access the Amazon S3 bucket. |
| S3 Data Bucket | The existing S3 bucket to store the Databricks data. |
| S3 Staging Bucket | The existing bucket to store the staging files. |
| S3 VPC Endpoint Type ¹ | The type of Amazon Virtual Private Cloud endpoint for Amazon S3.<br>You can use a VPC endpoint to enable private communication with Amazon S3.<br>Select one of the following options: |
| Endpoint DNS Name for S3 ¹ | The DNS name for the Amazon S3 interface endpoint.<br>Replace the asterisk symbol with the `bucket` keyword in the DNS name.<br>Enter the DNS name in the following format: `bucket.<DNS name of the interface endpoint>`<br>For example: `bucket.vpce-s3.us-west-2.vpce.amazonaws.com` |
| IAM Role ARN ¹ | The Amazon Resource Name (ARN) of the IAM role assumed by the user to use the dynamically generated temporary security credentials.<br>For more information about how to get the ARN of the IAM role, see the AWS documentation. |
| Use EC2 Role to Assume Role ¹ | Optional. Select the check box to enable the EC2 role to assume another IAM role specified in the IAM Role ARN option.<br>The EC2 role must have a policy attached with permission to assume an IAM role from the same or a different AWS account. |
| STS VPC Endpoint Type ¹ | The type of Amazon Virtual Private Cloud endpoint for AWS Security Token Service.<br>You can use a VPC endpoint to enable private communication with AWS Security Token Service.<br>Select one of the following options: |
| Endpoint DNS Name for AWS STS ¹ | The DNS name for the AWS STS interface endpoint.<br>For example: `vpce-01f22cc14558c241f-s8039x4c.sts.us-west-2.vpce.amazonaws.com` |
| S3 Service Regional Endpoint | The S3 regional endpoint when the S3 data bucket and the S3 staging bucket need to be accessed through a region-specific S3 regional endpoint.<br>Optional for SQL warehouse. Doesn't apply to all-purpose clusters and job clusters.<br>Default is `s3.amazonaws.com`. |
| S3 Region Name ¹ | The AWS cluster region in which the bucket that you want to access resides. |
| Zone ID ¹ | The zone ID for the Databricks job cluster.<br>Optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.<br>Specify the zone ID only if you want to create a Databricks job cluster in a particular zone at runtime. For example, `us-west-2a`.<br>The zone must be in the same region where your Databricks account resides. |
| EBS Volume Type ¹ | The type of EBS volumes launched with the cluster.<br>Optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. |
| EBS Volume Count ¹ | The number of EBS volumes launched for each instance. You can choose up to 10 volumes.<br>Optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.<br>In a Databricks connection, specify at least one EBS volume for node types with no instance store. Otherwise, cluster creation fails. |
| EBS Volume Size ¹ | The size of a single EBS volume in GiB launched for an instance.<br>Optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. |

¹ Doesn't apply to mappings in advanced mode.
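The Endpoint DNS Name for S3 property above asks you to replace the asterisk in the interface endpoint's DNS name with the `bucket` keyword. A minimal sketch of that substitution, using the endpoint name from the example above:

```python
# Sketch: derive the value for "Endpoint DNS Name for S3" from the wildcard
# DNS name of an S3 interface endpoint. The connection property expects the
# asterisk replaced with the literal "bucket" keyword.
def bucket_dns_name(wildcard_dns: str) -> str:
    if wildcard_dns.startswith("*."):
        return "bucket." + wildcard_dns[2:]
    return "bucket." + wildcard_dns  # name given without the wildcard prefix

print(bucket_dns_name("*.vpce-s3.us-west-2.vpce.amazonaws.com"))
# bucket.vpce-s3.us-west-2.vpce.amazonaws.com
```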
| Functionality | Advanced Mode | Data Integration Server |
|---|---|---|
| Profiling on a managed Databricks Delta table | Yes | Yes |
| Profiling on an external Databricks Delta table | Yes | Yes |
| Profiling on a view table | No | Yes |
| Data preview | No | Yes |
| Simple filter | No | Yes |
| Custom query | No | Yes |
| SQL override in the advanced options | No | Yes |
| First N rows | No | Yes |
| Drilldown | No | Yes |
| Query on profile results | No | Yes |
| Database name or table name override in the advanced options | No | Yes |
| Ability to create a Databricks Delta source mapplet profile | No | Yes |
| Support for Data Quality Insights | No | Yes |
| Ability to create scorecard metrics | Yes | Yes |
| Compare columns | Yes | Yes |
| Compare profile runs | Yes | Yes |
| Export/import of Databricks profiles | Yes | Yes |
| Ability to link columns of files to Data Quality assets | Yes | Yes |