Table of Contents

  1. Preface
  2. Part 1: Introduction to Data Discovery
  3. Part 2: Data Discovery with Informatica Analyst
  4. Part 3: Data Discovery with Informatica Developer
  5. Appendix A: Function Support Based on Profiling Warehouse Connection

Data Discovery Guide

Data Domain Discovery on the Databricks Cluster

Use the Databricks cluster to perform data discovery on the Spark engine. A Databricks cluster is an environment for running Spark jobs. You can run a profile to perform data discovery on Azure sources using the Databricks cluster.
Perform the following steps to connect to the Azure sources on the Databricks cluster:

Prerequisite

Add the following advanced Spark configuration parameters for the Databricks cluster, and then restart the cluster (see the example after this list):
  • fs.azure.account.auth.type OAuth
  • fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
  • fs.azure.account.oauth2.client.id <your-service-client-id>
  • fs.azure.account.oauth2.client.secret <your-service-client-secret-key>
  • fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token
  • spark.hadoop.fs.azure.account.key.<ACCOUNT_NAME>.dfs.core.windows.net <VALUE>
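
For example, the complete Spark config entry on the cluster is a set of space-separated key-value pairs, one per line. All of the IDs, secrets, and account values below are hypothetical placeholders; substitute your own:
    fs.azure.account.auth.type OAuth
    fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
    fs.azure.account.oauth2.client.id 11111111-2222-3333-4444-555555555555
    fs.azure.account.oauth2.client.secret exampleClientSecretValue
    fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/aaaabbbb-cccc-dddd-eeee-ffff00001111/oauth2/token
    spark.hadoop.fs.azure.account.key.examplestorageacct.dfs.core.windows.net exampleAccountKeyValue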

Download and Copy the JAR Files for the Profiling Warehouse

  1. Get the Oracle DataDirect JDBC driver JAR files for the profiling warehouse. You can copy the files from the following location:
    <INFA_HOME>/services/shared/jars/thirdparty/com.informatica.datadirect-dworacle-6.0.0_F.jar
  2. Place the Oracle DataDirect JDBC driver JAR files in the following locations (a scripted example follows this list):
    • <INFA_HOME>/connectors/thirdparty/informatica.jdbc_v2/spark
    • <INFA_HOME>/connectors/thirdparty/informatica.jdbc_v2/common
    • <INFA_HOME>/services/shared/hadoop/<DataBricksversion>/runtimeLib
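
Copying the driver JAR can be scripted. The following is a minimal Python sketch, assuming a Linux installation; the INFA_HOME path and the Databricks version directory name are hypothetical and must be adjusted to your environment:

    # Minimal sketch: copy the Oracle DataDirect JDBC driver JAR into the
    # three locations listed above. The paths below are assumptions.
    import shutil
    from pathlib import Path

    INFA_HOME = Path("/opt/informatica")  # assumption: your <INFA_HOME>
    DATABRICKS_DIR = "databricks"         # assumption: your <DataBricksversion> directory

    jar = INFA_HOME / "services/shared/jars/thirdparty/com.informatica.datadirect-dworacle-6.0.0_F.jar"
    targets = [
        INFA_HOME / "connectors/thirdparty/informatica.jdbc_v2/spark",
        INFA_HOME / "connectors/thirdparty/informatica.jdbc_v2/common",
        INFA_HOME / "services/shared/hadoop" / DATABRICKS_DIR / "runtimeLib",
    ]

    for target in targets:
        # Copy the JAR into each location; the directories exist in a
        # standard installation, so no mkdir is attempted here.
        shutil.copy2(jar, target / jar.name)
        print(f"copied {jar.name} -> {target}")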

Download and Copy the JAR Files for the JDBC Delta Objects

  1. Get the JDBC .jar files for the JDBC Delta objects. You can download the files from the database vendor website.
  2. Place the .jar files in the following Developer tool location to access the metadata (see the example after these steps):
    \clients\externaljdbcjars
  3. Restart the Developer tool.
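
For example, with a hypothetical client installation directory of C:\Informatica and a hypothetical vendor driver named delta-jdbc.jar, the file would be placed at:
    C:\Informatica\clients\externaljdbcjars\delta-jdbc.jar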

Configure Custom Properties on the Data Integration Service

  1. Launch Informatica Administrator, and then select the Data Integration Service in the Domain Navigator.
  2. Click the Custom Properties option on the Properties tab.
  3. Set the following custom property to perform automatic installation of the Informatica libraries into the Databricks cluster (the equivalent name-value form is shown after these steps):
    ExecutionContextOptions.databricks.enable.infa.libs.autoinstall:true
  4. Recycle the Data Integration Service.
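
In the Administrator tool, a custom property is entered as a name-value pair, so the property above corresponds to:
    Name:  ExecutionContextOptions.databricks.enable.infa.libs.autoinstall
    Value: true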

Supported Sources for Data Domain Discovery on the Databricks Cluster

  • JDBC Delta
  • Azure Data Lake Storage Gen2
