Data Discovery Guide

Data Domain Discovery on the Spark engine

When you run a profile to perform data domain discovery on the Spark engine, the reference tables are staged on the Hadoop cluster. To make sure that the reference tables for all the data domains are staged on the cluster, perform the following steps:

Prerequisite:

You must have permission to impersonate the HDFS user when you perform data domain discovery.

Download the JDBC .JAR Files

  1. Obtain the JDBC .jar files for the reference database that you use. You can download the files from the database vendor's website.
  2. Copy the files that you download to the following location:
    <INFA_HOME>/externaljdbcjars
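
The copy in step 2 can be scripted. The following is a minimal sketch, not an Informatica utility: the function name `copy_jdbc_jars` and both arguments are hypothetical, and it assumes the downloaded driver files are in one local directory and that you know the value of <INFA_HOME> on the host.

```python
import shutil
from pathlib import Path


def copy_jdbc_jars(download_dir, infa_home):
    """Copy every .jar file in download_dir to <INFA_HOME>/externaljdbcjars.

    Both arguments are illustrative; pass the real locations on your host.
    Returns the sorted list of file names that were copied.
    """
    target = Path(infa_home) / "externaljdbcjars"
    # Assumption: creating the directory is acceptable if it is missing.
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in sorted(Path(download_dir).glob("*.jar")):
        # copy2 keeps file timestamps along with the contents.
        shutil.copy2(jar, target / jar.name)
        copied.append(jar.name)
    return copied
```

Files in the download directory that do not end in .jar are left where they are.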

Configure Custom Properties on the Data Integration Service

  1. Launch Informatica Administrator, and then select the Data Integration Service in the Domain Navigator.
  2. Click the Custom Properties option on the Properties tab.
  3. Set the following custom properties to stage reference tables for the data domains:
    Property Name: AdvancedProfilingServiceOptions.ProfilingSparkReferenceDataHDFSDir
    Property Value: /tmp/cms
    Property Name: ExecutionContextOptions.SparkRefTableHadoopConnectorArgs
    Property Value: --connect <JDBC thin driver connection URL>
  4. Make sure that the /tmp/cms directory exists on the cluster. If the directory is not present, create the /tmp/cms directory or a custom directory where you want to stage the data. By default, the reference data is staged in the /tmp/cms directory.
  5. Recycle the Data Integration Service.
  6. Open the Analyst tool or the Developer tool, and make sure that the first profile you run includes all the data domains, so that the reference data for every data domain is staged.
If you do not select all the data domains in the first profile run and then select additional data domains in a later profile run, the profile run might fail.
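
The property values in step 3 and the directory check in step 4 can be prepared with a short script. The following is a sketch under stated assumptions, not part of the product: the function names are hypothetical, the script assumes that the `hdfs` command-line client is on the PATH of a user who is allowed to write to the cluster, and it only assembles the property values. You still enter the values in Informatica Administrator.

```python
import subprocess

DEFAULT_STAGING_DIR = "/tmp/cms"


def build_custom_properties(jdbc_url, staging_dir=DEFAULT_STAGING_DIR):
    """Return the two custom properties from step 3 as a name -> value dict."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("expected a JDBC connection URL starting with 'jdbc:'")
    return {
        "AdvancedProfilingServiceOptions.ProfilingSparkReferenceDataHDFSDir": staging_dir,
        "ExecutionContextOptions.SparkRefTableHadoopConnectorArgs": "--connect " + jdbc_url,
    }


def ensure_hdfs_dir(path=DEFAULT_STAGING_DIR, run=subprocess.run):
    """Create path on HDFS if `hdfs dfs -test -d` reports that it is missing.

    Returns True if the directory had to be created, False if it existed.
    The `run` parameter exists only so that the function can be tested
    without a cluster.
    """
    # `hdfs dfs -test -d` exits with 0 when the path is an existing directory.
    exists = run(["hdfs", "dfs", "-test", "-d", path]).returncode == 0
    if exists:
        return False
    # -p also creates any missing parent directories.
    run(["hdfs", "dfs", "-mkdir", "-p", path], check=True)
    return True
```

For example, `build_custom_properties("jdbc:oracle:thin:@//dbhost:1521/ORCL")` returns the staging directory `/tmp/cms` and the connector argument `--connect jdbc:oracle:thin:@//dbhost:1521/ORCL`. The host `dbhost` and service `ORCL` are placeholders. If you stage the data in a custom directory, pass the same path to both functions so that the property value and the directory on the cluster match.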
