Table of Contents

  1. Preface
  2. Introduction to PowerExchange for Microsoft Azure Data Lake Storage Gen2
  3. PowerExchange for Microsoft Azure Data Lake Storage Gen2 Configuration
  4. Microsoft Azure Data Lake Storage Gen2 Connections
  5. PowerExchange for Microsoft Azure Data Lake Storage Gen2 Data Objects
  6. Microsoft Azure Data Lake Storage Gen2 Mappings
  7. Appendix A: Microsoft Azure Data Lake Storage Gen2 Datatype Reference

PowerExchange for Microsoft Azure Data Lake Storage Gen2 User Guide

Prerequisites

Before you use PowerExchange for Microsoft Azure Data Lake Storage Gen2, you must complete the following prerequisites:
  • Install and configure the Informatica services.
  • Install and configure the Developer tool. You can install the Developer tool when you install Informatica clients.
  • Create a Data Integration Service and a Model Repository Service in the Informatica domain.
  • Verify that a cluster configuration is created in the domain.
  • Verify that a Metadata Access Service is created in the domain.
  • Verify that the Hadoop distribution version is 3.x or later.
  • Verify that the user used to configure the Informatica domain is added to the cluster and has sudo privileges when you use a non-Kerberized Cloudera CDH 6.3 Hadoop distribution.
  • Verify that the following tasks are completed before you create a Microsoft Azure Data Lake Storage Gen2 connection:
    • Create an Azure Data Lake Storage Gen2 account and assign the Contributor or Reader role to users.
      The Contributor role grants full access to manage all resources in the storage account, but does not allow you to assign roles.
      The Reader role allows you to view all resources in the storage account, but does not allow you to make any changes.
      To add or remove role assignments, you must have write and delete permissions, such as an Owner role.
    • Create an Azure Active Directory application to authenticate users to access the Azure Data Lake Storage Gen2 account. Assign the Storage Blob Data Contributor or Storage Blob Data Reader role to the application.
      The Storage Blob Data Contributor role lets you read, write, and delete Azure Storage containers and blobs in the storage account.
      The Storage Blob Data Reader role lets you only read and list Azure Storage containers and blobs in the storage account.
    • Enable hierarchical namespaces for your Azure Data Lake Storage Gen2 account.
    • Create a file system for Microsoft Azure Data Lake Storage Gen2 (see the SDK sketch after this list).
    • To access objects from an HDI 4.0 Kerberized cluster, configure the impersonation user details in your Azure Data Lake Storage Gen2 account. Assign the Contributor role and full access to the impersonation user for the container used in the internal storage account of the HDInsight Data Lake Storage Gen2 cluster.
    For more information, see the Azure Data Lake Storage Gen2 documentation.
  • To fetch the metadata at design time, you must configure the INFA_PARSER_HOME environment variable for the Metadata Access Service in Informatica Administrator.
    Perform the following steps to configure the INFA_PARSER_HOME property:
    1. Log in to Informatica Administrator.
    2. Click the Metadata Access Service and then click the Processes tab on the right pane.
    3. Click Edit in the Environment Variables section.
    4. Click New to add an environment variable.
    5. Enter the name of the environment variable as INFA_PARSER_HOME.
    6. Set the value of the environment variable to the absolute path of the Cloudera CDH 6.3 directory on the machine that runs the Metadata Access Service.
      For example: INFA_PARSER_HOME=<Informatica installation directory>/services/shared/hadoop/CDH_6.3
    7. Recycle the Metadata Access Service.
  • To successfully preview data from a local complex file or run a mapping in the native environment, you must configure the INFA_PARSER_HOME property for the Data Integration Service in Informatica Administrator.
    Perform the following steps to configure the INFA_PARSER_HOME property. A sketch for verifying the variable on the service machine follows this list.
    1. Log in to Informatica Administrator.
    2. Click the Data Integration Service and then click the Processes tab on the right pane.
    3. Click Edit in the Environment Variables section.
    4. Click New to add an environment variable.
    5. Enter the name of the environment variable as INFA_PARSER_HOME.
    6. Set the value of the environment variable to the absolute path of the Hadoop distribution directory on the machine that runs the Data Integration Service.
    7. Recycle the Data Integration Service.
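
A file system for the connection prerequisites above can also be created programmatically. The following is a minimal sketch, assuming the azure-identity and azure-storage-file-datalake Python packages are installed; the placeholder values are illustrative and must be replaced with your own:

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate as the Azure Active Directory application that holds the
# Storage Blob Data Contributor role on the storage account.
credential = ClientSecretCredential(
    tenant_id="<directory-ID-of-Azure-AD>",
    client_id="<your-service-client-id>",
    client_secret="<your-service-client-secret-key>",
)

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=credential,
)

# Create the file system that the Microsoft Azure Data Lake Storage Gen2
# connection will point to.
service.create_file_system(file_system="<file-system>")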
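
You can verify the INFA_PARSER_HOME value before you recycle a service. The following is a hypothetical Python sketch, not part of the product, that checks the variable on the machine that runs the Metadata Access Service or the Data Integration Service:

import os
from pathlib import Path

# Confirm that INFA_PARSER_HOME is set and points at an existing Hadoop
# distribution directory, for example
# <Informatica installation directory>/services/shared/hadoop/CDH_6.3.
parser_home = os.environ.get("INFA_PARSER_HOME")
if not parser_home:
    raise SystemExit("INFA_PARSER_HOME is not set in this environment")

path = Path(parser_home)
if not path.is_dir():
    raise SystemExit(f"INFA_PARSER_HOME is not an existing directory: {path}")

print(f"INFA_PARSER_HOME looks valid: {path}")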

Configure Databricks Connection Advanced Properties

Verify that a Databricks connection is created in the domain. If you want to read NULL values from or write NULL values to an Azure source, configure the following advanced properties in the Databricks connection (a standalone PySpark analogue follows the list):
  • infaspark.flatfile.reader.nullValue=True
  • infaspark.flatfile.writer.nullValue=True
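
The infaspark.* properties are specific to the Informatica Spark engine. As a point of comparison only, the following standalone PySpark sketch shows the analogous open-source nullValue option on a flat-file read; the input path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("null-value-demo").getOrCreate()

# "nullValue" is the open-source Spark counterpart of the reader property
# above: matching fields in the flat file are read as NULL.
df = (
    spark.read
    .option("header", True)
    .option("nullValue", "")
    .csv("/tmp/sample.csv")  # hypothetical input file
)
df.show()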

Configure Microsoft Azure Data Lake Storage Gen2 Access in Azure Databricks Cluster

Set the following Hadoop credential configuration options under Spark Config in your Databricks cluster configuration to access Microsoft Azure Data Lake Storage Gen2:
spark.hadoop.fs.azure.account.auth.type OAuth
spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id <your-service-client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret <your-service-client-secret-key>
spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token
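
After these options are set on the cluster, Spark jobs can address the storage account directly through abfss:// URIs. The following is a minimal PySpark sketch, assuming a Parquet data set exists at the placeholder path; <file-system> and <storage-account> must be replaced with your values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read through the ABFS driver that the OAuth options above configure.
df = spark.read.parquet(
    "abfss://<file-system>@<storage-account>.dfs.core.windows.net/path/to/data"
)
df.printSchema()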

Authentication Process

PowerExchange for Microsoft Azure Data Lake Storage Gen2 uses OAuth 2.0 authorization. The following image shows how PowerExchange for Microsoft Azure Data Lake Storage Gen2 receives access tokens and accesses resources:

[Image: The OAuth 2.0 authorization process between PowerExchange for Microsoft Azure Data Lake Storage Gen2 and Azure Active Directory.]
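
For illustration only, the token exchange can be reproduced with a short Python sketch of the OAuth 2.0 client credentials grant. It uses the requests library and the same placeholders as the Spark configuration above; it is not the connector's internal code:

import requests

# Exchange the service principal credentials for a bearer token at the
# Azure Active Directory token endpoint.
response = requests.post(
    "https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "<your-service-client-id>",
        "client_secret": "<your-service-client-secret-key>",
        "resource": "https://storage.azure.com/",
    },
)
response.raise_for_status()

# The returned access token authorizes subsequent requests to the
# Data Lake Storage Gen2 endpoints until it expires.
token = response.json()["access_token"]
print("received access token of length", len(token))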
