Table of Contents

  1. Preface
  2. Introduction to PowerExchange for Microsoft Azure Data Lake Storage Gen2
  3. PowerExchange for Microsoft Azure Data Lake Storage Gen2 Configuration
  4. Microsoft Azure Data Lake Storage Gen2 Connections
  5. PowerExchange for Microsoft Azure Data Lake Storage Gen2 Data Objects
  6. Microsoft Azure Data Lake Storage Gen2 Mappings
  7. Appendix A: Microsoft Azure Data Lake Storage Gen2 Datatype Reference

PowerExchange for Microsoft Azure Data Lake Storage Gen2 User Guide

Prerequisites

Before you use PowerExchange for Microsoft Azure Data Lake Storage Gen2, you must complete the following prerequisites:
  • Install and configure the Informatica services.
  • Install and configure the Developer tool. You can install the Developer tool when you install Informatica clients.
  • Create a Data Integration Service and a Model Repository Service in the Informatica domain.
  • Verify that a cluster configuration is created in the domain.
  • Verify that a Metadata Access Service is created in the domain.
  • Verify that the Hadoop distribution version is 3.x or later.
  • Verify that the user used to configure the Informatica domain is added to the cluster and has sudo privileges when you use a non-Kerberized Cloudera CDH 6.3 Hadoop distribution.
  • Verify that the following tasks are completed before you create a Microsoft Azure Data Lake Storage Gen2 connection:
    • Create an Azure Data Lake Storage Gen2 account and assign the Contributor or Reader role to users.
      The Contributor role grants full access to manage all resources in the storage account, but does not allow you to assign roles.
      The Reader role allows you to view all resources in the storage account, but does not allow you to make any changes.
      To add or remove role assignments, you must have write and delete permissions, such as an Owner role.
    • Create an Azure Active Directory application to authenticate users to access the Azure Data Lake Storage Gen2 account. Assign the Storage Blob Data Contributor or Storage Blob Data Reader role to the application.
      The Storage Blob Data Contributor role lets you read, write, and delete Azure Storage containers and blobs in the storage account.
      The Storage Blob Data Reader role lets you only read and list Azure Storage containers and blobs in the storage account.
    • Enable hierarchical namespaces for your Azure Data Lake Storage Gen2 account.
    • Create a file system for Microsoft Azure Data Lake Storage Gen2 (see the SDK sketch after this list).
    • To access objects from an HDI 4.0 Kerberized cluster, configure the impersonation user details in your Azure Data Lake Storage Gen2 account. Assign the Contributor role and full access to the impersonation user for the container used in the internal storage account of the HDInsight Data Lake Storage Gen2 cluster.
    For more information, see the Azure Data Lake Storage Gen2 documentation.
  • To fetch the metadata at design time, you must configure the INFA_PARSER_HOME environment variable for the Metadata Access Service in Informatica Administrator.
    Perform the following steps to configure the INFA_PARSER_HOME property:
    1. Log in to Informatica Administrator.
    2. Click the Metadata Access Service and then click the Processes tab on the right pane.
    3. Click Edit in the Environment Variables section.
    4. Click New to add an environment variable.
    5. Enter the name of the environment variable as INFA_PARSER_HOME.
    6. Set the value of the environment variable to the absolute path of the Cloudera CDH 6.3 directory on the machine that runs the Metadata Access Service.
      For example: INFA_PARSER_HOME=<Informatica installation directory>/services/shared/hadoop/CDH_6.3
    7. Recycle the Metadata Access Service.
  • To successfully preview data from a local complex file or run a mapping in the native environment, you must configure the INFA_PARSER_HOME property for the Data Integration Service in Informatica Administrator.
    Perform the following steps to configure the INFA_PARSER_HOME property. A sketch for verifying the variable on the service machine follows this list.
    1. Log in to Informatica Administrator.
    2. Click the Data Integration Service and then click the Processes tab on the right pane.
    3. Click Edit in the Environment Variables section.
    4. Click New to add an environment variable.
    5. Enter the name of the environment variable as INFA_PARSER_HOME.
    6. Set the value of the environment variable to the absolute path of the Hadoop distribution directory on the machine that runs the Data Integration Service.
    7. Recycle the Data Integration Service.
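
A file system for the connection prerequisites above can also be created programmatically. The following is a minimal sketch, assuming the azure-identity and azure-storage-file-datalake Python packages are installed; the placeholder values are illustrative and must be replaced with your own:

from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate as the Azure Active Directory application that holds the
# Storage Blob Data Contributor role on the storage account.
credential = ClientSecretCredential(
    tenant_id="<directory-ID-of-Azure-AD>",
    client_id="<your-service-client-id>",
    client_secret="<your-service-client-secret-key>",
)

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=credential,
)

# Create the file system that the Microsoft Azure Data Lake Storage Gen2
# connection will point to.
service.create_file_system(file_system="<file-system>")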
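
You can verify the INFA_PARSER_HOME value before you recycle a service. The following is a hypothetical Python sketch, not part of the product, that checks the variable on the machine that runs the Metadata Access Service or the Data Integration Service:

import os
from pathlib import Path

# Confirm that INFA_PARSER_HOME is set and points at an existing Hadoop
# distribution directory, for example
# <Informatica installation directory>/services/shared/hadoop/CDH_6.3.
parser_home = os.environ.get("INFA_PARSER_HOME")
if not parser_home:
    raise SystemExit("INFA_PARSER_HOME is not set in this environment")

path = Path(parser_home)
if not path.is_dir():
    raise SystemExit(f"INFA_PARSER_HOME is not an existing directory: {path}")

print(f"INFA_PARSER_HOME looks valid: {path}")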

Configure Databricks Connection Advanced Properties

Verify that a Databricks connection is created in the domain. If you want to read NULL values from or write NULL values to an Azure source, configure the following advanced properties in the Databricks connection (a standalone PySpark analogue follows the list):
  • infaspark.flatfile.reader.nullValue=True
  • infaspark.flatfile.writer.nullValue=True
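
The infaspark.* properties are specific to the Informatica Spark engine. As a point of comparison only, the following standalone PySpark sketch shows the analogous open-source nullValue option on a flat-file read; the input path is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("null-value-demo").getOrCreate()

# "nullValue" is the open-source Spark counterpart of the reader property
# above: matching fields in the flat file are read as NULL.
df = (
    spark.read
    .option("header", True)
    .option("nullValue", "")
    .csv("/tmp/sample.csv")  # hypothetical input file
)
df.show()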

Configure Microsoft Azure Data Lake Storage Gen2 Access in Azure Databricks Cluster

Set the following Hadoop credential configuration options under Spark Config in your Databricks cluster configuration to access Microsoft Azure Data Lake Storage Gen2:
spark.hadoop.fs.azure.account.auth.type OAuth
spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id <your-service-client-id>
spark.hadoop.fs.azure.account.oauth2.client.secret <your-service-client-secret-key>
spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token
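
After these options are set on the cluster, Spark jobs can address the storage account directly through abfss:// URIs. The following is a minimal PySpark sketch, assuming a Parquet data set exists at the placeholder path; <file-system> and <storage-account> must be replaced with your values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read through the ABFS driver that the OAuth options above configure.
df = spark.read.parquet(
    "abfss://<file-system>@<storage-account>.dfs.core.windows.net/path/to/data"
)
df.printSchema()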

Authentication Process

PowerExchange for Microsoft Azure Data Lake Storage Gen2 uses OAuth 2.0 authorization. The following image shows how PowerExchange for Microsoft Azure Data Lake Storage Gen2 receives access tokens and accesses resources:

[Image: The OAuth 2.0 authorization process between PowerExchange for Microsoft Azure Data Lake Storage Gen2 and Azure Active Directory.]
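
For illustration only, the token exchange can be reproduced with a short Python sketch of the OAuth 2.0 client credentials grant. It uses the requests library and the same placeholders as the Spark configuration above; it is not the connector's internal code:

import requests

# Exchange the service principal credentials for a bearer token at the
# Azure Active Directory token endpoint.
response = requests.post(
    "https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "<your-service-client-id>",
        "client_secret": "<your-service-client-secret-key>",
        "resource": "https://storage.azure.com/",
    },
)
response.raise_for_status()

# The returned access token authorizes subsequent requests to the
# Data Lake Storage Gen2 endpoints until it expires.
token = response.json()["access_token"]
print("received access token of length", len(token))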
