Configure *-site.xml Files for Azure HDInsight

The Hadoop administrator needs to configure *-site.xml file properties and restart the credential service and other impacted services before the Informatica administrator imports cluster information into the domain.

core-site.xml

Configure the following properties in the core-site.xml file:
fs.azure.account.key.<youraccount>.blob.core.windows.net
Required for an Azure HDInsight cluster that uses WASB storage. The storage account access key that is required to access the storage.
You can contact the HDInsight cluster administrator to get the storage account key associated with the HDInsight cluster. If you are unable to contact the administrator, perform the following steps to decrypt the encrypted storage account key:
  • Copy the encrypted value of the fs.azure.account.key.<youraccount>.blob.core.windows.net property:
    <property>
      <name>fs.azure.account.key.<youraccount>.blob.core.windows.net</name>
      <value>STORAGE ACCOUNT KEY</value>
    </property>
  • Decrypt the storage account key. Run the decrypt.sh script specified in the fs.azure.shellkeyprovider.script property along with the encrypted value that you copied in the previous step:
    <property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
    </property>
  • Copy the decrypted value and update the value of the fs.azure.account.key.<youraccount>.blob.core.windows.net property in the cluster configuration core-site.xml file.
dfs.adls.oauth2.client.id
Required for an Azure HDInsight cluster that uses ADLS storage without Enterprise Security Package. The application ID associated with the service principal, which is required to authorize the service principal and access the storage.
To find the application ID for a service principal, in the Azure portal, click Azure Active Directory > App registrations > Service Principal Display Name.
dfs.adls.oauth2.refresh.url
Required for an Azure HDInsight cluster that uses ADLS storage without Enterprise Security Package. The OAuth 2.0 token endpoint that is required to authorize the service principal and access the storage.
To find the OAuth 2.0 token endpoint to use as the refresh URL, in the Azure portal, click Azure Active Directory > App registrations > Endpoints.
dfs.adls.oauth2.credential
Required for an Azure HDInsight cluster that uses ADLS storage without Enterprise Security Package. The password that is required to authorize the service principal and access the storage.
To find the password for a service principal, in the Azure portal, click Azure Active Directory > App registrations > Service Principal Display Name > Settings > Keys.
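Taken together, the three ADLS properties above might appear in core-site.xml roughly as in the following sketch. Every value is a placeholder, not a real credential. The token endpoint shape (login.microsoftonline.com followed by a tenant ID) is an assumption about a typical Azure Active Directory endpoint; copy the actual URL from the Endpoints page.

```xml
<!-- Sketch only: replace each placeholder with the value from your Azure portal. -->
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>YOUR_APPLICATION_ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <!-- Assumed endpoint shape; use the OAuth 2.0 token endpoint listed for your tenant. -->
  <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>YOUR_SERVICE_PRINCIPAL_PASSWORD</value>
</property>
```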
hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set to group names of impersonation users separated by commas. If less security is preferred, use the wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.users
Required for an Azure HDInsight cluster that uses Enterprise Security Package and ADLS storage. Defines the user accounts that the proxy user account can impersonate. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set to a single user account or set to a comma-separated list. If less security is preferred, use the wildcard " * " to allow impersonation of any user.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines from which the proxy user account can impersonate other users. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred, use the wildcard " * " to allow impersonation from any host.
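As an illustration of the three proxy user properties, the following sketch assumes a hypothetical proxy user named infa_user. The group, user, and host names are also invented for the example, so substitute the values that apply to your cluster.

```xml
<!-- Sketch only: infa_user and all values below are hypothetical. -->
<property>
  <name>hadoop.proxyuser.infa_user.groups</name>
  <!-- Comma-separated group names; a single * allows any group. -->
  <value>etl_developers,etl_operators</value>
</property>
<property>
  <name>hadoop.proxyuser.infa_user.users</name>
  <!-- A single user account or a comma-separated list. -->
  <value>alice,bob</value>
</property>
<property>
  <name>hadoop.proxyuser.infa_user.hosts</name>
  <!-- Host names or IP addresses; a single * allows any host. -->
  <value>infa-node1.example.com,10.0.0.15</value>
</property>
```

Listing explicit groups, users, and hosts keeps impersonation narrower than the wildcard does.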
hadoop.proxyuser.yarn.groups
Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure cluster.
Set to group names of impersonation users separated by commas. If less security is preferred, use the wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.yarn.hosts
Comma-separated list of hosts from which you want to allow the YARN user to impersonate on a non-secure cluster.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred, use the wildcard " * " to allow impersonation from any host.
io.compression.codecs
Enables compression on temporary staging tables.
Set to a comma-separated list of compression codec classes on the cluster.
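A comma-separated codec list might look like the following sketch. The classes shown are standard Hadoop codec classes, but which of them are installed on a given cluster is an assumption, so list only the codecs that your cluster provides.

```xml
<!-- Sketch only: include only the codec classes available on your cluster. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```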
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
Set to: RULE:[1:$1@$0](^.*@YOUR.REALM)s/^(.*)@YOUR.REALM\.COM$/$1/g
Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.$)s/^(.*)@YOUR.REALM\.COM$/$1/g

hbase-site.xml

Configure the following properties in the hbase-site.xml file:
hbase.use.dynamic.jars
Enables metadata import and test connection from the Developer tool. Required for an HDInsight cluster that uses ADLS storage or an Amazon EMR cluster that uses HBase resources in S3 storage.
Set to: false
zookeeper.znode.parent
Identifies HBase master and region servers.
Set to the relative path to the znode directory of HBase.
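The two hbase-site.xml properties above could be written as in the following sketch. The znode path /hbase-unsecure is only an assumed example of a value sometimes seen on clusters without Kerberos; use the path that your cluster actually reports.

```xml
<!-- Sketch only: the znode path is an assumed example value. -->
<property>
  <name>hbase.use.dynamic.jars</name>
  <value>false</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>
```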

hive-site.xml

Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class
The token store implementation. Required for HiveServer2 high availability and load balancing.
Set to: org.apache.hadoop.hive.thrift.DBTokenStore
hive.compactor.initiator.on
Runs the initiator and cleaner threads on the metastore instance. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: TRUE
hive.compactor.worker.threads
The number of worker threads to run in a metastore instance. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: 1
hive.enforce.bucketing
Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: TRUE
hive.exec.dynamic.partition
Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.
Set to: TRUE
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a partitioned Hive target at run time.
Set to: nonstrict
hive.support.concurrency
Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: TRUE
hive.server2.support.dynamic.service.discovery
Enables HiveServer2 dynamic service discovery. Required for HiveServer2 high availability.
Set to: TRUE
hive.server2.zookeeper.namespace
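The ZooKeeper namespace that HiveServer2 uses. Required for HiveServer2 high availability.
Set to the value of the zooKeeperNamespace parameter in the JDBC connection string. For example, in the following connection string, the namespace is hiveserver2:
jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2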
hive.txn.manager
Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
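The hive-site.xml properties that this section marks as required for an Update Strategy transformation can be collected into one fragment, sketched below with the values listed in this section. Lowercase true is used here on the assumption that Hive treats boolean values case-insensitively.

```xml
<!-- Sketch only: values are taken from the settings listed in this section. -->
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
```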
hive.zookeeper.quorum
Comma-separated list of ZooKeeper server host:port pairs in a cluster. Required for HiveServer2 high availability.
Set to the value of the ZooKeeper ensemble in the JDBC connection string. The ensemble is the <zookeeper_ensemble> portion of the following connection string:
jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;
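For HiveServer2 high availability, the related properties might be set as in the following sketch. It assumes that the namespace and quorum properties hold the zooKeeperNamespace and ensemble portions of the JDBC connection string. The ZooKeeper host names and port are hypothetical.

```xml
<!-- Sketch only: host names and port are hypothetical. -->
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <!-- Matches zooKeeperNamespace=hiveserver2 in the connection string. -->
  <value>hiveserver2</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <!-- The ensemble portion of the connection string. -->
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>hive.cluster.delegation.token.store.class</name>
  <value>org.apache.hadoop.hive.thrift.DBTokenStore</value>
</property>
```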

mapred-site.xml

Configure the following properties in the mapred-site.xml file:
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for Sqoop.
Set to: yarn
yarn.app.mapreduce.am.staging-dir
The HDFS staging directory used while submitting jobs.
Set to the staging directory path.
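A minimal mapred-site.xml sketch for the two properties above follows. The staging path shown is an assumed example, so substitute the HDFS staging directory that your cluster uses.

```xml
<!-- Sketch only: the staging directory path is an assumed example. -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/tmp/hadoop-yarn/staging</value>
</property>
```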

yarn-site.xml

Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Required for dynamic resource allocation.
Add spark_shuffle.jar to the class path. The .jar file must contain the class "org.apache.spark.network.yarn.YarnShuffleService".
yarn.nodemanager.resource.memory-mb
The maximum RAM available for each container. Set the maximum memory on the cluster to increase resource memory available to the Blaze engine.
Set to 16 GB if the value is less than 16 GB.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores for each container. Required for Blaze engine resource allocation.
Set to 10 if the value is less than 10.
yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
Set to 6 GB if the value is less than 6 GB.
yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.
Set to: FALSE
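The resource settings above can be sketched as follows. Because the memory properties are expressed in megabytes, 16 GB is written as 16384 and 6 GB as 6144. These are the minimum values that this section suggests, so keep any higher values that your cluster already uses.

```xml
<!-- Sketch only: memory values are in MB (16 GB = 16384, 6 GB = 6144). -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>10</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>6144</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```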
yarn.nodemanager.aux-services
Required for dynamic resource allocation for the Spark engine.
Add an entry for "spark_shuffle".
yarn.nodemanager.aux-services.spark_shuffle.class
Required for dynamic resource allocation for the Spark engine.
Set to: org.apache.spark.network.yarn.YarnShuffleService
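The two auxiliary service properties might look like the following sketch. It assumes that the cluster already lists mapreduce_shuffle as an auxiliary service; if other services are configured, append spark_shuffle to the existing list instead of replacing it.

```xml
<!-- Sketch only: mapreduce_shuffle is assumed to be an existing entry. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```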
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler
yarn.node-labels.enabled
Enables node labeling.
Set to: TRUE
yarn.node-labels.fs-store.root-dir
The HDFS location used to update node labels dynamically.
Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>
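Node labeling could be enabled with a fragment like the following sketch. The NameNode host, port, and directory path are hypothetical placeholders that follow the pattern given above.

```xml
<!-- Sketch only: host, port, and path are hypothetical placeholders. -->
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode.example.com:8020/yarn/node-labels/</value>
</property>
```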

tez-site.xml

Configure the following properties in the tez-site.xml file:
tez.runtime.io.sort.mb
The sort buffer memory. Required when the output needs to be sorted for Blaze and Spark engines.
Set value to 270 MB.
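In tez-site.xml, the setting above can be sketched as follows. The property value is a number of megabytes, so 270 MB is written as 270.

```xml
<!-- Sketch only: the value is expressed in MB. -->
<property>
  <name>tez.runtime.io.sort.mb</name>
  <value>270</value>
</property>
```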