Table of Contents

  1. Preface
  2. Introduction to Hadoop Integration
  3. Before You Begin
  4. Amazon EMR Integration Tasks
  5. Azure HDInsight Integration Tasks
  6. Cloudera CDH Integration Tasks
  7. Hortonworks HDP Integration Tasks
  8. MapR Integration Tasks
  9. Appendix A: Connections

Configure *-site.xml Files for Cloudera CDH

The Hadoop administrator needs to configure *-site.xml file properties and restart impacted services before the Informatica administrator imports cluster information into the domain.

core-site.xml

Configure the following properties in the core-site.xml file:
fs.s3.enableServerSideEncryption
Enables server-side encryption for S3 buckets. Required if the S3 bucket is encrypted. Required for EMR 5.14 integration if the S3 bucket is encrypted with SSE-KMS.
Set to: TRUE
fs.s3a.access.key
The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.
Set to your access key.
fs.s3a.secret.key
The password for the Blaze and Spark engines to connect to the Amazon S3 file system.
Set to your secret access key.
fs.s3a.server-side-encryption-algorithm
The server-side encryption algorithm for S3. Required if the S3 bucket is encrypted using an algorithm. Required for EMR 5.14 integration if the S3 bucket is encrypted with SSE-KMS.
Set to the encryption algorithm used.
hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set to group names of impersonation users separated by commas. If less security is preferred, use the wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines from which the proxy user account can impersonate other users. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred, use the wildcard " * " to allow impersonation from any host.
io.compression.codecs
Enables compression on temporary staging tables.
Set to a comma-separated list of compression codec classes on the cluster.
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
Set to: RULE:[1:$1@$0](^.*@YOUR.REALM)s/^(.*)@YOUR.REALM\.COM$/$1/g
Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.$)s/^(.*)@YOUR.REALM\.COM$/$1/g
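
The entries above go inside the <configuration> element of core-site.xml. The following sketch illustrates a possible layout; the access key, secret key, proxy user name (ispuser), and group names are placeholder assumptions:

<configuration>
  <!-- S3 credentials for the Blaze and Spark engines (placeholder values) -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
  <!-- Impersonation settings for the assumed proxy user "ispuser" -->
  <property>
    <name>hadoop.proxyuser.ispuser.groups</name>
    <value>hadoopusers,etlusers</value>
  </property>
  <property>
    <name>hadoop.proxyuser.ispuser.hosts</name>
    <value>*</value>
  </property>
</configuration>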

hbase-site.xml

Configure the following properties in the hbase-site.xml file:
zookeeper.znode.parent
Identifies HBase master and region servers.
Set to the relative path to the znode directory of HBase.
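
As a brief sketch, the hbase-site.xml entry might look like the following; /hbase is the common default znode parent and is used here only as an assumed example:

<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase</value>
</property>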

hdfs-site.xml

Configure the following properties in the hdfs-site.xml file:
dfs.encryption.key.provider.uri
The KeyProvider used to interact with encryption keys when reading and writing to an encryption zone. Required if sources or targets reside in an HDFS encryption zone on a Java KeyStore KMS-enabled Cloudera CDH cluster or a Ranger KMS-enabled Hortonworks HDP cluster.
Set to: kms://http@xx11.xyz.com:16000/kms
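
For reference, the property is added inside the <configuration> element of hdfs-site.xml. The KMS host and port below simply mirror the example value above:

<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@xx11.xyz.com:16000/kms</value>
</property>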

hive-site.xml

Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class
The token store implementation. Applies only to a Cloudera CDH cluster where HiveServer2 uses Apache ZooKeeper for high availability and load balancing.
Set to: org.apache.hadoop.hive.thrift.ZooKeeperTokenStore
hive.cluster.delegation.token.store.class
The token store implementation. Required for HiveServer2 high availability and load balancing.
Set to: org.apache.hadoop.hive.thrift.DBTokenStore
hive.exec.dynamic.partition
Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.
Set to: TRUE
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required if you use Sqoop and define a DDL query to create or replace a partitioned Hive target at run time.
Set to: nonstrict
hiveserver2_load_balancer
Enables high availability for multiple HiveServer2 hosts.
Set to:
jdbc:hive2://<HiveServer2 Load Balancer>:<HiveServer2 Port>/default;principal=hive/<HiveServer2 Load Balancer>@<REALM>
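
As an illustration, the hive-site.xml entries for a Cloudera CDH cluster that uses ZooKeeper-based HiveServer2 high availability might look like the following sketch; only the ZooKeeper token store case from above is shown:

<!-- Token store for HiveServer2 high availability with ZooKeeper -->
<property>
  <name>hive.cluster.delegation.token.store.class</name>
  <value>org.apache.hadoop.hive.thrift.ZooKeeperTokenStore</value>
</property>
<!-- Allow all partitions to be dynamic for Sqoop DDL queries -->
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>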

mapred-site.xml

Configure the following properties in the mapred-site.xml file:
mapreduce.application.classpath
A comma-separated list of CLASSPATH entries for MapReduce applications. Required for Sqoop.
Include the entries: $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,$CDH_MR2_HOME
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for Sqoop.
Set to: yarn
mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default port is 10020. Required for Sqoop.
Set to: <MapReduce JobHistory Server>:<port>
mapreduce.jobhistory.intermediate-done-dir
Directory where MapReduce jobs write history files. Required for Sqoop.
Set to: /mr-history/tmp
mapreduce.jobhistory.done-dir
Directory where the MapReduce JobHistory Server manages history files. Required for Sqoop.
Set to: /mr-history/done
mapreduce.jobhistory.principal
The Service Principal Name for the MapReduce JobHistory Server. Required for Sqoop.
Set to: mapred/_HOST@YOUR-REALM
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default port is 19888. Required for Sqoop.
Set to: <host>:<port>
yarn.app.mapreduce.am.staging-dir
The HDFS staging directory used while submitting jobs.
Set to the staging directory path.
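
A possible mapred-site.xml layout for the Sqoop-related properties follows; the JobHistory Server host (jhserver.example.com) and realm (EXAMPLE.COM) are placeholder assumptions:

<!-- Run MapReduce jobs on YARN -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<!-- JobHistory Server addresses; host is a placeholder, ports are the defaults -->
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jhserver.example.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>jhserver.example.com:19888</value>
</property>
<!-- Service Principal Name for the JobHistory Server; realm is a placeholder -->
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>mapred/_HOST@EXAMPLE.COM</value>
</property>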

yarn-site.xml

Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Required for dynamic resource allocation.
Add spark_shuffle.jar to the class path. The .jar file must contain the class org.apache.spark.network.yarn.YarnShuffleService.
yarn.nodemanager.resource.memory-mb
The maximum RAM available for containers on each node. Increase the maximum memory on the cluster to increase the resource memory available to the Blaze engine.
Set to 16 GB if the value is less than 16 GB.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores available for containers on each node. Required for Blaze engine resource allocation.
Set to 10 if the value is less than 10.
yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
Set to 6 GB if the value is less than 6 GB.
yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.
Set to: FALSE
yarn.nodemanager.aux-services
Required for dynamic resource allocation for the Spark engine.
Add an entry for "spark_shuffle."
yarn.nodemanager.aux-services.spark_shuffle.class
Required for dynamic resource allocation for the Spark engine.
Set to: org.apache.spark.network.yarn.YarnShuffleService
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler
yarn.node-labels.enabled
Enables node labeling.
Set to: TRUE
yarn.node-labels.fs-store.root-dir
The HDFS location where node labels are stored so that they can be updated dynamically.
Set to: hdfs://<node name>:<port>/<path to node labels store>
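
The following sketch shows yarn-site.xml entries that reflect the minimum Blaze and Spark engine values described above; 16 GB and 6 GB are expressed as 16384 MB and 6144 MB, and the mapreduce_shuffle entry in yarn.nodemanager.aux-services is assumed to already exist on the cluster:

<!-- Resource minimums for the Blaze engine -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>16384</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>10</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>6144</value>
</property>
<!-- Disable virtual memory limits for the Blaze and Spark engines -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- Dynamic resource allocation for the Spark engine -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>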
