Configure *-site.xml Files for Amazon EMR

The Hadoop administrator needs to configure *-site.xml file properties and restart the impacted services before the Informatica administrator imports cluster information into the domain.

capacity-scheduler.xml

Configure the following properties in the capacity-scheduler.xml file:
yarn.scheduler.capacity.<queue path>.disable_preemption
Disables preemption for the Capacity Scheduler. The Blaze engine does not support preemption. If YARN preemption is enabled for the cluster, you need to disable it for the queue allocated to the Blaze engine.
Set to TRUE for the queue allocated to the Blaze engine.
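For example, a minimal sketch that assumes the queue allocated to the Blaze engine is named root.blaze; substitute your own queue path:
<!-- root.blaze is a hypothetical queue path; replace it with your Blaze queue -->
<property>
  <name>yarn.scheduler.capacity.root.blaze.disable_preemption</name>
  <value>true</value>
</property>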

core-site.xml

Configure the following properties in the core-site.xml file:
fs.s3.awsAccessKeyID
The access key ID that the run-time engine uses to connect to the Amazon S3 file system. Required for the Blaze engine, and for the Spark engine if the S3 policy does not allow EMR access or if you use EMRFS access and the Informatica domain does not reside on an EC2 instance.
If the Data Integration Service is deployed on an EC2 instance and the IAM roles and policies allow access to S3 and other resources, this property is not required. If the Data Integration Service is deployed on-premises, then you can choose to configure the value for this property in the cluster configuration on the Data Integration Service after you import the cluster configuration. Configuring the AccessKeyID value on the cluster configuration is more secure than configuring it in core-site.xml on the cluster.
Set to your access ID.
fs.s3.awsSecretAccessKey
The secret access key for the Blaze and Spark engines to connect to the Amazon S3 file system. Required for the Blaze engine, and for the Spark engine if the S3 policy does not allow EMR access or if you use EMRFS access and the Informatica domain does not reside on an EC2 instance.
If the Data Integration Service is deployed on an EC2 instance and the IAM roles and policies allow access to S3 and other resources, this property is not required. If the Data Integration Service is deployed on-premises, then you can choose to configure the value for this property in the cluster configuration on the Data Integration Service after you import the cluster configuration. Configuring the secret access key value on the cluster configuration is more secure than configuring it in core-site.xml on the cluster.
Set to your access key.
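For example, a minimal sketch with placeholder credentials; the values shown are not real keys:
<property>
  <name>fs.s3.awsAccessKeyID</name>
  <value>YOUR_ACCESS_KEY_ID</value> <!-- placeholder -->
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value> <!-- placeholder -->
</property>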
fs.s3.enableServerSideEncryption
Enables server-side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.
Set to TRUE.
fs.s3a.server-side-encryption-algorithm
The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the encryption algorithm used.
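For example, a sketch that assumes SSE encryption with the AES256 algorithm; use SSE-KMS instead if the bucket is encrypted with KMS-managed keys:
<property>
  <name>fs.s3.enableServerSideEncryption</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value> <!-- assumption: SSE; set to SSE-KMS for KMS-managed keys -->
</property>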
fs.s3a.endpoint
URL of the entry point for the web service.
For example:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3-us-west-1.amazonaws.com</value>
</property>
fs.s3a.bucket.BUCKET_NAME.server-side-encryption.key
Server-side encryption key for the S3 bucket. Required if the S3 bucket is encrypted with SSE-KMS.
For example:
<property>
  <name>fs.s3a.bucket.BUCKET_NAME.server-side-encryption.key</name>
  <value>arn:aws:kms:us-west-1*******</value>
  <source>core-site.xml</source>
</property>
where BUCKET_NAME is the name of the S3 bucket.
hadoop.proxyuser.<proxy user>.groups
Defines the groups of users that the proxy user account can impersonate. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set to a comma-separated list of the group names of the impersonation users. If less security is preferred, use the wildcard "*" to allow impersonation from any group.
After you make changes to proxy user properties, restart the credential service and other cluster services that use core-site configuration values.
hadoop.proxyuser.<proxy user>.hosts
Defines the hosts from which the proxy user account can impersonate other users. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred, use the wildcard "*" to allow impersonation from any host.
After you make changes to proxy user properties, restart the credential service and other cluster services that use core-site configuration values.
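For example, a minimal sketch for a non-secure cluster, assuming a hypothetical system user named infauser that runs the Informatica daemon on host infahost.example.com:
<!-- infauser and infahost.example.com are hypothetical; substitute your own values -->
<property>
  <name>hadoop.proxyuser.infauser.groups</name>
  <value>hadoop,hive</value>
</property>
<property>
  <name>hadoop.proxyuser.infauser.hosts</name>
  <value>infahost.example.com</value>
</property>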
hadoop.proxyuser.hive.hosts
Comma-separated list of hosts that you want to allow the Hive user to impersonate on a non-secure cluster.
When hive.server2.enable.doAs is false, append a comma-separated list of the host names or IP addresses of the machines where the Data Integration Service runs. If less security is preferred, use the wildcard "*" to allow impersonation from any host.
After you make changes to this property, restart the cluster services that use core-site configuration values.
hadoop.proxyuser.yarn.groups
Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure cluster.
Set to a comma-separated list of the group names of the impersonation users. If less security is preferred, use the wildcard "*" to allow impersonation from any group.
After you make changes to proxyuser properties, restart the credential service and other cluster services that use core-site configuration values.
hadoop.proxyuser.yarn.hosts
Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure cluster.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred, use the wildcard "*" to allow impersonation from any host.
After you make changes to proxy user properties, restart the credential service and other cluster services that use core-site configuration values.
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
Set to: RULE:[1:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
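For example, a sketch that assumes a single realm named YOUR.REALM.COM; the trailing DEFAULT rule is the standard fallback:
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
    RULE:[2:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
    DEFAULT
  </value>
</property>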
io.compression.codecs
Enables compression on temporary staging tables.
Set to a comma-separated list of compression codec classes on the cluster.
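For example, a sketch that lists commonly available codec classes; include only the codecs that are actually installed on your cluster:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>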

fair-scheduler.xml

Configure the following properties in the fair-scheduler.xml file:
allowPreemptionFrom
Determines whether the Fair Scheduler can preempt resources from the queue. The Blaze engine does not support preemption. If YARN preemption is enabled for the cluster, you need to disable it for the queue allocated to the Blaze engine.
Set to FALSE for the queue allocated to the Blaze engine.
For example:
<queue name="Blaze"> <weight>1.0</weight> <allowPreemptionFrom>false</allowPreemptionFrom> <schedulingPolicy>fsp</schedulingPolicy> <aclSubmitApps>*</aclSubmitApps> <aclAdministerApps>*</aclAdministerApps> </queue>

hbase-site.xml

Configure the following properties in the hbase-site.xml file:
hbase.use.dynamic.jars
Enables metadata import and test connection from the Developer tool. Required for an HDInsight cluster that uses ADLS storage or an Amazon EMR cluster that uses HBase resources in S3 storage.
Set to: false
zookeeper.znode.parent
Identifies HBase master and region servers.
Set to the relative path to the znode directory of HBase.
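For example, a sketch that assumes the default znode directory /hbase; verify the path for your cluster:
<property>
  <name>hbase.use.dynamic.jars</name>
  <value>false</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase</value> <!-- assumption: default znode path -->
</property>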

hive-site.xml

Configure the following properties in the hive-site.xml file:
hive.compactor.initiator.on
Runs the initiator and cleaner threads on the metastore instance. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: TRUE
hive.compactor.worker.threads
The number of worker threads to run in a metastore instance. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: 1
hive.conf.hidden.list
Comma-separated list of hidden configuration properties.
Set to: javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.access.key,fs.s3a.secret.key,fs.s3a.proxy.password
hive.enforce.bucketing
Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: TRUE
hive.exec.dynamic.partition
Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.
Set to: TRUE
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a partitioned Hive target at run time.
Set to: nonstrict
hive.support.concurrency
Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: TRUE
hive.txn.manager
Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes to a Hive target.
Set to: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
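For example, a sketch that consolidates the properties required for an Update Strategy transformation in a mapping that writes to a Hive target:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>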
The following properties enable pre-task and post-task monitoring statistics for Amazon EMR jobs in the Developer tool:
hive.async.log.enabled
Enables asynchronous logging. Required when you enable pre-task and post-task monitoring statistics on an Amazon EMR cluster.
Set to: FALSE
hive.server2.in.place.progress
Allows HiveServer2 to send progress bar update information. Takes effect only when Tez is enabled. Required when you enable pre-task and post-task monitoring statistics on an Amazon EMR cluster.
Set to: TRUE
hive.server2.logging.operation.enabled
Enables logs to be saved. Required when you enable pre-task and post-task monitoring statistics on an Amazon EMR cluster.
Set to: TRUE
hive.server2.logging.operation.level
HiveServer2 logging level at the session level. Requires hive.server2.logging.operation.enabled to be set to TRUE. Required when you enable pre-task and post-task monitoring statistics on an Amazon EMR cluster.
Set to: EXECUTION
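For example, a sketch that sets the four monitoring properties together:
<property>
  <name>hive.async.log.enabled</name>
  <value>false</value>
</property>
<property>
  <name>hive.server2.in.place.progress</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.logging.operation.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.server2.logging.operation.level</name>
  <value>EXECUTION</value>
</property>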

kms-site.xml

Configure the following properties in the kms-site.xml file:
hadoop.kms.authentication.kerberos.name.rules
Translates the principal names from the Active Directory and MIT realm into local names within the Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
Set to: RULE:[1:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
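For example, a sketch that assumes a realm named YOUR.REALM.COM; the trailing DEFAULT rule is the standard fallback:
<property>
  <name>hadoop.kms.authentication.kerberos.name.rules</name>
  <value>
    RULE:[1:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
    RULE:[2:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g
    DEFAULT
  </value>
</property>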

mapred-site.xml

Configure the following properties in the mapred-site.xml file:
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for Sqoop.
Set to: yarn
yarn.app.mapreduce.am.staging-dir
The HDFS staging directory used while submitting jobs.
Set to the staging directory path.
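For example, a sketch that assumes the Hadoop default staging directory /tmp/hadoop-yarn/staging; substitute your cluster's path:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/tmp/hadoop-yarn/staging</value> <!-- assumption: Hadoop default staging directory -->
</property>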

tez-site.xml

Configure the following properties in the tez-site.xml file:
tez.am.tez-ui.history-url.template
Tez UI URL template for the application. The application master uses this URL to redirect the user to the Tez UI. Required when you enable pre-task and post-task monitoring statistics on an Amazon EMR cluster.
Set value to: __HISTORY_URL_BASE__?%2F%23%2Ftez-app%2F__APPLICATION_ID__
The values of __HISTORY_URL_BASE__ and __APPLICATION_ID__ are resolved at runtime. Do not edit the string to supply values.
tez.task.generate.counters.per.io
Enables pre-task and post-task monitoring statistics on an Amazon EMR cluster.
Set to: TRUE
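For example, a sketch of the two tez-site.xml properties; the __HISTORY_URL_BASE__ and __APPLICATION_ID__ placeholders are resolved at runtime:
<property>
  <name>tez.am.tez-ui.history-url.template</name>
  <value>__HISTORY_URL_BASE__?%2F%23%2Ftez-app%2F__APPLICATION_ID__</value>
</property>
<property>
  <name>tez.task.generate.counters.per.io</name>
  <value>true</value>
</property>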

yarn-site.xml

Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Required for dynamic resource allocation.
Add spark_shuffle.jar to the class path. The .jar file must contain the class "org.apache.spark.network.yarn.YarnShuffleService."
yarn.nodemanager.resource.memory-mb
The amount of physical memory, in MB, available for containers on each node. Increase the value to increase the resource memory available to the Blaze engine.
Set the value to at least 16384 (16 GB).
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores available for containers on each node. Required for Blaze engine resource allocation.
Set the value to at least 10.
yarn.scheduler.minimum-allocation-mb
The minimum RAM allocation, in MB, for each container. Required for Blaze engine resource allocation.
Set the value to at least 6144 (6 GB).
yarn.nodemanager.vmem-check-enabled
Determines whether virtual memory limits are enforced for containers. The Blaze and Spark engines require the check to be disabled.
Set to: false
yarn.nodemanager.aux-services
Required for dynamic resource allocation for the Spark engine.
Add an entry for "spark_shuffle."
yarn.nodemanager.aux-services.spark_shuffle.class
Required for dynamic resource allocation for the Spark engine.
Set to: org.apache.spark.network.yarn.YarnShuffleService
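For example, a sketch that assumes the standard mapreduce_shuffle service is already configured and adds spark_shuffle alongside it:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value> <!-- assumption: mapreduce_shuffle was already present -->
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>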
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
Set to the fully qualified class name of the scheduler, for example: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn.node-labels.enabled
Enables node labeling.
Set to: TRUE
yarn.node-labels.fs-store.root-dir
The HDFS location used to store and dynamically update node labels.
Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>
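For example, a sketch with a hypothetical NameNode host, port, and store path:
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <value>hdfs://namenode.example.com:8020/yarn/node-labels/</value> <!-- hypothetical host, port, and path -->
</property>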
