Integration Guide


Configure *-site.xml Files for Google Dataproc

The Hadoop administrator needs to configure *-site.xml file properties and restart impacted services before the Informatica administrator imports cluster information into the domain.

core-site.xml

Configure the following properties in the core-site.xml file:

fs.s3.enableServerSideEncryption: Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.; Set to TRUE.
fs.s3a.access.key: The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.; Set to your access key.
fs.s3a.secret.key: The password for the Blaze and Spark engines to connect to the Amazon S3 file system.; Set to your secret key.
fs.s3a.server-side-encryption-algorithm: The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption.; Set to the encryption algorithm used.
hadoop.proxyuser.<proxy user>.groups: Defines the groups whose members the proxy user account can impersonate. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.; Set to the group names of impersonation users, separated by commas. If less security is preferred, use the wildcard "*" to allow impersonation from any group.
After you make changes to proxy user properties, restart the credential service and other cluster services that use core-site configuration values.
hadoop.proxyuser.<proxy user>.hosts: Defines the host machines from which the proxy user account can impersonate other users. On a secure cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.; Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred, use the wildcard "*" to allow impersonation from any host.
After you make changes to proxy user properties, restart the credential service and other cluster services that use core-site configuration values.
hadoop.proxyuser.hive.hosts: Comma-separated list of hosts from which the Hive user is allowed to impersonate other users on a non-secure cluster.; When hive.server2.enable.doAs is false, append a comma-separated list of the Informatica server host names or IP addresses where the Data Integration Service runs. If less security is preferred, use the wildcard "*" to allow impersonation from any host.
After you make changes to this property, restart the cluster services that use core-site configuration values.
io.compression.codecs: Enables compression on temporary staging tables.; Set to a comma-separated list of compression codec classes on the cluster.
hadoop.security.auth_to_local: Translates the principal names from the Active Directory and MIT realm into local names within the Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.; Set to: RULE:[1:$1@$0](^.*@YOUR.REALM)s/^(.*)@YOUR.REALM\.COM$/$1/g; Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.$)s/^(.*)@YOUR.REALM\.COM$/$1/g
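
As an illustration of the proxy user properties above, the following is a minimal sketch of the corresponding core-site.xml entries. The proxy user name infa_user, the group names, and the host names are hypothetical placeholders, not values from this guide. Substitute the Service Principal Name or system user, groups, and hosts that apply to your cluster.

```xml
<configuration>
  <!-- "infa_user" is a hypothetical proxy user; replace it with the SPN or
       the system user that runs the Informatica daemon. -->
  <property>
    <name>hadoop.proxyuser.infa_user.groups</name>
    <!-- Hypothetical group names, comma-separated. Use * to allow any group. -->
    <value>etl_developers,etl_operators</value>
  </property>
  <property>
    <name>hadoop.proxyuser.infa_user.hosts</name>
    <!-- Hypothetical host names, comma-separated. Use * to allow any host. -->
    <value>dis-host1.example.com,dis-host2.example.com</value>
  </property>
</configuration>
```

After a change like this, restart the credential service and the other cluster services that read core-site values, as described above.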

fair-scheduler.xml

Configure the following properties in the fair-scheduler.xml file:

allowPreemptionFrom: Enables preemption for the Fair Scheduler. The Blaze engine does not support preemption. If YARN preemption is enabled for the cluster, you need to disable it for the queue allocated to the Blaze engine.; Set to FALSE for the queue allocated to the Blaze engine.; For example:
<queue name="Blaze">
    <weight>1.0</weight>
    <allowPreemptionFrom>false</allowPreemptionFrom>
    <schedulingPolicy>fsp</schedulingPolicy>
    <aclSubmitApps>*</aclSubmitApps>
    <aclAdministerApps>*</aclAdministerApps>
</queue>

hbase-site.xml

Configure the following properties in the hbase-site.xml file:

zookeeper.znode.parent: Identifies HBase master and region servers.; Set to the relative path to the znode directory of HBase.
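
For reference, an hbase-site.xml entry for this property might look like the following sketch. The value /hbase is the stock HBase default and is used here only as an assumption. Some distributions use a different znode path, so check the value on your cluster.

```xml
<configuration>
  <property>
    <name>zookeeper.znode.parent</name>
    <!-- Assumed value: /hbase is the HBase default; your cluster may differ. -->
    <value>/hbase</value>
  </property>
</configuration>
```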

hdfs-site.xml

Configure the following properties in the hdfs-site.xml file:

dfs.encryption.key.provider.uri: The KeyProvider used to interact with encryption keys when reading and writing to an encryption zone. Required if sources or targets reside in the HDFS encrypted zone on a Java KeyStore KMS-enabled Cloudera CDH cluster or a Ranger KMS-enabled Hortonworks HDP cluster.; Set to: kms://http@xx11.xyz.com:16000/kms
dfs.namenode.rpc-bind-host: The actual address that the Remote Procedure Call (RPC) server binds to. If this optional address is set, it overrides the hostname portion of dfs.namenode.rpc-address. Required when you run mappings on a non-VPN Dataproc cluster.; Set to: 0.0.0.0 so that the NameNode listens on both private and public network interfaces, which allows remote access and datanode access.
dfs.namenode.servicerpc-bind-host: The actual address that the service RPC server binds to. If this optional address is set, it overrides the hostname portion of dfs.namenode.servicerpc-address. Required when you run mappings on a non-VPN Dataproc cluster.; Set to: 0.0.0.0 so that the NameNode listens on both private and public network interfaces, which allows remote access and datanode access.
dfs.namenode.http-bind-host: The actual address that the HTTP server binds to. If this optional address is set, it overrides the hostname portion of dfs.namenode.http-address. Required when you run mappings on a non-VPN Dataproc cluster.; Set to: 0.0.0.0 so that the NameNode listens on both private and public network interfaces, which allows remote access and datanode access.
dfs.namenode.https-bind-host: The actual address that the HTTPS server binds to. If this optional address is set, it overrides the hostname portion of dfs.namenode.https-address. Required when you run mappings on a non-VPN Dataproc cluster.; Set to: 0.0.0.0 so that the NameNode listens on both private and public network interfaces, which allows remote access and datanode access.
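
The four bind-host properties above could be expressed in hdfs-site.xml roughly as follows. This is a sketch of the non-VPN Dataproc case only, using the 0.0.0.0 value stated in this section. It leaves out every other hdfs-site property.

```xml
<configuration>
  <!-- Bind each NameNode endpoint to all network interfaces. -->
  <property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.http-bind-host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>dfs.namenode.https-bind-host</name>
    <value>0.0.0.0</value>
  </property>
</configuration>
```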

hive-site.xml

Configure the following properties in the hive-site.xml file:

hive.async.log.enabled: Enables asynchronous logging. Required when you enable pre-task and post-task monitoring statistics on a Dataproc cluster.; Set to: FALSE
hive.cluster.delegation.token.store.class: The token store implementation. Required for HiveServer2 high availability and load balancing.; Set to: org.apache.hadoop.hive.thrift.DBTokenStore
hive.exec.dynamic.partition: Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.; Set to: TRUE
hive.exec.dynamic.partition.mode: Allows all partitions to be dynamic. Required if you use Sqoop and define a DDL query to create or replace a partitioned Hive target at run time.; Set to: nonstrict
hive.server2.in.place.progress: Allows HiveServer2 to send progress bar update information. Takes effect only when you enable Tez. Required when you enable pre-task and post-task monitoring statistics on a Dataproc cluster.; Set to: TRUE
hive.server2.logging.operation.level: HiveServer2 logging level at the session level. Requires hive.server2.logging.operation.enabled to be set to TRUE. Required when you enable pre-task and post-task monitoring statistics on a Dataproc cluster.; Set to: EXECUTION
hive.server2.logging.operation.enabled: Enables logs to be saved. Required when you enable pre-task and post-task monitoring statistics on a Dataproc cluster.; Set to: TRUE
hive.server2.zookeeper.namespace: The value of the ZooKeeper namespace in the JDBC connection string. Required for HiveServer2 high availability.; Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
hive.zookeeper.quorum: Comma-separated list of ZooKeeper server host:ports in a cluster. The value of the ZooKeeper ensemble in the JDBC connection string. Required for HiveServer2 high availability.; Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;
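
As a sketch, the hive-site.xml properties that this section lists as required for pre-task and post-task monitoring statistics on a Dataproc cluster could be written as follows. The values come from the list above. The lowercase true/false spelling is an assumption about how the values are usually written in Hadoop configuration files.

```xml
<configuration>
  <property>
    <name>hive.async.log.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.in.place.progress</name>
    <value>true</value>
  </property>
  <!-- Operation logging must be enabled for the logging level to apply. -->
  <property>
    <name>hive.server2.logging.operation.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.logging.operation.level</name>
    <value>EXECUTION</value>
  </property>
</configuration>
```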

mapred-site.xml

Configure the following properties in the mapred-site.xml file:

mapreduce.framework.name: The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for Sqoop.; Set to: yarn
mapreduce.jobhistory.address: Location of the MapReduce JobHistory Server. The default port is 10020. Required for Sqoop.; Set to: <MapReduce JobHistory Server>:<port>
mapreduce.jobhistory.intermediate-done-dir: Directory where MapReduce jobs write history files. Required for Sqoop.; Set to: /mr-history/tmp
mapreduce.jobhistory.done-dir: Directory where the MapReduce JobHistory Server manages history files. Required for Sqoop.; Set to: /mr-history/done
mapreduce.jobhistory.principal: The Service Principal Name for the MapReduce JobHistory Server. Required for Sqoop.; Set to: mapred/_HOST@YOUR-REALM
mapreduce.jobhistory.webapp.address: Web address of the MapReduce JobHistory Server. The default port is 19888. Required for Sqoop.; Set to: <host>:<port>
yarn.app.mapreduce.am.staging-dir: The HDFS staging directory used while submitting jobs.; Set to the staging directory path.
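
The Sqoop-related properties above might appear in mapred-site.xml as in the following sketch. The host name jobhistory.example.com is a hypothetical placeholder for the MapReduce JobHistory Server. The ports are the defaults named in this section.

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <!-- Hypothetical host; 10020 is the default port. -->
    <value>jobhistory.example.com:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <!-- Hypothetical host; 19888 is the default port. -->
    <value>jobhistory.example.com:19888</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/mr-history/tmp</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/mr-history/done</value>
  </property>
</configuration>
```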

tez-site.xml

Configure the following properties in the tez-site.xml file:

tez.am.tez-ui.history-url.template: Tez UI URL template for the application. The application manager uses this URL to redirect the user to the Tez UI. Required when you enable pre-task and post-task monitoring statistics on a Dataproc cluster.; Set value to: __HISTORY_URL_BASE__?%2F%23%2Ftez-app%2F__APPLICATION_ID__
The values of __HISTORY_URL_BASE__ and __APPLICATION_ID__ are resolved at runtime. Do not edit the string to supply values.
tez.runtime.io.sort.mb: The sort buffer memory. Required when the output needs to be sorted for Blaze and Spark engines.; Set value to 270 MB.
tez.task.generate.counters.per.io: Enables pre-task and post-task monitoring statistics on an Amazon EMR or Dataproc cluster.; Set to: TRUE
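
A sketch of two of the tez-site.xml entries above follows. It assumes that tez.runtime.io.sort.mb takes a plain number of megabytes, as the property name suggests. The URL template property is omitted so that its runtime placeholders are not retyped here.

```xml
<configuration>
  <property>
    <name>tez.runtime.io.sort.mb</name>
    <!-- Sort buffer size in MB. -->
    <value>270</value>
  </property>
  <property>
    <name>tez.task.generate.counters.per.io</name>
    <value>true</value>
  </property>
</configuration>
```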

yarn-site.xml

Configure the following properties in the yarn-site.xml file:

yarn.application.classpath: Required for dynamic resource allocation.; Add spark_shuffle.jar to the class path. The .jar file must contain the class "org.apache.spark.network.yarn.YarnShuffleService".
yarn.nodemanager.resource.memory-mb: The maximum RAM available for each container. Set the maximum memory on the cluster to increase resource memory available to the Blaze engine.; Set the value to at least 16 GB (16384, because the property is expressed in MB).
yarn.nodemanager.resource.cpu-vcores: The number of virtual cores for each container. Required for Blaze engine resource allocation.; Set the value to at least 10.
yarn.scheduler.minimum-allocation-mb: The minimum RAM available for each container. Required for Blaze engine resource allocation.; Set the value to at least 6 GB (6144, because the property is expressed in MB).
yarn.nodemanager.vmem-check-enabled: Disables virtual memory limits for containers. Required for the Blaze and Spark engines.; Set to: false
yarn.nodemanager.aux-services: Required for dynamic resource allocation for the Spark engine.; Add an entry for "spark_shuffle".
yarn.nodemanager.aux-services.spark_shuffle.class: Required for dynamic resource allocation for the Spark engine.; Set to: org.apache.spark.network.yarn.YarnShuffleService
yarn.resourcemanager.scheduler.class: Defines the YARN scheduler that the Data Integration Service uses to assign resources.; Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler
yarn.node-labels.enabled: Enables node labeling.; Set to: TRUE
yarn.node-labels.fs-store.root-dir: The HDFS location where node labels are stored so that they can be updated dynamically.; Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>
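
The resource and shuffle-service settings above could look like the following in yarn-site.xml. This is a sketch at the minimum values named in this section. The existing mapreduce_shuffle entry in yarn.nodemanager.aux-services is an assumption about a typical cluster. Keep whatever entries your cluster already lists and append spark_shuffle.

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <!-- 16 GB expressed in MB. -->
    <value>16384</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>10</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <!-- 6 GB expressed in MB. -->
    <value>6144</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <!-- mapreduce_shuffle is assumed to be present already. -->
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
</configuration>
```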
