Implementing Data Engineering Integration with Google Dataproc

Troubleshooting Mapping Failures on the Dataproc Cluster
This section provides information on troubleshooting common error messages and limitations that you might encounter when you run mappings on a Dataproc cluster.
When you run mappings on a Dataproc cluster that read data from or write data to SerDe-backed Hive tables, the mappings fail with the error java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe.
Add the following property to hive-site.xml on all the nodes of the Dataproc cluster, and then restart the Hive server:

<property>
    <name>hive.aux.jars.path</name>
    <value>file:///usr/lib/hive/lib/hive-hcatalog-core-<version>.jar</value>
</property>
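To restart the Hive server on a Dataproc node, you can typically use systemd. The service name below is an assumption based on recent Dataproc images, so verify it on your cluster first:

# Run on each node that hosts HiveServer2.
# Confirm the unit name first, for example with: systemctl list-units 'hive*'
sudo systemctl restart hive-server2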
When you run mappings on a non-VPN Dataproc cluster, the mappings fail.
Configure the following properties in hdfs-site.xml on all the nodes of the Dataproc cluster:

<property>
    <name>dfs.namenode.rpc-bind-host</name>
    <value>0.0.0.0</value>
</property>
<property>
    <name>dfs.namenode.servicerpc-bind-host</name>
    <value>0.0.0.0</value>
</property>
<property>
    <name>dfs.namenode.http-bind-host</name>
    <value>0.0.0.0</value>
</property>
<property>
    <name>dfs.namenode.https-bind-host</name>
    <value>0.0.0.0</value>
</property>
<property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
</property>
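If you can recreate the cluster, you can apply the same values at cluster creation time instead of editing hdfs-site.xml on each node. The following gcloud command is a sketch that uses the hdfs: file prefix of the --properties flag; the cluster name and region are placeholders:

gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --properties='hdfs:dfs.namenode.rpc-bind-host=0.0.0.0,hdfs:dfs.namenode.servicerpc-bind-host=0.0.0.0,hdfs:dfs.namenode.http-bind-host=0.0.0.0,hdfs:dfs.namenode.https-bind-host=0.0.0.0,hdfs:dfs.client.use.datanode.hostname=true'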
When I run a mapping with a Hive source or a Hive target on a different cluster, the Data Integration Service fails to push the mapping to Hadoop with the following error:
Failed to execute query [exec0_query_6] with error code [10], error message [FAILED: Error in semantic analysis: Line 1:181 Table not found customer_eur], and SQL state [42000]].
When you run a mapping in a Hadoop environment, the Hive sources or Hive targets in the mapping and the Hive connection that you select for them must use the same Hive metastore. If the connection points to a metastore on a different cluster than the one that defines the table, Hive cannot resolve the table name and the query fails with a Table not found error.
When I run a mapping with SQL overrides concurrently, the mapping hangs.
There are not enough available resources because the cluster is being shared across different engines.
Use different YARN scheduler queues for the Blaze and Spark engines to allow HiveServer2 to run SQL overrides through these engines.
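For example, you can define dedicated queues in capacity-scheduler.xml on the cluster. The following is a minimal sketch that assumes the YARN Capacity Scheduler; the queue names blaze and spark and the capacity split are illustrative values, not prescribed ones:

<property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,blaze,spark</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.blaze.capacity</name>
    <value>40</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.spark.capacity</name>
    <value>40</value>
</property>
<property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>20</value>
</property>

After you define the queues, set the YARN queue name in the Blaze configuration and the Spark configuration of the Hadoop connection so that each engine submits its jobs to its own queue.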
When I configure a mapping to create a partitioned Hive table, the mapping fails with the error "Need to specify partition columns because the destination table is partitioned."
This issue occurs because of internal Informatica requirements for the query that creates a partitioned Hive table. For details and a workaround, see Knowledge Base article 516266.
Time stamp data that is precise to the nanosecond is truncated when a mapping runs on the Spark engine.
Spark stores time stamp data to a precision of 1 microsecond (1us) and does not support nanosecond precision. When a mapping that runs on the Spark engine reads datetime data that has nanosecond precision, the data is truncated to the microsecond. For example, 2015-01-02 00:00:00.000456789 is truncated to 2015-01-02 00:00:00.000456.
The Blaze engine supports nanosecond precision.
