Troubleshooting Mappings in a Non-native Environment

The following tips can help you troubleshoot mappings that run in a non-native environment.

Hadoop Environment

When I run a mapping with a Hive source or a Hive target on a different cluster, the Data Integration Service fails to push the mapping to Hadoop with the following error:
Failed to execute query [exec0_query_6] with error code [10], error message [FAILED: Error in semantic analysis: Line 1:181 Table not found customer_eur], and SQL state [42000].
When you run a mapping in a Hadoop environment, the Hive connection that you select for the Hive source or Hive target must point to the same Hive metastore as the cluster where the mapping runs.
When I run a mapping with SQL overrides concurrently, the mapping hangs.
There are not enough available resources because the cluster is being shared across different engines.
Configure separate YARN scheduler queues for the Blaze and Spark engines so that HiveServer2 can run the SQL overrides through these engines without contending for resources.
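The queue setup depends on the scheduler that the cluster uses. With the Fair Scheduler, for example, separate queues might be defined in fair-scheduler.xml. The queue names blaze and spark below are illustrative; use the queues that your Hadoop administrator assigns:

```xml
<allocations>
  <!-- Illustrative queue names and weights; adjust to your cluster policy. -->
  <queue name="blaze">
    <weight>1.0</weight>
  </queue>
  <queue name="spark">
    <weight>1.0</weight>
  </queue>
</allocations>
```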
Mappings run on the Blaze engine fail with the following preemption error messages:
2018-09-27 11:05:27.208 INFO: Container completion status: id [container_e135_1537815195064_4755_01_000012]; state [COMPLETE]; diagnostics [Container preempted by scheduler]; exit status [-102].. 2018-09-27 11:05:27.208 SEVERE: Service [OOP_Container_Manager_Service_2] has stopped running..
The Blaze engine does not support YARN preemption on either the Capacity Scheduler or the Fair Scheduler. Ask the Hadoop administrator to disable preemption on the queue allocated to the Blaze engine. For more information, see Mappings Fail with Preemption Errors.
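On the Capacity Scheduler, one way the Hadoop administrator might disable preemption for the Blaze queue is the per-queue disable_preemption property in capacity-scheduler.xml. The queue path root.blaze is an assumption; substitute the queue allocated to the Blaze engine:

```xml
<property>
  <!-- root.blaze is a placeholder for the queue allocated to Blaze. -->
  <name>yarn.scheduler.capacity.root.blaze.disable_preemption</name>
  <value>true</value>
</property>
```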
When I configure a mapping to create a partitioned Hive table, the mapping fails with the error "Need to specify partition columns because the destination table is partitioned."
This issue happens because of internal Informatica requirements for a query that is designed to create a Hive partitioned table. For details and a workaround, see Knowledge Base article 516266.
When Spark runs a mapping with a Hive source and target, and uses the Hive Warehouse Connector, the mapping fails with the following error:
[[SPARK_1003] Spark task [<task name>] failed with the following error: [User class threw exception: java.lang.reflect.InvocationTargetException​ ... java.sql.SQLException: Cannot create PoolableConnectionFactory (Could not open client transport for any of the Server URI's in ZooKeeper: Could not establish connection...)
The issue occurs because the Data Integration Service fails to fetch the Hive delegation token.
Workaround: Add the URL for HiveServer2 Interactive to the advanced properties of the Hadoop connection:
  1. In the Ambari web console, browse to Services > Hive > Configs > Advanced > Advanced hive-site, and copy the value of the property hive.server2.authentication.kerberos.principal.
  2. Edit the Advanced Properties of the Hadoop connection to add the property spark.sql.hive.hiveserver2.jdbc.url.principal.
  3. Paste the value that you copied in step 1 as the value of spark.sql.hive.hiveserver2.jdbc.url.principal.
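The result is a single key-value pair in the advanced properties of the Hadoop connection. The value is the Kerberos principal that you copied, which typically has the form hive/_HOST@&lt;REALM&gt;; the realm below is a placeholder:

```
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@EXAMPLE.COM
```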
Time stamp data that is precise to the nanosecond is truncated when a mapping runs on the Spark engine.
Spark stores time stamp data to a precision of 1 microsecond (1us) and does not support nanosecond precision. When a mapping that runs on the Spark engine reads datetime data that has nanosecond precision, the data is truncated to the microsecond. For example, 2015-01-02 00:00:00.000456789 is truncated to 2015-01-02 00:00:00.000456.
The Blaze engine supports nanosecond precision.
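The effect on the data is equivalent to dropping any fractional-second digits beyond the sixth. The following sketch mimics that behavior for illustration only; it is not the Spark implementation:

```python
def truncate_to_micros(ts: str) -> str:
    """Keep at most 6 fractional-second digits, mirroring microsecond precision."""
    if "." not in ts:
        return ts  # no fractional seconds, nothing to truncate
    base, frac = ts.split(".")
    return f"{base}.{frac[:6]}"

print(truncate_to_micros("2015-01-02 00:00:00.000456789"))
# 2015-01-02 00:00:00.000456
```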

Databricks Environment

Mappings fail with the following error:
SEVERE: Run with ID [1857] failed with state [INTERNAL_ERROR] and error message [Library installation timed out after 1800 seconds. Libraries that are not yet installed: jar: "dbfs:/tmp/DATABRICKS/sess6250142538173973565/staticCode.jar"
This might happen when you run concurrent jobs. When Databricks does not have resources to process a job, it queues the job for a maximum of 1,800 seconds (30 minutes). If resources are not available in 30 minutes, the job fails. Consider the following actions to avoid timeouts:
  • Configure preemption environment variables on the Databricks cluster to control the amount of resources that are allocated to each job. For more information about preemption, see the Data Engineering Integration Guide.
  • Run cluster workflows to create ephemeral clusters. You can configure the workflow to create a cluster, run the job, and then delete the cluster. For more information about ephemeral clusters, see Cluster Workflows.
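For example, Databricks exposes Spark configuration properties that control task preemption on high-concurrency clusters. The property names below reflect the Databricks preemption settings, but the threshold value is illustrative and the exact set of supported properties depends on your Databricks runtime:

```
spark.databricks.preemption.enabled true
spark.databricks.preemption.threshold 0.5
```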
When I run mappings on a Dataproc cluster to read or write data to SerDe-backed Hive tables, the mapping fails with the java.lang.ClassNotFoundException: org.apache.hive.hcatalog.data.JsonSerDe error.
Add the following property to hive-site.xml on all the nodes of the Dataproc cluster, and then restart the Hive server:
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/lib/hive-hcatalog-core-<version>.jar</value>
</property>
When I run mappings on a non-VPN Dataproc cluster, the mappings fail.
Configure the following properties in hdfs-site.xml on all the nodes of the Dataproc cluster:
<property>
  <name>dfs.namenode.rpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.servicerpc-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.namenode.https-bind-host</name>
  <value>0.0.0.0</value>
</property>
<property>
  <name>dfs.client.use.datanode.hostname</name>
  <value>true</value>
</property>
When I set job-level optimization on a mapping, the session log reflects the setting, but the Databricks Spark engine ignores it.
The Databricks Spark engine performs auto-optimization on jobs based on configuration settings for the cluster. It ignores custom configurations, such as the Spark.default.parallelism property. You cannot configure job-level optimization on a mapping that runs on the Databricks Spark engine.
Informatica integration with Databricks supports standard concurrency clusters. Standard concurrency clusters have a maximum queue time of 30 minutes, and a job fails when the timeout is reached. The maximum queue time cannot be extended.

Setting the preemption threshold allows more jobs to run concurrently, but because each job receives a lower percentage of allocated resources, jobs can take longer to run. Also, configuring the environment for preemption does not ensure that all jobs will run. In addition to configuring preemption, you can run cluster workflows that create an ephemeral cluster, run the job, and then delete the cluster. For more information about Databricks concurrency, contact Azure Databricks.
