Tuning and Sizing Guidelines for Data Engineering Integration (10.4.x)

Troubleshooting Spark Job Failures

This section describes how to troubleshoot common error messages and limitations that you might encounter when you enable dynamic resource allocation on the Spark engine. These errors might occur when you process a large volume of data, such as 10 TB or more, or when a job has a large shuffle volume.
Could not find CoarseGrainedScheduler.
When you stop a process, you might lose one or more executors with the following error:
cluster.YarnScheduler: Lost executor 8 on myhost1.com: remote Rpc client disassociated
One of the most common reasons for executor failure is insufficient memory. When an executor consumes more memory than the maximum limit, YARN causes the executor to fail. By default, Spark does not set an upper limit for the number of executors if dynamic allocation is enabled. (SPARK-14228)
Configure the following advanced properties for Spark in the Hadoop connection:
Property: spark.dynamicAllocation.maxExecutors
Description: Set a limit for the number of executors. Determine the value based on available cores and memory per node.

Property: spark.executor.memory
Description: Increase the amount of memory per executor process. The default value is 6 GB.
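As an illustration only, consider a hypothetical cluster with 10 worker nodes, each offering 16 cores and 64 GB of memory to YARN; all figures here are examples, not recommendations. At roughly 5 cores per executor, each node can host about 3 executors, so a reasonable ceiling is about 30 executors, each with roughly 64 GB / 3 ≈ 21 GB minus headroom for executor memory overhead. Written as property=value pairs (the exact entry format depends on how you define advanced properties in your Hadoop connection), the settings might look like this:

spark.dynamicAllocation.maxExecutors=30
spark.executor.memory=18G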
Total size of serialized results is bigger than spark.driver.maxResultSize.
The spark.driver.maxResultSize property limits the total size of serialized results across all partitions for each Spark action, such as the collect action. The Spark driver issues a collect() for the whole broadcast data set. The Spark default of 1 GB is overridden and increased to 4 GB, which should suffice for most use cases. If the Spark driver fails with the following error message, consider increasing this value:
Total size of serialized results is bigger than spark.driver.maxResultSize
Configure the following advanced property for Spark in the Hadoop connection:
Property: spark.driver.maxResultSize
Description: Set the result size to a value equal to or greater than the driver memory, or to 0 for unlimited size.
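For example, if the Spark driver runs with 8 GB of memory (an illustrative figure), you might enter:

spark.driver.maxResultSize=8G

Alternatively, set the property to 0 to remove the limit entirely.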
java.util.concurrent.TimeoutException; Futures timed out after [300 seconds].
The default broadcast timeout limit is set to 300 seconds. Increase the SQL broadcast timeout limit.
Configure the following advanced property for Spark in the Hadoop connection:
Property: spark.sql.broadcastTimeout
Description: Set the timeout limit to at least 600 seconds.
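The property takes a value in seconds. For example, to raise the limit to 600 seconds:

spark.sql.broadcastTimeout=600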
A job fails due to Spark speculative execution of tasks.
With Spark speculation, the Spark engine relaunches one or more tasks that are running slowly in a stage. To successfully run the job, disable Spark speculation.
Configure the following advanced property for Spark in the Hadoop connection:
Property: spark.speculation
Description: Set the value to false.
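The corresponding entry looks like this:

spark.speculation=false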
The Spark driver process hangs.
The Spark driver process might hang due to multiple reasons. The Spark driver process dump and the YARN application logs might not reveal any information to isolate the cause.
The following Informatica Knowledge Base article describes a step-by-step process that you can use to troubleshoot mappings that fail because the Spark driver process hangs:
HOW TO: Troubleshoot a mapping that fails on the Spark engine when the Spark driver process hangs
ShuffleMapStage 12 (rdd at InfaSprk1.scala:48) has failed the maximum allowable number of times: 4.
The Spark shuffle service fails because the garbage collector exceeded the overhead limit. This forces the Node Manager to shut down, which eventually causes the Spark job to fail.
To resolve this issue, perform the following steps:
  1. Open the YARN node manager.
  2. In the NodeManager Java heap size property, increase the maximum heap size in MB.
For further debugging, check the Node Manager logs:
java.lang.OutOfMemoryError: GC overhead limit exceeded
2016-12-07 19:38:29,934 FATAL yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[IPC Server handler 0 on 8040,5,main] threw an error. Shutting down now...
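If you manage the cluster by editing configuration files directly rather than through a cluster manager UI, the NodeManager heap is typically controlled by the YARN_NODEMANAGER_HEAPSIZE variable in yarn-env.sh on Hadoop 2.x distributions. The value below, in MB, is only an illustration:

export YARN_NODEMANAGER_HEAPSIZE=4096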
NoRouteToHostException shown in the YARN application master log with the Sequence Generator transformation.
Spark tasks communicate with the Data Integration Service through the Data Integration Service HTTP port to get the sequence range. Ensure that the Data Integration Service is accessible through the HTTP port from all Hadoop cluster nodes. If the HTTP port is not accessible, Spark tasks that run the Sequence Generator transformation fail.
A NoRouteToHostException in the YARN application master log indicates that the Data Integration Service HTTP port is not accessible from the Hadoop cluster nodes. The following example shows the exception in the YARN application master log:
18/06/29 14:26:31 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, <DIS_Host_Name>, executor 1): java.net.NoRouteToHostException: No route to host (Host unreachable) at java.net.PlainSocketImpl.socketConnect(Native Method)
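To confirm connectivity before you rerun the mapping, you can test the port from each Hadoop cluster node with any available network tool. For example, with netcat, substituting your Data Integration Service host and HTTP port for the placeholders:

nc -zv <DIS_Host_Name> <DIS_HTTP_Port>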
