Table of Contents

  1. Preface
  2. Advanced clusters
  3. Setting up AWS
  4. Setting up Google Cloud
  5. Setting up Microsoft Azure
  6. Setting up a self-service cluster
  7. Setting up a local cluster
  8. Advanced configurations
  9. Troubleshooting
  10. Appendix A: Command reference

Advanced Clusters

Troubleshooting an advanced cluster
What should I do if the status of the advanced cluster is Unknown?
When the cluster status is Unknown, first verify that the Secure Agent is running. If the agent is not running, enable the agent and check whether the cluster starts running.
If the cluster does not start running, an administrator can run the command to list clusters. If the command output returns the cluster state as partial or in-use, the administrator can run the command to delete the cluster.
For more information about the commands, see the Administrator help.
I looked at the ccs-operation.log file to troubleshoot the advanced cluster, but there wasn’t enough information. Where else can I look?
You can look at the cluster-operation logs that are dedicated to the instance of the advanced cluster. When an external command set begins running, the ccs-operation log displays the path to the cluster-operation logs.
For example:
2020-06-15 21:22:36.094 [reqid:] [T:000057] INFO : c.i.c.s.c.ClusterComputingService [CCS_10400] Starting to run command set [<command set>] which contains the following commands: [ <commands> ; ]. The execution log can be found in the following location: [/data2/home/cldagnt/SystemAgent/apps/At_Scale_Server/35.0.1.1/ccs_home/3xukm9iqp5zeahyrb7rqoz.k8s.local/infa/cluster-operation.log].
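As an illustration, the execution-log path can be pulled out of a CCS_10400 message like the one above with a simple pattern match. This is a sketch against the example message shown, not a supported interface:

```python
import re

# The CCS_10400 message from the example above.
message = (
    "2020-06-15 21:22:36.094 [reqid:] [T:000057] INFO : "
    "c.i.c.s.c.ClusterComputingService [CCS_10400] Starting to run command set "
    "[<command set>] which contains the following commands: [ <commands> ; ]. "
    "The execution log can be found in the following location: "
    "[/data2/home/cldagnt/SystemAgent/apps/At_Scale_Server/35.0.1.1/ccs_home/"
    "3xukm9iqp5zeahyrb7rqoz.k8s.local/infa/cluster-operation.log]."
)

# The log path is the bracketed value after "location:".
match = re.search(r"location: \[([^\]]+)\]", message)
if match:
    print(match.group(1))
```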
The specified folder contains all cluster-operation logs that belong to the instance of the cluster. You can use the logs to view the full stdout and stderr output streams of the command set.
The number in the log name indicates the log’s generation, and each cluster-operation log is at most 10 MB. For example, if the cluster instance generated 38 MB of log messages while running external commands, the folder contains four cluster-operation logs. The latest log has 0 in the file name and the oldest log has 3 in the file name. To view the latest errors, check the cluster-operation0.log file.
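The rotation scheme can be sketched as follows. The 10 MB cap and the generation numbering come from the description above; the function itself is only an illustration, not agent code:

```python
import math

MAX_LOG_MB = 10  # each cluster-operation log is at most 10 MB

def expected_log_files(total_log_mb):
    """Return the expected cluster-operation log names, newest first."""
    count = math.ceil(total_log_mb / MAX_LOG_MB)
    # Generation 0 is the latest log; the highest generation is the oldest.
    return [f"cluster-operation{gen}.log" for gen in range(count)]

# 38 MB of log messages produce four logs, cluster-operation0.log (latest)
# through cluster-operation3.log (oldest).
print(expected_log_files(38))
```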
If you set the log level for the Elastic Server to DEBUG, the ccs-operation log shows the same level of detail as the cluster-operation logs.
How do I find the initialization script logs for the nodes where the init script failed?
To find the init script logs, complete the following tasks:
  1. Locate the
    ccs-operation.log
    file in the following directory on the Secure Agent machine:
    <Secure Agent installation directory>/apps/At_Scale_Server/<version>/ccs_home/
  2. In the
    ccs-operation.log
    file, find a message that is similar to the following message:
    Failed to run the init script for cluster [<cluster instance ID>] on the following nodes: [<cluster node IDs>]. Review the log in the following S3 file path: [<cloud platform location>].
  3. Navigate to the cloud platform location that is provided in the message.
  4. Match the cluster node IDs to the init script log file names for the nodes where the init script failed.
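The steps above can be sketched in a few lines. The message format comes from step 2; the cluster and node IDs, the S3 path, and the per-node log naming are hypothetical placeholders for illustration:

```python
import re

# A failure message in the format shown in step 2 (IDs are hypothetical).
message = (
    "Failed to run the init script for cluster [cluster-01] on the following "
    "nodes: [node-a, node-c]. Review the log in the following S3 file path: "
    "[s3://my-bucket/init-logs/]."
)

node_ids = re.search(r"nodes: \[([^\]]+)\]", message).group(1).split(", ")
log_path = re.search(r"path: \[([^\]]+)\]", message).group(1)

# Step 4: match each failed node ID to an init script log file name,
# assuming here that each log is named after its node ID.
for node_id in node_ids:
    print(f"{log_path}{node_id}.log")
```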
How are the resource requirements calculated in the following error message for an advanced cluster?
2019-04-26T19:04:11.762+00:00 <Thread-16> SEVERE: java.lang.RuntimeException: [java.lang.RuntimeException: The Cluster Computing System rejected the Spark task [InfaSpark0] due to the following error: [[CCS_10252] Cluster [6bjwune8v4bkt3vneokii9.k8s.local] doesn't have enough resources to run the application [spark--infaspark0e6674748-b038-4e39-a2a9-3fd49e63f289infaspark0-driver] which requires a minimum resource of [(KB memory, mCPU)]. The cluster must have enough nodes, and each node must have at least [(KB memory, mCPU)] to run this job.].]
The first resource requirement is the total number of resources that are required by the Spark driver and the Spark executor.
The second resource requirement is calculated based on the minimum resource requirements on each worker node to run a minimum of one Spark process.
The resources are calculated using the following formulas:
Memory: MAX(driver_memory, executor_memory)
CPU: MAX(driver_CPU, executor_CPU)
The Spark process can be either a Spark driver process or a Spark executor process. The cluster must have two nodes where each node fulfills the minimum requirements to run either the driver or the executor, or the cluster must have one node with enough resources to run both the driver and the executor.
The resource requirements for the driver and executor depend on how you configure the following advanced session properties in the mapping task:

    spark.driver.memory

    spark.executor.memory

    spark.executor.cores
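
As a worked example of the two requirements, with hypothetical numbers: the first requirement is read here as the sum of the driver and executor resources, and the second follows the MAX formulas above. Units are MB of memory and mCPU, as in the error message.

```python
# Hypothetical resource settings for the Spark driver and executor,
# derived from the advanced session properties listed above.
driver_memory, driver_cpu = 2048, 1000      # MB, mCPU
executor_memory, executor_cpu = 4096, 2000  # MB, mCPU

# First requirement: total resources for the driver plus the executor.
total_memory = driver_memory + executor_memory   # 6144 MB
total_cpu = driver_cpu + executor_cpu            # 3000 mCPU

# Second requirement: each worker node must fit at least one Spark
# process, so it needs the larger of the driver and executor needs.
node_memory = max(driver_memory, executor_memory)  # 4096 MB
node_cpu = max(driver_cpu, executor_cpu)           # 2000 mCPU

print(f"Cluster total: ({total_memory} MB memory, {total_cpu} mCPU)")
print(f"Per-node minimum: ({node_memory} MB memory, {node_cpu} mCPU)")
```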

For more information about minimum resource requirements, see the Administrator help.
I shut down the Secure Agent machine on my cloud platform, but some jobs are still running.
When you shut down the agent machine, the agent starts on a new machine, but jobs do not carry over to the new machine.
In Monitor, cancel the jobs and run them again. The agent on the new machine will start processing the jobs.
To avoid this issue, see the instructions to shut down the agent machine in the Administrator help.
