Table of Contents

  1. Preface
  2. Introducing Administrator
  3. Organizations
  4. Licenses
  5. Ecosystem single sign-on
  6. SAML single sign-on
  7. Metering
  8. Source control and service upgrade settings
  9. Users and user groups
  10. User roles
  11. Permissions
  12. Runtime environments
  13. Serverless runtime environments
  14. Secure Agent services
  15. Secure Agent installation
  16. Schedules
  17. Bundle management
  18. Event monitoring
  19. File transfer
  20. Troubleshooting

Administrator

Troubleshooting an elastic cluster on AWS

Why did the elastic cluster fail to start?
To find out why the elastic cluster failed to start, use the ccs-operation.log file in the following directory on the Secure Agent machine:
<Secure Agent installation directory>/apps/At_Scale_Server/<version>/ccs_home/
The following table lists some reasons why a cluster might fail to start:

Reason: kops failed to update the cluster.
Possible cause: The VPC limit was reached on your AWS account.

Reason: The master node failed to start.
Possible cause: The master instance type isn't supported in the specified region or availability zone, or in your AWS account.

Reason: All worker nodes failed to start.
Possible cause: The worker instance type isn't supported in the specified region or availability zone, or in your AWS account.

Reason: The Kubernetes API server failed to start.
Possible cause: The user-defined master role encountered an error.
When a cluster fails to start for at least one of these reasons, the ccs-operation.log file displays a BadClusterConfigException.
For example, you might see the following error:
2019-06-27 00:50:02.012 [T:000060] SEVERE : [CCS_10500] [Operation of <cluster instance ID>: start_cluster-<cluster instance ID>]: com.informatica.cloud.service.ccs.exception.BadClusterConfigException: [[CCS_10207] The cluster configuration for cluster [<cluster instance ID>] is incorrect due to the following error: [No [Master] node has been created on the cluster. Verify that the instance type is supported.]. The Cluster Computing System will stop the cluster soon.]
If the cluster encounters a BadClusterConfigException, the agent immediately stops the cluster to avoid incurring additional resource costs and to avoid potential resource leaks. The agent does not attempt to recover the cluster until the configuration error is resolved.
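As a quick check, you can filter the log for this exception. The snippet below builds a small sample log (the entries are illustrative, not real output) and applies the same filter you would use against the real ccs-operation.log on the agent machine:

```shell
# Build a sample log with one SEVERE entry (illustrative content only).
log=$(mktemp)
printf '%s\n' \
  '2019-06-27 00:50:01.000 [T:000059] INFO : [CCS_10400] Starting to run command set' \
  '2019-06-27 00:50:02.012 [T:000060] SEVERE : [CCS_10500] ...BadClusterConfigException: [[CCS_10207] ...]' \
  > "$log"

# On the Secure Agent machine, point this at the real ccs-operation.log instead.
grep 'BadClusterConfigException' "$log"
```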
I looked at the ccs-operation.log file to troubleshoot the elastic cluster, but there wasn't enough information. Where else can I look?
You can look at the cluster-operation logs that are dedicated to the instance of the elastic cluster. When an external command set begins running, the ccs-operation log displays the path to the cluster-operation logs.
For example:
2020-06-15 21:22:36.094 [reqid:] [T:000057] INFO : c.i.c.s.c.ClusterComputingService [CCS_10400] Starting to run command set [<command set>] which contains the following commands: [ <commands> ; ]. The execution log can be found in the following location: [/data2/home/cldagnt/SystemAgent/apps/At_Scale_Server/35.0.1.1/ccs_home/3xukm9iqp5zeahyrb7rqoz.k8s.local/infa/cluster-operation.log].
The specified folder contains all cluster-operation logs that belong to the instance of the cluster. You can use the logs to view the full stdout and stderr output streams of the command set.
The number in the log name indicates the log's generation, and each cluster-operation log is at most 10 MB. For example, if the cluster instance generated 38 MB of log messages while running external commands, the folder contains four cluster-operation logs. The latest log has 0 in the file name and the oldest log has 3 in the file name. To view the latest errors, check the cluster-operation0.log file.
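The rotation scheme can be sketched with a small simulation, assuming the cluster-operation<N>.log naming described above (the directory and file contents below are stand-ins):

```shell
# Simulate a cluster log folder after 38 MB of messages: four rotated logs.
dir=$(mktemp -d)
for n in 0 1 2 3; do
  echo "generation $n" > "$dir/cluster-operation$n.log"
done

# cluster-operation0.log is the newest generation, cluster-operation3.log the oldest.
newest=$(ls "$dir" | sort | head -n 1)
echo "read latest errors from: $newest"
```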
If you set the log level for the Elastic Server to DEBUG, the ccs-operation log shows the same level of detail as the cluster-operation logs.
I ran a job to start the elastic cluster, but the VPC limit was reached.
When you do not specify a VPC in the elastic configuration for a cluster, the Secure Agent creates a new VPC on your AWS account. Because the number of VPCs on your AWS account is limited for each region, you might reach the VPC limit.
If you reach the VPC limit, edit the elastic configuration and perform one of the following tasks:
  • Provide a different region.
  • Remove the availability zones. Then, provide an existing VPC and specific subnets within the VPC for the cluster to use.
Any cloud resources that were provisioned for the cluster will be reused when the cluster starts in the new region or the existing VPC. For example, the Secure Agent might have provisioned Amazon EBS volumes before it received an error for the VPC limit. The EBS volumes are not deleted, but they are reused during the next startup attempt.
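To see how close the account is to the limit, you can count the VPCs in a region. The snippet parses a sample of the JSON shape that `aws ec2 describe-vpcs` returns (the VPC IDs are made up); the commented command shows the equivalent call on a machine with configured AWS credentials:

```shell
# Sample of the output shape of `aws ec2 describe-vpcs` (IDs are illustrative).
json='{"Vpcs": [
  {"VpcId": "vpc-0a111111111111111"},
  {"VpcId": "vpc-0b222222222222222"},
  {"VpcId": "vpc-0c333333333333333"}
]}'

# Equivalent on a configured machine:
#   aws ec2 describe-vpcs --region <region> --query 'length(Vpcs[])' --output text
count=$(printf '%s\n' "$json" | grep -c '"VpcId"')
echo "VPCs in region: $count (the default AWS quota is 5 VPCs per region)"
```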
I ran a job to start the elastic cluster, but the cluster failed to be created with the following error:
Failed to create cluster [<cluster instance ID>] due to the following error: [[CCS_10302] Failed to invoke AWS SDK API due to the following error: [Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: <request ID>; S3 Extended Request ID: <S3 extended request ID>)].].]
The Secure Agent failed to create the elastic cluster because Amazon S3 rejected the agent's request. Make sure that the S3 bucket policies do not require clients to send requests that contain an encryption header.
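A bucket policy causes this 403 when it denies requests that lack a server-side-encryption header. The fragment below shows the kind of condition to look for (illustrative; on a configured machine you would fetch the real policy with `aws s3api get-bucket-policy --bucket <bucket>`):

```shell
# Illustrative policy fragment that rejects requests without an encryption header.
policy='{"Effect": "Deny",
         "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}}}'

# Flag the condition key that forces clients to send the encryption header.
if printf '%s\n' "$policy" | grep -q 's3:x-amz-server-side-encryption'; then
  finding="policy requires an encryption header: remove or relax this statement"
else
  finding="no encryption-header condition found"
fi
echo "$finding"
```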
How do I troubleshoot a Kubernetes API Server that failed to start?
If the Kubernetes API server fails to start, the elastic cluster fails to start. To troubleshoot the failure, use the Kubernetes API server logs on the master node instead of the agent logs.
To find the Kubernetes API Server logs, complete the following tasks:
  1. Connect to the master node from the Secure Agent machine.
  2. On the master node, locate the Kubernetes API server log files in the /var/log/ directory.
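Assuming SSH access from the Secure Agent machine to the master node (the host, user, and key below are placeholders), the two steps look like the following; the last lines locally illustrate the filename filter:

```shell
# Placeholders for your environment.
MASTER_IP=10.0.0.12
SSH_KEY=~/.ssh/cluster_key

# On the agent machine you would run (commented out here):
#   ssh -i "$SSH_KEY" "ec2-user@$MASTER_IP" 'ls /var/log/ | grep -i apiserver'

# Local illustration of the filter against sample /var/log/ contents.
match=$(printf '%s\n' kube-apiserver.log syslog cloud-init.log | grep -i apiserver)
echo "API server log: $match"
```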
I updated the staging location for the elastic cluster. Now elastic mappings fail with the following error:
Error while executing mapping. ExecutionId '<execution ID>'. Cause: [Failed to start cluster for [01000D25000000000005]. Error reported while starting cluster [Cannot apply cluster operation START because the cluster is in an error state.].].
Mappings fail with this error when you change the permissions on the staging location before you change the S3 staging location in the elastic configuration.
If you plan to update the staging location, first change the S3 staging location in the elastic configuration, and then change the permissions on the staging location on AWS. If you use role-based security, you must also change the permissions on the staging location on the Secure Agent machine.
To fix the error, perform the following tasks:
  1. Revert the changes to the permissions for the staging location.
  2. Edit the elastic configuration to revert the S3 staging location.
  3. Stop the cluster when you save the configuration.
  4. Update the S3 staging location in the configuration, and then change the permissions to the staging location on AWS.
I updated the staging location for the elastic cluster. Now the following error message appears in the agent job log:
Could not find or load main class com.informatica.compiler.InfaSparkMain
The error message appears when cluster nodes cannot download the Spark binaries from the staging location because of insufficient access permissions.
Verify access permissions for the staging location based on the type of connectors that the job uses:
Connectors with direct access to Amazon data sources
If you use credential-based security for elastic jobs, make sure that the credentials in the Amazon S3 V2 and Amazon Redshift V2 connections can be used to access the staging location.
If you use role-based security for elastic jobs, make sure that the elastic cluster and the staging location exist under the same AWS account.
Connectors without direct access to Amazon data sources
If you use a user-defined worker role, make sure that the worker role can access both the staging location and the data sources in the elastic job.
If you use the default worker role, make sure that the Secure Agent role can access both the staging location and the data sources in the elastic job.
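The rules above can be summarized as a lookup from security mode to the identity whose staging access you should verify (a sketch; the mode names are shorthand for illustration, not literal product settings):

```shell
# Shorthand mode names for illustration; not literal configuration values.
security_mode="role-based"
case "$security_mode" in
  credential-based)         identity="Amazon S3 V2 / Amazon Redshift V2 connection credentials" ;;
  role-based)               identity="cluster AWS account (must match the staging location's account)" ;;
  user-defined-worker-role) identity="user-defined worker role" ;;
  default-worker-role)      identity="Secure Agent role" ;;
esac
echo "verify staging and source access for: $identity"
```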
What should I do if the status of the elastic cluster is Unknown?
When the cluster status is Unknown, first verify that the Secure Agent is running. If the agent is not running, enable the agent and check whether the cluster starts running.
If the cluster does not start running, an administrator can run the command to list clusters. If the command output returns the cluster state as partial or in-use, the administrator can run the command to delete the cluster.
For more information about the commands, see Data Integration Elastic Administration in the Administrator help.
I restarted the Secure Agent machine and now the status of the elastic cluster is Error.
Make sure that the Secure Agent machine and the Secure Agent are running. Then, stop the elastic cluster in Monitor. In an AWS environment, the cluster might take 3 to 4 minutes to stop. After the cluster stops, you can run an elastic job to start the cluster again.
How do I find the initialization script logs for the nodes where the init script failed?
To find the init script logs, complete the following tasks:
  1. Locate the ccs-operation.log file in the following directory on the Secure Agent machine:
    <Secure Agent installation directory>/apps/At_Scale_Server/<version>/ccs_home/
  2. In the ccs-operation.log file, find a message that is similar to the following message:
    Failed to run the init script for cluster [<cluster instance ID>] on the following nodes: [<cluster node IDs>]. Review the log in the following S3 file path: [<cloud platform location>].
  3. Navigate to the cloud platform location that is provided in the message.
  4. Match the cluster node IDs to the init script log file names for the nodes where the init script failed.
How are the resource requirements calculated in the following error message for an elastic cluster?
2019-04-26T19:04:11.762+00:00 <Thread-16> SEVERE: java.lang.RuntimeException: [java.lang.RuntimeException: The Cluster Computing System rejected the Spark task [InfaSpark0] due to the following error: [[CCS_10252] Cluster [6bjwune8v4bkt3vneokii9.k8s.local] doesn't have enough resources to run the application [spark--infaspark0e6674748-b038-4e39-a2a9-3fd49e63f289infaspark0-driver] which requires a minimum resource of [(KB memory, mCPU)]. The cluster must have enough nodes, and each node must have at least [(KB memory, mCPU)] to run this job.].]
The first resource requirement is the total number of resources that are required by the Spark driver and the Spark executor.
The second resource requirement is calculated based on the minimum resource requirements on each worker node to run a minimum of one Spark process.
The resources are calculated using the following formulas:
Memory: MAX(driver_memory, executor_memory)
CPU: MAX(driver_CPU, executor_CPU)
The Spark process can be either a Spark driver process or a Spark executor process. The cluster must have two nodes where each node fulfills the minimum requirements to run either the driver or the executor, or the cluster must have one node with enough resources to run both the driver and the executor.
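With illustrative driver and executor settings (not product defaults), the two requirements work out as follows:

```shell
# Illustrative Spark settings: memory in MB, CPU in mCPU. Not product defaults.
driver_mem=4096;   executor_mem=6144
driver_cpu=1000;   executor_cpu=2000

# First requirement: total resources for the driver plus the executor.
total_mem=$(( driver_mem + executor_mem ))
total_cpu=$(( driver_cpu + executor_cpu ))

# Second requirement: per-node minimum, the MAX of driver and executor needs.
node_mem=$(( driver_mem > executor_mem ? driver_mem : executor_mem ))
node_cpu=$(( driver_cpu > executor_cpu ? driver_cpu : executor_cpu ))

echo "total required: ${total_mem} MB memory, ${total_cpu} mCPU"
echo "per-node minimum: ${node_mem} MB memory, ${node_cpu} mCPU"
```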
The resource requirements for the driver and executor depend on how you configure the following advanced session properties in the mapping task:
  • spark.driver.memory
  • spark.executor.memory
  • spark.executor.cores
For more information about minimum resource requirements, see Data Integration Elastic Administration in the Administrator help.
Is there anything I should do before I use a custom AMI to create cluster nodes?
If you use a custom Amazon Machine Image (AMI) to create cluster nodes, make sure that the AMI contains an installation of the AWS CLI.
The Secure Agent uses the AWS CLI to propagate tags to Amazon resources and to aggregate logs. The cluster nodes also use the AWS CLI to run initialization scripts.
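A simple presence check, for example in an AMI build or validation script, might look like this:

```shell
# Record whether the AWS CLI is on the PATH; cluster startup depends on it.
if command -v aws >/dev/null 2>&1; then
  status="found at $(command -v aws)"
else
  status="missing - install the AWS CLI before using this AMI for cluster nodes"
fi
echo "aws cli: $status"
```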
For information about how to use a custom AMI, contact Informatica Global Customer Support.
My VPC has requirements to restrict internet traffic. Can I configure an elastic cluster to comply with these requirements?
By default, an elastic cluster uses an internet-facing load balancer to route traffic over the internet. To restrict internet traffic, you can configure the cluster to use an internal load balancer instead.
To use an internal load balancer, perform the following tasks:
  1. To enable the internal load balancer, contact Informatica Global Customer Support.
  2. Specify a VPC and subnets in the elastic configuration.
  3. Make sure that the subnets use a NAT gateway so that cluster dependencies can be downloaded from the internet.
For more information about internet-facing and internal load balancers, refer to the AWS documentation.


Updated November 30, 2020