(3.9.0-3.10.1) Cluster update intermittently fails because some compute nodes don’t execute update procedure #6412

Open
hanwen-pcluste opened this issue Aug 23, 2024 · 0 comments

The ParallelCluster team uses this template to report known issues on GitHub. If you are reporting an issue, please use the 'Bug report' template instead.

Bug description

A cluster update intermittently fails and the stack rolls back, leaving the cluster in the UPDATE_FAILED status:

$ pcluster list-clusters
{
  "clusters": [
    {
      "clusterName": "test",
      "cloudformationStackStatus": "UPDATE_ROLLBACK_COMPLETE",
      "clusterStatus": "UPDATE_FAILED",
      ...
    }
  ]
}

On the head node, /var/log/chef-client.log shows the cluster readiness check failing because some compute nodes still report a previous cluster configuration version:

    ================================================================================
    Error executing action `run` on resource 'execute[Check cluster readiness]'
    ================================================================================
    
    Mixlib::ShellOut::ShellCommandFailed
    ------------------------------------
    Expected process to exit with [0], but received '1'
    ---- Begin output of /opt/parallelcluster/pyenv/versions/3.9.19/envs/cookbook_virtualenv/bin/python /opt/parallelcluster/scripts/head_node_checks/check_cluster_ready.py --cluster-name demo-cluster --table-name parallelcluster-demo-cluster --config-version 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy --region us-west-2 ----
    STDOUT: 
    STDERR: INFO:__main__:Checking cluster readiness with arguments: cluster_name=demo-cluster, table_name=parallelcluster-demo-cluster, config_version=78dPxxb06z0XXX00hMwGxxxfzwxxPlYy, region=us-west-2
    INFO:__main__:Checking that cluster configuration deployed on cluster nodes for cluster demo-cluster is 78dPxxb06z0XXX00hMwGxxxfzwxxPlYy
    INFO:botocore.credentials:Found credentials from IAM Role: demo-cluster-RoleHeadNode-xxxx
    INFO:__main__:Found batch of 4 cluster node(s): ['i-xxxxxxxxxxxxxxxxx', 'i-yyyyyyyyyyyyyyyyy', 'i-aaaaaaaaaaaaaaaaa', 'i-bbbbbbbbbbbbbbbbb']
    INFO:__main__:Retrieved 4 DDB item(s):
        {'Id': {'S': 'CLUSTER_CONFIG.i-xxxxxxxxxxxxxxxxx'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:50 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-yyyyyyyyyyyyyyyyy'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:37 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-aaaaaaaaaaaaaaaaa'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '78dPxxb06z0XXX00hMwGxxxfzwxxPlYy'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-02 22:27:38 UTC'}}}}
        {'Id': {'S': 'CLUSTER_CONFIG.i-bbbbbbbbbbbbbbbbb'}, 'Data': {'M': {'node_type': {'S': 'ComputeFleet'}, 'cluster_config_version': {'S': '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'}, 'status': {'S': 'DEPLOYED'}, 'lastUpdateTime': {'S': '2024-08-01 16:58:33 UTC'}}}}
    ERROR:__main__:Some cluster readiness checks failed: Check failed due to the following erroneous records:
      * missing records (0): []
      * incomplete records (0): []
      * wrong records (2): [('i-yyyyyyyyyyyyyyyyy', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft'), ('i-bbbbbbbbbbbbbbbbb', '1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft')]
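
The three categories in the log above (missing, incomplete, wrong) can be illustrated with a minimal Python sketch. This is not the code of check_cluster_ready.py; the function name classify_records and its arguments are hypothetical, and the only thing taken from the log is the shape of the DDB items (an Id of the form CLUSTER_CONFIG.<instance-id> and a cluster_config_version string under Data).

```python
def classify_records(instance_ids, items, expected_version):
    """Sort nodes into (missing, incomplete, wrong) lists.

    Hypothetical helper for illustration only; the item layout is assumed
    from the log output, not from the ParallelCluster source.
    """
    # Index the DDB items by instance id ("CLUSTER_CONFIG.i-..." -> "i-...").
    by_id = {}
    for item in items:
        instance_id = item["Id"]["S"].partition(".")[2]
        by_id[instance_id] = item.get("Data", {}).get("M", {})

    missing, incomplete, wrong = [], [], []
    for instance_id in instance_ids:
        data = by_id.get(instance_id)
        if data is None:
            # No record at all for this node.
            missing.append(instance_id)
            continue
        version = data.get("cluster_config_version", {}).get("S")
        if version is None:
            # Record exists but carries no config version.
            incomplete.append(instance_id)
        elif version != expected_version:
            # Node reports a different (for example, older) config version.
            wrong.append((instance_id, version))
    return missing, incomplete, wrong


if __name__ == "__main__":
    expected = "78dPxxb06z0XXX00hMwGxxxfzwxxPlYy"
    sample_items = [
        {"Id": {"S": "CLUSTER_CONFIG.i-xxxxxxxxxxxxxxxxx"},
         "Data": {"M": {"cluster_config_version": {"S": expected}}}},
        {"Id": {"S": "CLUSTER_CONFIG.i-yyyyyyyyyyyyyyyyy"},
         "Data": {"M": {"cluster_config_version":
                        {"S": "1_6PjwxxxvWZNZtxxBGxxxRQkVdTGqft"}}}},
    ]
    nodes = ["i-xxxxxxxxxxxxxxxxx", "i-yyyyyyyyyyyyyyyyy"]
    print(classify_records(nodes, sample_items, expected))
```

In the failure reported here, two of the four nodes fall into the "wrong" list: their records still hold the earlier configuration version, with a lastUpdateTime from the day before the update.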

Affected versions (OSes, schedulers)

  • ParallelCluster 3.9.0-3.10.1
  • Slurm scheduler
  • All operating systems

Mitigation

See the details in the Wiki.
