(3.11.0) Job submission failure caused by race condition in Pyxis configuration #6459

Closed
gmarciani opened this issue Oct 11, 2024 · 2 comments

Comments

@gmarciani
Contributor

Bug description

We have discovered an issue in the way ParallelCluster configures the Pyxis Slurm plugin that can lead to job submission failures. When this issue occurs, the cluster enters an invalid state and every subsequent job fails to run, including jobs that do not require the Pyxis plugin.

If your cluster is affected by this issue, jobs will fail with the following error in their output:

[ec2-user@ip-27-6-21-47 ~]$ cat slurm-1.out
srun: error: spank: Failed to open /opt/slurm/etc/plugstack.conf.d/sed6Yj8Ga: Permission denied
srun: error: Plug-in initialization failed
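To check whether several jobs hit this failure, the error signature above can be searched for in the job output files. The following is a minimal sketch, not an official ParallelCluster tool: the function names and the `slurm-*.out` glob are assumptions chosen for illustration, and the regular expression is derived only from the error line quoted above.

```python
import re
from pathlib import Path

# Signature derived from the error line quoted in this issue.
SPANK_ERROR = re.compile(
    r"srun: error: spank: Failed to open (\S+): Permission denied"
)


def find_spank_failures(text):
    """Return the file paths that srun reported it could not open."""
    return SPANK_ERROR.findall(text)


def scan_job_outputs(directory, pattern="slurm-*.out"):
    """Map each job output file name to the paths named in its SPANK errors.

    The default glob pattern is an assumption based on the file name
    shown in the example (slurm-1.out).
    """
    hits = {}
    for path in sorted(Path(directory).glob(pattern)):
        found = find_spank_failures(path.read_text(errors="replace"))
        if found:
            hits[path.name] = found
    return hits
```

Calling `scan_job_outputs(".")` from the directory holding the job output files returns a dictionary that is empty when no file contains the error.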

The cluster cannot recover from this state automatically, so all subsequent jobs fail to run. Jobs that are already running are not affected.

The issue is caused by a race condition during the compute node bootstrap process, in which multiple processes write temporary files into the shared Slurm configuration directory. While such temporary files are present, Slurm fails to load the SPANK plugins. If the temporary files are never removed, the cluster is left inoperable.
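As a diagnostic aid, leftover temporary files in the plugin configuration directory can be listed with a few lines of Python. This is a sketch under stated assumptions and not the official mitigation: it assumes that legitimate files in the directory end in `.conf`, which this issue does not confirm, and the function name is hypothetical. The directory path in the usage note is taken from the error message above.

```python
from pathlib import Path


def find_stray_files(conf_dir, expected_suffix=".conf"):
    """Return the names of regular files in conf_dir lacking expected_suffix.

    ASSUMPTION: legitimate plugin configuration files end in ".conf".
    Anything else (for example a name like sed6Yj8Ga) is reported as a
    possible leftover temporary file.
    """
    return sorted(
        entry.name
        for entry in Path(conf_dir).iterdir()
        if entry.is_file() and entry.suffix != expected_suffix
    )
```

For example, `find_stray_files("/opt/slurm/etc/plugstack.conf.d")` returns an empty list when only `.conf` files are present. The function only reports names; whether and how to remove a file should follow the linked mitigation.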

Affected versions (OSes, schedulers)

  • ParallelCluster 3.11.0

Mitigation

You can find a detailed explanation and the mitigation of the problem here.

@gmarciani
Contributor Author

Hi,

This issue has been fixed in ParallelCluster 3.11.1.

@joehellmersNOAA

@gmarciani According to the release notes "Pyxis is now disabled by default, so it must be manually enabled as documented in the product documentation". Where is this documented? I don't see it in the v3 UG. https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html
