(3.11.0) Job submission failure caused by race condition in Pyxis configuration #6459

gmarciani · 2024-10-11T11:23:06Z

Bug description

We have discovered an issue in the way ParallelCluster configures the Pyxis Slurm plugin that can lead to job submission failures. When this issue occurs, the cluster enters an invalid state and every subsequent job fails to run, including jobs that do not require the Pyxis plugin.

If your cluster is affected by this issue, jobs will fail with the following error in their output:

[ec2-user@ip-27-6-21-47 ~]$ cat slurm-1.out
srun: error: spank: Failed to open /opt/slurm/etc/plugstack.conf.d/sed6Yj8Ga: Permission denied
srun: error: Plug-in initialization failed

When the issue occurs, the cluster cannot recover automatically, and all subsequent jobs will fail to run. Jobs that are already running are not affected.

The issue is caused by a race condition during the compute node bootstrap process, in which multiple processes write temporary files into the shared Slurm configuration directory. The presence of these temporary files causes Slurm to fail when loading the SPANK plugins. If the temporary files are not removed, the cluster becomes inoperable.
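
To check whether a cluster is in this state, one option is to look for leftover temporary files in the plugstack directory. The following is a minimal Python sketch, not the official mitigation. It assumes the stray files follow the naming seen in the error above (`sed` followed by six alphanumeric characters, as in `sed6Yj8Ga`), which is inferred from that single message. The function name `find_stray_sed_temp_files` is hypothetical, and the directory path is taken from the error output.

```python
import re
from pathlib import Path

# Assumption: stray files are named "sed" + 6 alphanumeric characters,
# inferred from "sed6Yj8Ga" in the srun error message.
SED_TEMP_NAME = re.compile(r"^sed[A-Za-z0-9]{6}$")


def find_stray_sed_temp_files(plugstack_dir):
    """Return a sorted list of entries in plugstack_dir that look like sed temp files."""
    directory = Path(plugstack_dir)
    if not directory.is_dir():
        # Nothing to report if the directory does not exist on this node.
        return []
    return sorted(
        entry for entry in directory.iterdir()
        if SED_TEMP_NAME.match(entry.name)
    )


if __name__ == "__main__":
    # Path taken from the error output above; adjust if Slurm is installed elsewhere.
    for path in find_stray_sed_temp_files("/opt/slurm/etc/plugstack.conf.d"):
        print(path)
```

The sketch only lists candidate files and does not delete anything, because the name pattern is a guess and removing files from a shared configuration directory should be a deliberate step.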

Affected versions (OSes, schedulers)

ParallelCluster 3.11.0

Mitigation

You can find a detailed explanation and the mitigation of the problem here.

gmarciani · 2024-10-21T17:21:51Z

Hi, this issue has been fixed in ParallelCluster 3.11.1.

joehellmersNOAA · 2024-10-22T16:46:05Z

@gmarciani According to the release notes, "Pyxis is now disabled by default, so it must be manually enabled as documented in the product documentation". Where is this documented? I don't see it in the v3 UG: https://docs.aws.amazon.com/parallelcluster/latest/ug/what-is-aws-parallelcluster.html

hanwen-pcluste added the known issue label Oct 18, 2024

gmarciani added the pending release label Oct 21, 2024

gmarciani closed this as completed Oct 21, 2024
