Bug description
We have discovered an issue in the way we configure the Pyxis Slurm plugin in ParallelCluster that can lead to job submission failures. When the issue occurs, the cluster enters an invalid state and every subsequent job fails to run, including jobs that do not use the Pyxis plugin.
If your cluster is affected by this issue, job submissions fail with the following error in the job output:
[ec2-user@ip-27-6-21-47 ~]$ cat slurm-1.out
srun: error: spank: Failed to open /opt/slurm/etc/plugstack.conf.d/sed6Yj8Ga: Permission denied
srun: error: Plug-in initialization failed
When the issue occurs, the cluster cannot recover automatically, and all subsequent jobs fail to run. Jobs that are already running are not affected.
The issue is caused by a race condition during the compute node bootstrap process: multiple processes write temporary files into the shared Slurm configuration directory, and the presence of these temporary files causes Slurm to fail when loading the SPANK plugins. If the temporary files are not removed, the cluster remains inoperable.
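The error message above points at a leftover temporary file (sed6Yj8Ga) in the plugstack.conf.d directory. As a quick, non-authoritative way to check whether a cluster is in this state, you can look for such stray files on the head node; the sed* pattern below is an assumption based on the file name reported in the error output:

# Run on the cluster head node. The directory path comes from the error
# message above; the "sed*" pattern is an assumption based on the stray
# file name it reports (sed6Yj8Ga).
ls -l /opt/slurm/etc/plugstack.conf.d/
find /opt/slurm/etc/plugstack.conf.d/ -maxdepth 1 -name 'sed*' -type f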
Affected versions (OSes, schedulers)
ParallelCluster 3.11.0
Mitigation
You can find a detailed explanation of the problem and its mitigation here.
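The linked page is the authoritative source for the mitigation. Purely as an illustrative sketch of the cleanup it implies (removing the stray temporary files from the shared Slurm configuration directory), and assuming the sed* pattern shown above, something like the following could be run on the head node; this is not the official procedure:

# Illustrative sketch only, not the official mitigation: remove leftover
# temporary files so that Slurm can load the SPANK plugin configuration again.
# Assumes the stray files match the sed* pattern seen in the error output.
sudo find /opt/slurm/etc/plugstack.conf.d/ -maxdepth 1 -name 'sed*' -type f -delete

Whether additional steps (for example, restarting Slurm daemons or changing how the Pyxis configuration is generated) are required is covered in the linked mitigation.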