
Upgrade Slurm in an AWS ParallelCluster cluster

An AWS ParallelCluster release comes with a set of AMIs for the supported operating systems and EC2 platforms. Each AMI contains a software stack, including the Slurm package, that was validated at ParallelCluster release time. If you wish to upgrade Slurm on your cluster, you can follow this guide.

WARNING: due to the tight integration between ParallelCluster and Slurm, you must keep Slurm within the same major version that was shipped with the ParallelCluster release used to create the cluster (see the ParallelCluster public documentation for the Slurm versions used by each ParallelCluster release). For example, a cluster running ParallelCluster 3.7.0 must use a 23.02.x version of Slurm.
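
For instance, you can cross-check the two versions before upgrading (the pcluster command assumes the ParallelCluster CLI is installed on your client machine; the exact output format may vary slightly across CLI versions):

$ pcluster version           # on your client machine
{
  "version": "3.7.0"
}
$ sinfo -V                   # on the head node
slurm 23.02.4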

Step 1. Upgrading the Slurm package in the head node

If you wish to upgrade Slurm on the head node of your cluster, you cannot rely on changing the AMI, because in ParallelCluster the head node AMI cannot be replaced with a pcluster update-cluster operation. Instead, follow these steps (here we are installing version 23.02.5).

  1. Stop the compute fleet on the cluster via a pcluster update-compute-fleet -n <cluster_name> --status STOP_REQUESTED operation, and wait for the compute fleet to be stopped.
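
    To wait for the fleet to stop, you can poll its status, e.g. with a loop like the following (a sketch assuming the ParallelCluster CLI and the jq utility are available on your client machine):

$ until [ "$(pcluster describe-compute-fleet -n <cluster_name> | jq -r .status)" = "STOPPED" ]; do sleep 10; done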
  2. Verify installed version of Slurm:
$ sinfo -V
slurm 23.02.4
  3. Verify which Slurm daemons are active on the head node and stop them (as root):
$ systemctl stop slurmrestd   # Only if present
$ systemctl stop slurmctld 
$ systemctl stop slurmdbd     # Only if present
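
    To confirm the daemons are actually stopped, you can query the unit states (units that are not installed may report inactive or an error, depending on the systemd version):

$ systemctl is-active slurmctld slurmdbd slurmrestd
inactive
inactive
inactive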
  4. Back up the existing installation of Slurm.

    WARNING: the following command includes the etc folder under the Slurm installation folder. Keep this in mind when re-extracting files from this tar onto /opt/slurm/. For instance, you can avoid overwriting /opt/slurm/etc by using the --exclude flag of the tar utility, as shown below.

$ tar czf slurm_backup_"$(date +%Y%m%d-%H%M%S)".tar.gz -C /opt slurm
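
    Should you later need to roll back, one possible way to restore this backup while preserving the current configuration in /opt/slurm/etc is the following (a sketch; replace <timestamp> with the actual archive name; the pattern keeps tar from re-extracting anything under slurm/etc):

$ tar xzf slurm_backup_<timestamp>.tar.gz -C /opt --exclude='slurm/etc*'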
  5. As root, recompile Slurm by executing the following script on the head node (the uninstall and rebuild of Slurm will not touch the etc folder under the installation folder, so the existing Slurm configuration is preserved).

    WARNING: if you do not wish to recompile Slurm on the head node, you can create a new instance from the EC2 console using the launch more like this functionality and launch the compilation there with the script provided below. You can later copy the /opt/slurm/ folder (excluding the etc subdirectory) back to the head node.

#!/bin/bash

set -e

# Set desired version of Slurm to be installed on the cluster
SLURM_VERSION_NEW=slurm-23-02-5-1 

# activate python virtual env
source /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/activate 

# go to the Chef local cache folder
cd /etc/chef/local-mode-cache/cache/ 

# download new version of slurm
wget https://github.com/SchedMD/slurm/archive/${SLURM_VERSION_NEW}.tar.gz

# uninstall current version
cd slurm-slurm-* 
make uninstall 
cd - 
rm -rf slurm-slurm-* 

# unpack new version
tar xf ${SLURM_VERSION_NEW}.tar.gz 
cd slurm-${SLURM_VERSION_NEW}

# compile and install the new version
./configure --prefix=/opt/slurm --with-pmix=/opt/pmix --with-jwt=/opt/libjwt --enable-slurmrestd
CORES=$(grep processor /proc/cpuinfo | wc -l)
make -j $CORES
make install 
make install-contrib 

# deactivate python virtual env
deactivate 
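
    If you compiled on a separate instance as described in the warning above, one possible way to copy the new installation back to the head node while excluding the etc subdirectory is rsync over SSH (a sketch; adjust user, host, and permissions to your environment; the leading slash anchors the exclusion at the top of /opt/slurm/):

$ rsync -a --exclude '/etc/' /opt/slurm/ <head_node_user>@<head_node_ip>:/opt/slurm/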
  6. Restart the Slurm daemons:
$ systemctl start slurmdbd   # Only if present
$ systemctl start slurmctld 
$ systemctl start slurmrestd # Only if present
  7. Verify installed version of Slurm:
$ sinfo -V
slurm 23.02.5
  8. Verify that all the required daemons are running with the new version of Slurm:
$ sudo grep "slurmctld version" /var/log/slurmctld.log | tail -n 1
[<timestamp>] slurmctld version 23.02.5 started on cluster <cluster-name>
$ sudo grep "slurmdbd version" /var/log/slurmdbd.log | tail -n 1
[<timestamp>] slurmdbd version 23.02.5 started
  9. Restart the compute fleet via a pcluster update-compute-fleet -n <cluster_name> --status START_REQUESTED operation.

Step 2. Upgrading the Slurm package in the compute nodes

The upgraded Slurm package is automatically available on the compute nodes through the /opt/slurm shared folder, so no action is required on them.
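
As an optional check once the fleet is running again, you can confirm that a compute node sees the new binaries through the shared mount, e.g. by running sinfo in a one-off job (assuming at least one compute node can start):

$ srun -N 1 sinfo -V
slurm 23.02.5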
