SLURM gpus-per-task issue #73

Open

itzsimpl opened this issue Feb 26, 2022 · 5 comments

@itzsimpl

I am having a strange issue running Slurm 21.08.5 in combination with Pyxis v0.11.1 and Enroot v3.2.0. When --gpus-per-task is specified together with a container, only one process receives a GPU and all the others receive none. For example, without a container

$ srun --nodes=1 --tasks=2 --gpus-per-task=1  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
PROC_ID=0 GPU 0: NVIDIA RTX A5000 (UUID: GPU-80b6f32b-92a5-8495-5438-993f0d99d14b)
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)

with a container

$ srun --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvcr.io#nvidia/pytorch:22.02-py3  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
pyxis: importing docker image ...
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)

The slurmd logs show no differences between the two cases. There is a lengthier discussion open on deepops#1102, where I initially posted the issue; I am reposting it here since it seems to be related to Pyxis+Enroot. Any help will be appreciated.

@flx42
Member

flx42 commented Feb 26, 2022 via email

@flx42
Member

flx42 commented Feb 28, 2022

I was able to take a quick look, and it's a bit more complicated than I expected. I'm not sure yet how I can fix this use case.

@flx42
Member

flx42 commented Mar 1, 2022

Okay, so as a workaround you should set ENROOT_RESTRICT_DEV=n, either in /etc/enroot/enroot.conf or on the srun command line if that's tractable:

$ srun -l --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi -L
0: pyxis: importing docker image: nvidia/cuda:11.6.0-base-ubuntu20.04
1: No devices found.
0: GPU 0: NVIDIA TITAN V (UUID: GPU-1619c7d3-8546-ec93-2fcd-d5898925f0df)

$ ENROOT_RESTRICT_DEV=n srun -l --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi -L
1: pyxis: importing docker image: nvidia/cuda:11.6.0-base-ubuntu20.04
0: GPU 0: NVIDIA TITAN V (UUID: GPU-1619c7d3-8546-ec93-2fcd-d5898925f0df)
1: GPU 0: NVIDIA GeForce GTX 1080 (UUID: GPU-67d704a1-202d-4f62-f5f1-81186ba21e04)
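For the config-file route, a minimal sketch of what the corresponding /etc/enroot/enroot.conf entry might look like, assuming the stock whitespace-separated KEY VALUE layout of that file (check your local copy before editing):

# /etc/enroot/enroot.conf (sketch)
# Make all /dev/nvidia* device nodes visible inside the container.
ENROOT_RESTRICT_DEV n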

I need to think a bit more to see if there is a way to detect/fix this.

Technical details

This use case is interesting as it is different from what we usually do. Our workloads launch multiple tasks per node, but all the tasks have access to all the GPUs of the job (when using a node-exclusive allocation or GRES). This is a prerequisite for doing CUDA IPC between processes from different tasks (e.g. task 0 will use GPU 0, but GPU 1 needs to be accessible in order to do P2P transfers between GPU 0 and GPU 1).

Pyxis goes one step further by putting all tasks on a node inside the same "container" (same mount namespace, user namespace, cgroup namespace), which allows other kinds of IPC applications to work correctly. One of the tasks will "win" the race and be the one creating the container namespaces for all the other tasks. But with ENROOT_RESTRICT_DEV=y this shared mount namespace will only have a single device visible, the one assigned to that first task, e.g. /dev/nvidia0. The other tasks should instead have /dev/nvidia1, /dev/nvidia2, ..., and they don't because of the single mount namespace. That's why ENROOT_RESTRICT_DEV=n fixes the issue: it mounts /dev/nvidia{0,1,2,...} for all tasks.
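To make the effect concrete, a hedged example of how one could inspect which device nodes each task sees inside that shared mount namespace (the image is the one from the commands above; output will vary with your hardware):

$ srun -l --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvidia/cuda:11.6.0-base-ubuntu20.04 bash -c 'echo "PROC_ID=$SLURM_PROCID $(ls /dev/nvidia[0-9]* 2>/dev/null)"'

With ENROOT_RESTRICT_DEV=y both tasks should report the same single device node (whichever one the winning task mounted), while with ENROOT_RESTRICT_DEV=n both should report /dev/nvidia0 and /dev/nvidia1.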

@itzsimpl
Author

itzsimpl commented Mar 1, 2022

Thank you very much for taking the time. I can confirm that adding ENROOT_RESTRICT_DEV=n fixes the issue. If this helps at all, the issue was not present in deepops:20.10 (i.e. Slurm 20.02 and Pyxis 0.8.1). Unfortunately I can't test whether in that case all tasks had access to all GPUs or each to just one.

From the standpoint of the --gpus-per-task and --gpu-bind:per_thread parameters, I would assume that the result achieved with ENROOT_RESTRICT_DEV=n is how they are supposed to work; these types of jobs are certainly legitimate. However, reading the technical details also made me start wondering whether we are using a non-optimal configuration. We often run PyTorch jobs with DDP (e.g. NVIDIA NeMo), and for simplicity of job specification our .sbatch scripts had been using the --gpus-per-task parameter (i.e. only --tasks has to be updated in order to increase/decrease the job size). We started to investigate this issue for precisely this reason, as with the update of Slurm and Pyxis the jobs suddenly started failing. Digging into it revealed that the individual tasks weren't receiving GPUs.

With your explanation about P2P transfers I am starting to wonder whether we were using a non-optimal configuration all along, and that what we should have been using instead is --gres, i.e. giving all tasks access to all of the requested GPUs. In fact this is our current workaround, and it works independently of the value of ENROOT_RESTRICT_DEV. PyTorch DDP seems to work fine (or perhaps even better) with it as well. I would be very interested in hearing your thoughts on the subject.
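A minimal sketch of what such a --gres-based batch script might look like, assuming a single node with 2 GPUs and one task per GPU; the container image is the one from the original report, and the echo command is just a placeholder for the real training step:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2    # one task per GPU; change this to resize the job
#SBATCH --gres=gpu:2           # every task on the node gets access to both GPUs
srun --container-image nvcr.io#nvidia/pytorch:22.02-py3 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'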

@flx42
Member

flx42 commented Mar 1, 2022

It might impact performance, but that will depend on the hardware you are currently running on.
See this comment on the NCCL project: NVIDIA/nccl#324 (comment); it mentions that CUDA IPC, and thus NVLink, cannot be used if processes see different GPUs.
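As a quick way to check whether this matters on a given node, nvidia-smi can print the GPU interconnect topology; entries reported as NV# indicate NVLink connections, which per the comment above can only be exploited when the processes see both GPUs:

$ nvidia-smi topo -m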
