SLURM gpus-per-task issue #73

Open

itzsimpl opened this issue Feb 26, 2022 · 5 comments

@itzsimpl

I am having a strange issue running Slurm 21.08.5 in combination with Pyxis v0.11.1 and Enroot v3.2.0. When --gpus-per-task is specified together with a container, only one process receives a GPU and all the others receive none. For example, without a container

$ srun --nodes=1 --tasks=2 --gpus-per-task=1  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
PROC_ID=0 GPU 0: NVIDIA RTX A5000 (UUID: GPU-80b6f32b-92a5-8495-5438-993f0d99d14b)
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)

with a container

$ srun --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvcr.io#nvidia/pytorch:22.02-py3  bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'
pyxis: importing docker image ...
PROC_ID=0 No devices found.
PROC_ID=1 GPU 0: NVIDIA RTX A5000 (UUID: GPU-43f1fb2e-11d8-30c4-1b4a-70bd77fc7e54)

The slurmd logs show no differences between the two cases. There is a lengthier discussion open on deepops#1102, where I initially posted the issue; I am reposting it here since it seems to be related to Pyxis+Enroot. Any help will be appreciated.

@flx42
Member

flx42 commented Feb 26, 2022 via email

@flx42
Member

flx42 commented Feb 28, 2022

I was able to take a quick look, and it's a bit more complicated than I expected. I'm not sure yet how I can fix this use case.

@flx42
Member

flx42 commented Mar 1, 2022

Okay, so as a workaround you should set ENROOT_RESTRICT_DEV=n, either in /etc/enroot/enroot.conf or on the srun command line if that's tractable:

$ srun -l --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi -L
0: pyxis: importing docker image: nvidia/cuda:11.6.0-base-ubuntu20.04
1: No devices found.
0: GPU 0: NVIDIA TITAN V (UUID: GPU-1619c7d3-8546-ec93-2fcd-d5898925f0df)

$ ENROOT_RESTRICT_DEV=n srun -l --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvidia/cuda:11.6.0-base-ubuntu20.04 nvidia-smi -L
1: pyxis: importing docker image: nvidia/cuda:11.6.0-base-ubuntu20.04
0: GPU 0: NVIDIA TITAN V (UUID: GPU-1619c7d3-8546-ec93-2fcd-d5898925f0df)
1: GPU 0: NVIDIA GeForce GTX 1080 (UUID: GPU-67d704a1-202d-4f62-f5f1-81186ba21e04)
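For the config-file route, a minimal sketch of what the corresponding /etc/enroot/enroot.conf entry might look like, assuming the stock whitespace-separated KEY VALUE layout of that file (check your local copy before editing):

# /etc/enroot/enroot.conf (sketch)
# Make all /dev/nvidia* device nodes visible inside the container.
ENROOT_RESTRICT_DEV n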

I need to think a bit more to see if there is a way to detect/fix this.

Technical details

This use case is interesting as it is different from what we usually do. Our workloads launch multiple tasks per node, but all the tasks have access to all the GPUs of the job (when using a node-exclusive allocation or GRES). This is a prerequisite for doing CUDA IPC between processes from different tasks (e.g. task 0 will use GPU 0, but GPU 1 needs to be accessible in order to do P2P transfers between GPU 0 and GPU 1).

Pyxis goes one step further by putting all tasks on a node inside the same "container" (same mount namespace, user namespace, cgroup namespace), which allows other kinds of IPC applications to work correctly. One of the tasks will "win" the race and be the one creating the container namespaces for all the other tasks. But with ENROOT_RESTRICT_DEV=y this shared mount namespace will only have a single device visible, the one assigned to that first task, e.g. /dev/nvidia0. The other tasks should instead have /dev/nvidia1, /dev/nvidia2, ..., and they don't because of the single mount namespace. That's why ENROOT_RESTRICT_DEV=n fixes the issue: it mounts /dev/nvidia{0,1,2,...} for all tasks.
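To make the effect concrete, a hedged example of how one could inspect which device nodes each task sees inside that shared mount namespace (the image is the one from the commands above; output will vary with your hardware):

$ srun -l --nodes=1 --tasks=2 --gpus-per-task=1 --container-image nvidia/cuda:11.6.0-base-ubuntu20.04 bash -c 'echo "PROC_ID=$SLURM_PROCID $(ls /dev/nvidia[0-9]* 2>/dev/null)"'

With ENROOT_RESTRICT_DEV=y both tasks should report the same single device node (whichever one the winning task mounted), while with ENROOT_RESTRICT_DEV=n both should report /dev/nvidia0 and /dev/nvidia1.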

@itzsimpl
Author

itzsimpl commented Mar 1, 2022

Thank you very much for taking the time. I can confirm that adding ENROOT_RESTRICT_DEV=n fixes the issue. If this helps at all, the issue was not present in deepops:20.10 (i.e. Slurm 20.02 and Pyxis 0.8.1). Unfortunately I can't test whether in that case all tasks had access to all GPUs or each to just one.

From the standpoint of the --gpus-per-task and --gpu-bind:per_thread parameters, I would assume that the result achieved with ENROOT_RESTRICT_DEV=n is how they are supposed to work; these types of jobs are certainly legitimate. However, reading the technical details also made me start wondering whether we are using a non-optimal configuration. We often run PyTorch jobs with DDP (e.g. NVIDIA NeMo), and for simplicity of job specification our .sbatch scripts had been using the --gpus-per-task parameter (i.e. only --tasks has to be updated in order to increase/decrease the job size). We started to investigate this issue for precisely this reason, as with the update of Slurm and Pyxis the jobs suddenly started failing. Digging into it revealed that the individual tasks weren't receiving GPUs.

With your explanation about P2P transfers I am starting to wonder whether we were using a non-optimal configuration all along, and that what we should have been using instead is --gres, i.e. giving all tasks access to all of the requested GPUs. In fact this is our current workaround, and it works independently of the value of ENROOT_RESTRICT_DEV. PyTorch DDP seems to work fine (or perhaps even better) with it as well. I would be very interested in hearing your thoughts on the subject.
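A minimal sketch of what such a --gres-based batch script might look like, assuming a single node with 2 GPUs and one task per GPU; the container image is the one from the original report, and the echo command is just a placeholder for the real training step:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2    # one task per GPU; change this to resize the job
#SBATCH --gres=gpu:2           # every task on the node gets access to both GPUs
srun --container-image nvcr.io#nvidia/pytorch:22.02-py3 bash -c 'echo "PROC_ID=$SLURM_PROCID $(nvidia-smi -L)"'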

@flx42
Member

flx42 commented Mar 1, 2022

It might impact performance, but that will depend on the hardware you are currently running on.
See this comment on the NCCL project: NVIDIA/nccl#324 (comment); it mentions that CUDA IPC, and thus NVLink, cannot be used if processes see different GPUs.
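As a quick way to check whether this matters on a given node, nvidia-smi can print the GPU interconnect topology; entries reported as NV# indicate NVLink connections, which per the comment above can only be exploited when the processes see both GPUs:

$ nvidia-smi topo -m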
