NCCL not working when each rank only sees its own GPU #1066

Open
casparvl opened this issue Nov 13, 2023 · 1 comment

casparvl commented Nov 13, 2023

Issue
I'm working on a SLURM system, where I noticed the following issue with a small synthetic benchmark running PyTorch DDP with NCCL as the backend. I ran it in two ways, which I'll refer to as Case 1 and Case 2. Case 1:

$ srun -n 2 -c 18 --gpus-per-task 1 python3 pytorch_synthetic_benchmark.py --use-ddp --num-iter 2
Iter #0: 687.1 img/sec per GPU
Iter #1: 669.4 img/sec per GPU

But if I specify --gpus instead of --gpus-per-task (Case 2):

$ srun -n 2 -c 18 --gpus 2 python3 pytorch_synthetic_benchmark.py --use-ddp --num-iter 2
Iter #0: 791.6 img/sec per GPU
Iter #1: 767.6 img/sec per GPU

As you can see, performance in the second case is much better. Using dcgmi dmon -e 1011,1012, I noticed that the first run was not using NVLink, whereas the second run was.
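For context, the only NCCL traffic in this benchmark is DDP's gradient synchronization, which boils down to allreduce calls. Below is a minimal standalone sketch of that pattern (this is not the actual pytorch_synthetic_benchmark.py; it assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT have been exported from the SLURM environment before launch):

import os
import torch
import torch.distributed as dist

# Rendezvous over the NCCL backend; RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are
# assumed to have been set from SLURM variables before this point.
dist.init_process_group(backend="nccl")

# One visible GPU per task -> always cuda:0; all GPUs visible -> map the local rank.
local_rank = int(os.environ.get("SLURM_LOCALID", 0))
torch.cuda.set_device(local_rank % torch.cuda.device_count())

# DDP's gradient synchronization is essentially allreduces like this one; it is
# this traffic that goes over NVLink in Case 2 but over SHM/PCIe in Case 1.
grads = torch.randn(25_000_000, device="cuda")   # ~100 MB of float32 "gradients"
dist.all_reduce(grads)
torch.cuda.synchronize()
dist.destroy_process_group()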

Running Case 1 with NCCL_DEBUG=INFO:

gcn6:1080060:1080060 [0] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080060:1080060 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080060:1080060 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080060:1080060 [0] NCCL INFO Using network IB
NCCL version 2.12.12+cuda11.7
gcn6:1080061:1080061 [0] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080061:1080061 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080061:1080061 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080061:1080061 [0] NCCL INFO Using network IB

gcn6:1080061:1080359 [0] misc/nvmlwrap.cc:181 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found

gcn6:1080060:1080354 [0] misc/nvmlwrap.cc:181 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
gcn6:1080060:1080354 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff
gcn6:1080061:1080359 [0] NCCL INFO Setting affinity for GPU 0 to 0f,fffc0000
gcn6:1080060:1080354 [0] NCCL INFO Channel 00/02 :    0   1
gcn6:1080060:1080354 [0] NCCL INFO Channel 01/02 :    0   1
gcn6:1080061:1080359 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
gcn6:1080060:1080354 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
gcn6:1080061:1080359 [0] NCCL INFO Channel 00 : 1[32000] -> 0[31000] via direct shared memory
gcn6:1080060:1080354 [0] NCCL INFO Channel 00 : 0[31000] -> 1[32000] via direct shared memory
gcn6:1080061:1080359 [0] NCCL INFO Channel 01 : 1[32000] -> 0[31000] via direct shared memory
gcn6:1080060:1080354 [0] NCCL INFO Channel 01 : 0[31000] -> 1[32000] via direct shared memory
gcn6:1080061:1080359 [0] NCCL INFO Connected all rings
gcn6:1080061:1080359 [0] NCCL INFO Connected all trees
gcn6:1080061:1080359 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080061:1080359 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gcn6:1080060:1080354 [0] NCCL INFO Connected all rings
gcn6:1080060:1080354 [0] NCCL INFO Connected all trees
gcn6:1080060:1080354 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080060:1080354 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gcn6:1080061:1080359 [0] NCCL INFO comm 0x7f356c0090d0 rank 1 nranks 2 cudaDev 0 busId 32000 - Init COMPLETE
gcn6:1080060:1080354 [0] NCCL INFO comm 0x7f6e800090d0 rank 0 nranks 2 cudaDev 0 busId 31000 - Init COMPLETE
gcn6:1080060:1080060 [0] NCCL INFO Launch mode Parallel

While running Case 2:

host: gcn6.local.snellius.surf.nl, rank: 0, local_rank: 0
gcn6:1080641:1080641 [0] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080641:1080641 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080641:1080641 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080641:1080641 [0] NCCL INFO Using network IB
NCCL version 2.12.12+cuda11.7
gcn6:1080642:1080642 [1] NCCL INFO Bootstrap : Using eno1np0:172.18.62.6<0>
gcn6:1080642:1080642 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gcn6:1080642:1080642 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/RoCE [RO]; OOB eno1np0:172.18.62.6<0>
gcn6:1080642:1080642 [1] NCCL INFO Using network IB
gcn6:1080641:1080851 [0] NCCL INFO Setting affinity for GPU 0 to 03ffff
gcn6:1080642:1080858 [1] NCCL INFO Setting affinity for GPU 1 to 0f,fffc0000
gcn6:1080641:1080851 [0] NCCL INFO Channel 00/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 01/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 02/08 :    0   1
gcn6:1080642:1080858 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0
gcn6:1080641:1080851 [0] NCCL INFO Channel 03/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 04/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 05/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 06/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Channel 07/08 :    0   1
gcn6:1080641:1080851 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1
gcn6:1080642:1080858 [1] NCCL INFO Channel 00 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 00 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 01 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 01 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 02 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 02 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 03 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 03 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 04 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 04 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 05 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 05 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 06 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 06 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Channel 07 : 1[32000] -> 0[31000] via P2P/IPC/read
gcn6:1080641:1080851 [0] NCCL INFO Channel 07 : 0[31000] -> 1[32000] via P2P/IPC/read
gcn6:1080642:1080858 [1] NCCL INFO Connected all rings
gcn6:1080642:1080858 [1] NCCL INFO Connected all trees
gcn6:1080642:1080858 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080642:1080858 [1] NCCL INFO 8 coll channels, 8 p2p channels, 8 p2p channels per peer
gcn6:1080641:1080851 [0] NCCL INFO Connected all rings
gcn6:1080641:1080851 [0] NCCL INFO Connected all trees
gcn6:1080641:1080851 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
gcn6:1080641:1080851 [0] NCCL INFO 8 coll channels, 8 p2p channels, 8 p2p channels per peer
gcn6:1080642:1080858 [1] NCCL INFO comm 0x7f61c40090d0 rank 1 nranks 2 cudaDev 1 busId 32000 - Init COMPLETE
gcn6:1080641:1080851 [0] NCCL INFO comm 0x7fcbf80090d0 rank 0 nranks 2 cudaDev 0 busId 31000 - Init COMPLETE
gcn6:1080641:1080641 [0] NCCL INFO Launch mode Parallel

The big difference between --gpus-per-task 1 and --gpus 2 is that in the first case, SLURM limits each rank's access to a single GPU:

$ srun -n 2 -c 18 --gpus-per-task 1 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-dfc2d25d-2803-f8e4-17b1-5d2bf5838777)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ee9d03c0-33f4-ab88-1362-ace6ce89575d)

Whereas in Case 2, each rank has access to both GPUs:

$ srun -n 2 -c 18 --gpus 2 nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ee9d03c0-33f4-ab88-1362-ace6ce89575d)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-dfc2d25d-2803-f8e4-17b1-5d2bf5838777)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ee9d03c0-33f4-ab88-1362-ace6ce89575d)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-dfc2d25d-2803-f8e4-17b1-5d2bf5838777)

Potentially related issues
#1017
#324
NVIDIA/pyxis#73

Actual question
I have a pretty good grasp on what is happening here: I guess the NCCL init fails to discover that both GPUs are physically connected, since each process is limited to its own GPU by a cgroup (set by SLURM). My actual question is: is this a bug/limitation in how NCCL is initialized? I.e. if there were a way to discover across cgroups that the other GPU is in the same node, would it help? Or would a successful init not even help, because a process (e.g. rank 0, running its compute on GPU 0) needs access to 'the other' process's GPU (i.e. GPU 1) in order to even do IPC, and that access is simply not possible due to the cgroups?

This comment and this comment seem to suggest the latter, but since that ticket is about GPUs being isolated in different containers (whereas in this case they are 'only' in different cgroups), I wasn't sure. I don't know the technical details of IPC, but I would half expect this to be handled at a different level (kernel? driver, i.e. root?) than the user process, in which case it would/should be possible to communicate across cgroups.
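To illustrate the access question at the user level (this is just PyTorch's CUDA peer-access query, not what NCCL does internally):

import torch

n = torch.cuda.device_count()
print("visible GPUs:", n)
if n > 1:
    # Case 2 (--gpus 2): both devices are visible, so rank 0 can at least ask
    # whether direct peer-to-peer access (e.g. over NVLink) is possible.
    print("peer access 0 <-> 1:", torch.cuda.can_device_access_peer(0, 1))
else:
    # Case 1 (--gpus-per-task 1): the peer GPU is simply not visible from this
    # process's cgroup, so the question cannot even be asked, let alone an IPC
    # mapping to that device be opened.
    print("peer GPU not visible from this process")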

It would be a shame if there were no resolution to this, for two reasons:

  1. The silent fallback to communication over PCIe could mean a lot of users on SLURM systems are (unknowingly) leaving a lot of performance on the table.

  2. While I could recommend that users on our cluster use --gpus or --gpus-per-node (which also doesn't put GPUs in a per-task cgroup) instead of --gpus-per-task, the advantage of --gpus-per-task is that it is easier on the user code: the code doesn't have to handle device placement explicitly for each rank, since each process only sees a single GPU (see the sketch below).
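For completeness, the difference in user code looks roughly like this (a sketch; SLURM_LOCALID is the per-node task ID set by srun, LOCAL_RANK is what torchrun would set):

import os
import torch

if torch.cuda.device_count() == 1:
    # --gpus-per-task 1: each process sees exactly one GPU, so cuda:0 is always
    # "my" GPU and no explicit placement logic is needed.
    device = torch.device("cuda", 0)
else:
    # --gpus / --gpus-per-node: every process sees all GPUs on the node, so the
    # code has to map its per-node rank onto a device itself.
    local_rank = int(os.environ.get("SLURM_LOCALID", os.environ.get("LOCAL_RANK", 0)))
    device = torch.device("cuda", local_rank % torch.cuda.device_count())

torch.cuda.set_device(device)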

@sjeaugey
Member

Indeed, NCCL needs to see all GPUs on the same node for NVLink detection to work properly.
