Why NVLINK doesn't work across the containers in the same node #324

Open
cheyang opened this issue Apr 14, 2020 · 19 comments

cheyang commented Apr 14, 2020

1. Issue or feature description

I run two containers that share the ipc, net, uts, and pid namespaces, but they still connect through the network (NET/Socket) instead of P2P. If I run both processes in the same container, they are able to use P2P.

2. Steps to reproduce the issue

  1. The docker run commands:
docker run --env NVIDIA_VISIBLE_DEVICES=0,1 \
-tid \
-v /dev:/dev  \
--name=pytorch_test_0 \
--net=host \
--ipc=host \
--uts=host \
--pid=host \
-v /root:/root \
pytorch/pytorch:1.3-cuda10.1-cudnn7-runtime bash

docker run --env NVIDIA_VISIBLE_DEVICES=0,1 \
-v /dev:/dev  \
-tid --name=pytorch_test_1 \
--net=host \
--ipc=host \
--uts=host \
--pid=host \
-v /root:/root \
pytorch/pytorch:1.3-cuda10.1-cudnn7-runtime \
bash
  2. Then I run dist_cifar10.py in both containers, whose init code includes (a fuller sketch follows these steps):
torch.cuda.set_device(gpu_id)
torch.distributed.init_process_group(backend='nccl')

The start commands in the two containers are:

DATA_DIR=/root/pytorch_test/cifar10 VISIBLE_DEVICE_LIST=0 NCCL_DEBUG=INFO RANK=0 WORLD_SIZE=2 MASTER_ADDR=127.0.0.1 MASTER_PORT=30000 python example.py 2>&1 | tee /root/nccl_1.log

DATA_DIR=/root/pytorch_test/cifar10 VISIBLE_DEVICE_LIST=1 NCCL_DEBUG=INFO RANK=1 WORLD_SIZE=2 MASTER_ADDR=127.0.0.1 MASTER_PORT=30000 python example.py 2>&1 | tee /root/nccl_2.log
  3. From the logs, the rings go through NET/Socket:
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/1
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO Ring 02 : 0 -> 1 [receive] via NET/Socket/0
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO Ring 02 : 1 -> 0 [send] via NET/Socket/0
  4. But if I run the commands from step 2 in the same container, I can see that P2P is enabled:
iZuf6a4vs2nipcmiuh1pjwZ:73287:73984 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
iZuf6a4vs2nipcmiuh1pjwZ:73287:73984 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
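For reference, a minimal sketch of what the init portion of the script presumably looks like; the actual dist_cifar10.py/example.py is not shown here, and reading the GPU index from VISIBLE_DEVICE_LIST is an assumption based on the start commands above:

# Hypothetical reconstruction of the init code in example.py (not the actual
# script from this issue). Assumes VISIBLE_DEVICE_LIST holds the local GPU
# index and RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set as in the start
# commands above.
import os
import torch
import torch.distributed as dist

gpu_id = int(os.environ.get("VISIBLE_DEVICE_LIST", "0"))
torch.cuda.set_device(gpu_id)

# With no init_method, init_process_group defaults to env:// and reads RANK,
# WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment.
dist.init_process_group(backend="nccl")

# A small all-reduce forces NCCL to build its rings; with NCCL_DEBUG=INFO the
# log then shows whether they go via P2P/IPC or NET/Socket.
x = torch.ones(1, device=f"cuda:{gpu_id}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all-reduce result {x.item()}")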

cheyang commented Apr 14, 2020

In each container, I get the following result:

nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity
GPU0	 X 	NV1	0-81
GPU1	NV1	 X 	0-81

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

@sjeaugey (Member)

Which NCCL version are you using? If running NCCL >= 2.5, can you set NCCL_DEBUG_SUBSYS=GRAPH NCCL_DEBUG=INFO and post the log?

cheyang commented Apr 15, 2020

The NCCL version is 2.4.8+cuda10.1.

@sjeaugey (Member)

Thanks. In NCCL 2.4.8, we were reading files in /proc/self/ns/ to determine whether we were on the same node or not (see getHostHash in src/misc/utils.cc). Depending on your node setup, it might incorrectly consider the two containers to be different nodes.

In recent versions of NCCL, we now read /proc/sys/kernel/random/boot_id, which is more reliable. So you might want to try a newer version of NCCL or change your namespace configuration.
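A rough Python illustration of the two strategies (not NCCL's actual C implementation of getHostHash in src/misc/utils.cc; the exact set of namespace files NCCL consulted is an assumption here). Run in each container: the namespace-derived value only matches if every namespace listed is shared, while the boot_id-derived value matches for any two containers on the same physical node:

# Illustration only: compare a namespace-derived host identifier with a
# boot_id-derived one. Not NCCL code; the namespace list below is a guess.
import hashlib
import os

def ns_based_id():
    # Derive an identifier from the /proc/self/ns/ symlink targets. Two
    # containers only agree if all of these namespaces are shared.
    parts = []
    for ns in ("uts", "ipc", "net", "pid", "mnt"):
        try:
            parts.append(os.readlink(f"/proc/self/ns/{ns}"))
        except OSError:
            parts.append("")
    return hashlib.md5("".join(parts).encode()).hexdigest()

def boot_id_based_id():
    # The kernel boot_id is identical for every container on the same node,
    # regardless of namespace configuration.
    with open("/proc/sys/kernel/random/boot_id") as f:
        return hashlib.md5(f.read().strip().encode()).hexdigest()

print("namespace-based:", ns_based_id())
print("boot_id-based:  ", boot_id_based_id())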

2sin18 commented Apr 15, 2020

What's the suggested way to launch nvidia-docker with NCCL?

Using the latest NCCL, I found that it fails to communicate across two Docker containers running on the same node with different --gpus arguments, with the error below reported:

misc/nvmlwrap.cc:144 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found 

But if I use --gpus all -e CUDA_VISIBLE_DEVICES= to launch nvidia-docker, it works.

I cannot find notes about this in either the NCCL or the nvidia-docker documentation. Is it by design but undocumented, or an unexpected problem? I will create an issue ticket if it's a real problem.

@sjeaugey (Member)

I guess when using --gpus you're effectively hiding the other GPU completely, so we can no longer find it through NVML and determine that the local GPU has an NVLink connection to it.
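A hedged way to observe this from inside a container started with --gpus "device=0", using the pynvml bindings; the PCI bus ID below is a placeholder for the hidden GPU's address:

# Inside a container that only maps GPU 0, asking NVML for the other GPU's
# handle by PCI bus ID fails, matching the "nvmlDeviceGetHandleByPciBusId()
# failed: Not Found" warning NCCL prints from misc/nvmlwrap.cc.
import pynvml

pynvml.nvmlInit()
print("GPUs visible to NVML:", pynvml.nvmlDeviceGetCount())  # 1 here

HIDDEN_GPU_BUS_ID = b"0000:50:00.0"  # placeholder bus ID of the unmapped GPU
try:
    pynvml.nvmlDeviceGetHandleByPciBusId(HIDDEN_GPU_BUS_ID)
except pynvml.NVMLError as err:
    print("NVML lookup failed:", err)
finally:
    pynvml.nvmlShutdown()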

@sjeaugey (Member)

It's actually a bit more complicated than that. We could, in theory, determine that our local GPU has NVLink connectivity to the other GPU, but we would not be able to determine what NVLink connections that other GPU has to other GPUs, which means we would not be able to draw the whole graph and look for rings properly. To solve that, we'd need each rank to share with the others not only its busId/cudaDev/... but pretty much all of its local information.

Using CUDA_VISIBLE_DEVICES is also not ideal, since it would only work on NVLink platforms and would probably be detrimental to PCI systems.

@WencongXiao

@sjeaugey Hi, the case mentioned by @2sin18 seems to happen only in the latest NCCL version. My test with version 2.4.8 works fine: the two local containers can communicate through sockets.
Do you have any idea when this behavior was introduced? Only in the latest version, or perhaps also in a previous release?

@sjeaugey (Member)

Hum, indeed, before 2.5, each GPU was exchanging its connectivity matrix with the others, so that case could work.

The topology detection (including NVLink) was completely rewritten in NCCL 2.5 and is now much more advanced, but indeed, it makes it harder to handle this particular case. Solving it in recent versions would require some work, though, plus additional out-of-band communication between the ranks.

2sin18 commented Apr 16, 2020

> Hum, indeed, before 2.5, each GPU was exchanging its connectivity matrix with the others, so that case could work.
>
> The topology detection (including NVLink) was completely rewritten in NCCL 2.5 and is now much more advanced, but indeed, it makes it harder to handle this particular case. Solving it in recent versions would require some work, though, plus additional out-of-band communication between the ranks.

I found that the problem has been an issue since NCCL 2.6. See #326

2sin18 commented Apr 17, 2020

> Which NCCL version are you using? If running NCCL >= 2.5, can you set NCCL_DEBUG_SUBSYS=GRAPH NCCL_DEBUG=INFO and post the log?

@cheyang I can reproduce this issue using NCCL 2.5. I can give my repro steps later. With NCCL 2.6 and later, it currently crashes (#326).

2sin18 commented Apr 17, 2020

Step 1

Clone nccl-tests (modified to be able to run without MPI) and build the binaries.

Step 2

Launch two Docker containers using the commands below:

docker run \
--gpus "device=0" --ipc host --network host --pid host --uts host --ulimit memlock=-1:-1 \
--rm -it \
-e NCCL_DEBUG=INFO -e NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH \
-v /sys:/sys -v /path/to/my/workspace:/workspace -w /workspace \
my_docker_with_cuda_and_nccl_2507 bash

and

docker run \
--gpus "device=1" --ipc host --network host --pid host --uts host --ulimit memlock=-1:-1 \
--rm -it \
-e NCCL_DEBUG=INFO -e NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH \
-v /sys:/sys -v /path/to/my/workspace:/workspace -w /workspace \
my_docker_with_cuda_and_nccl_2507 bash

Step 3

Run the commands below in the corresponding containers (the image is based on nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04, with the original NCCL replaced by NCCL 2.5.7):

NCCL_COMM_ID=myip:myport WORLD_SIZE=2 RANK=0 ./all_reduce_perf

and

NCCL_COMM_ID=myip:myport WORLD_SIZE=2 RANK=1 ./all_reduce_perf

Results

# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid 413820 on paidist001 device  0 [0x4f] Tesla V100-SXM2-16GB
paidist001:413820:413820 [0] NCCL INFO Bootstrap : Using [0]eth0:192.168.1.120<0>
paidist001:413820:413820 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

paidist001:413820:413820 [0] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed

paidist001:413820:413820 [0] transport/net_ib.cc:120 NCCL WARN NET/IB : Unable to open device mlx5_bond_0
paidist001:413820:413820 [0] NCCL INFO NET/IB : No device found.
paidist001:413820:413820 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.120<0>
NCCL version 2.5.7ali+cuda10.0
paidist001:413820:413826 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff

paidist001:413820:413826 [0] misc/nvmlwrap.cc:144 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found 
paidist001:413820:413826 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
paidist001:413820:413826 [0] NCCL INFO /sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:04.0/0000:1a:00.0/0000:1b:11.0/0000:2d:00.0/virtio1 -> 0/0/0/0
paidist001:413820:413826 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
paidist001:413820:413826 [0] NCCL INFO CPU/0
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/18000
paidist001:413820:413826 [0] NCCL INFO             + PCI[12] - GPU/4F000 (0)
paidist001:413820:413826 [0] NCCL INFO + QPI[ 8] - CPU/FFFFFFFFFFFFFFFF
paidist001:413820:413826 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/17000
paidist001:413820:413826 [0] NCCL INFO             + PCI[12] - PCI/19040
paidist001:413820:413826 [0] NCCL INFO                         + PCI[12] - PCI/1B110
paidist001:413820:413826 [0] NCCL INFO                                     + PCI[12] - NIC/0
paidist001:413820:413826 [0] NCCL INFO                                                 + NET[12] - NET/0 (0)
paidist001:413820:413826 [0] NCCL INFO + QPI[ 8] - CPU/0
paidist001:413820:413826 [0] NCCL INFO ==========================================
paidist001:413820:413826 [0] NCCL INFO GPU/4F000 :GPU/4F000 (0/5000/0) CPU/0 (2/12/2) CPU/FFFFFFFFFFFFFFFF (3/8/3) NET/0 (8/8/3) 
paidist001:413820:413826 [0] NCCL INFO NET/0 :GPU/4F000 (8/8/3) CPU/0 (6/8/3) CPU/FFFFFFFFFFFFFFFF (5/12/2) NET/0 (0/5000/0) 
paidist001:413820:413826 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 12/6, nvlink 1, type 3, sameChannels 1
paidist001:413820:413826 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
paidist001:413820:413826 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 6/6, nvlink 1, type 3, sameChannels 1
paidist001:413820:413826 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
paidist001:413820:413826 [0] NCCL INFO NCCL_MAX_NRINGS set by environment to 12.
paidist001:413820:413826 [0] NCCL INFO NCCL_MIN_NRINGS set by environment to 4.
paidist001:413820:413826 [0] NCCL INFO Channel 00/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Channel 01/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Channel 02/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Channel 03/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Threads per block : 512/640/256
paidist001:413820:413826 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
paidist001:413820:413826 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1 [2] 1/-1/-1->0->-1|-1->0->1/-1/-1 [3] -1/-1/-1->0->1|1->0->-1/-1/-1
paidist001:413820:413826 [0] NCCL INFO Ring 00 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 01 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 02 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 03 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO comm 0x7fd2d8001990 rank 0 nranks 2 cudaDev 0 busId 4f000 - Init COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
paidist001:413820:413820 [0] NCCL INFO Launch mode Parallel
    33554432       8388608   float     sum   8783.6    3.82    3.82  0e+00   8779.3    3.82    3.82  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.82105 
#

@sjeaugey (Member)

@2sin18 I think there are two different use cases.

  1. All GPUs are visible in each container, i.e. docker run --gpus=all
    I believe this is what the original bug report is about (@cheyang can confirm). I could verify this works, at least with NCCL 2.6, with full performance.

  2. One GPU is visible in each container, i.e. docker run --gpus="device=N"
    This is your example, which happens to crash with 2.6.4 (see #326, "Topology detection crashes for all_reduce_perf across the docker containers in same node") unless you apply the fix.
    In this situation, we cannot use CUDA IPCs since we don't see the other GPU (see the sketch below), and therefore we cannot use NVLink. I tried injecting the full topology to force NCCL to use NVLink, but the CUDA IPC mapping failed.

@WencongXiao I'm not sure how to understand your comment above about 2.4.8 working better than 2.5 or 2.6 ... it should not be the case, except for the crash on 2.6.4 (which has a fix now). Please let me know if I'm missing a use case.
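To make case 2 above concrete, a quick check (a sketch, not NCCL code) inside a --gpus "device=N" container shows that the peer GPU is not even enumerable by CUDA, so there is nothing for P2P/IPC to attach to:

# Device visibility check. With --gpus=all (case 1) both GPUs are enumerable
# and the peer-access query can succeed; with --gpus "device=N" (case 2) only
# one device exists in this process, so NVLink P2P cannot be set up from here.
import torch

n = torch.cuda.device_count()
print("devices visible to CUDA:", n)  # 1 in a --gpus "device=N" container
if n > 1:
    print("P2P 0<->1 possible:", torch.cuda.can_device_access_peer(0, 1))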

@WencongXiao

My previous comment just wanted to share the information that 2.4.8 does not fail in case 2. Thanks @sjeaugey for your help in quickly identifying the bug.

Do you think it would be possible to support NVLINK communication in the case-2 mode for multiple containers in the future? And, going one step further, in non-host-mode containers?

@sjeaugey (Member)

@WencongXiao That would need to come from CUDA, since even when I try to force CUDA IPCs between containers, we get an error when opening the IPC handle.

zrss commented May 18, 2020

/cc

zrss commented May 23, 2020

By the way, currently I set the same NCCL_HOSTID and mount the same /dev/shm directory for the containers on one node; this makes NCCL (built from the master branch source) communicate via SHM across containers (without using --net=host).
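A sketch of this workaround, under the assumption that both containers also mount the same /dev/shm (e.g. -v /dev/shm:/dev/shm) and use an identical NCCL_HOSTID string:

# Sketch of zrss's workaround: export the same NCCL_HOSTID in every container
# on the node before the NCCL communicator is created, so NCCL treats the
# containers as one host and can use the SHM transport through the shared
# /dev/shm mount. "my-node-id" is an arbitrary placeholder; it just has to be
# identical across the containers.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_HOSTID", "my-node-id")

torch.cuda.set_device(0)
dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE/MASTER_* from env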

@Xiaoaier-Z-L

> By the way, currently I set the same NCCL_HOSTID and mount the same /dev/shm directory for the containers on one node; this makes NCCL (built from the master branch source) communicate via SHM across containers (without using --net=host).

When I do this with two nodes, it errors: see #879

@freelizhun

Hi, @cheyang. Have you solved this problem?
