Why NVLINK doesn't work across the containers in the same node #324

Open
cheyang opened this issue Apr 14, 2020 · 19 comments

cheyang commented Apr 14, 2020

1. Issue or feature description

I run two containers that share the ipc, net, uts, and pid namespaces, but they still connect through the network (NET/Socket) instead of P2P. If I run both processes in the same container, they are able to use P2P.

2. Steps to reproduce the issue

  1. The docker run commands:
docker run --env NVIDIA_VISIBLE_DEVICES=0,1 \
-tid \
-v /dev:/dev  \
--name=pytorch_test_0 \
--net=host \
--ipc=host \
--uts=host \
--pid=host \
-v /root:/root \
pytorch/pytorch:1.3-cuda10.1-cudnn7-runtime bash

docker run --env NVIDIA_VISIBLE_DEVICES=0,1 \
-v /dev:/dev  \
-tid --name=pytorch_test_1 \
--net=host \
--ipc=host \
--uts=host \
--pid=host \
-v /root:/root \
pytorch/pytorch:1.3-cuda10.1-cudnn7-runtime \
bash
  2. Then I run dist_cifar10.py in both containers, whose init code includes (a fuller sketch follows these steps):
torch.cuda.set_device(gpu_id)
torch.distributed.init_process_group(backend='nccl')

The start commands in the two containers are:

DATA_DIR=/root/pytorch_test/cifar10 VISIBLE_DEVICE_LIST=0 NCCL_DEBUG=INFO RANK=0 WORLD_SIZE=2 MASTER_ADDR=127.0.0.1 MASTER_PORT=30000 python example.py 2>&1 | tee /root/nccl_1.log

DATA_DIR=/root/pytorch_test/cifar10 VISIBLE_DEVICE_LIST=1 NCCL_DEBUG=INFO RANK=1 WORLD_SIZE=2 MASTER_ADDR=127.0.0.1 MASTER_PORT=30000 python example.py 2>&1 | tee /root/nccl_2.log
  3. From the logs, the rings go through NET/Socket:
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO Ring 01 : 1 -> 0 [send] via NET/Socket/1
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO Ring 02 : 0 -> 1 [receive] via NET/Socket/0
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO NET/Socket: Using 1 threads and 1 sockets per thread
iZuf6a4vs2nipcmiuh1pjwZ:45679:45951 [1] NCCL INFO Ring 02 : 1 -> 0 [send] via NET/Socket/0
  4. But if I run the commands from step 2 in the same container, I can see that P2P is enabled:
iZuf6a4vs2nipcmiuh1pjwZ:73287:73984 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
iZuf6a4vs2nipcmiuh1pjwZ:73287:73984 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
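For reference, a minimal sketch of what the init portion of the script presumably looks like; the actual dist_cifar10.py/example.py is not shown here, and reading the GPU index from VISIBLE_DEVICE_LIST is an assumption based on the start commands above:

# Hypothetical reconstruction of the init code in example.py (not the actual
# script from this issue). Assumes VISIBLE_DEVICE_LIST holds the local GPU
# index and RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set as in the start
# commands above.
import os
import torch
import torch.distributed as dist

gpu_id = int(os.environ.get("VISIBLE_DEVICE_LIST", "0"))
torch.cuda.set_device(gpu_id)

# With no init_method, init_process_group defaults to env:// and reads RANK,
# WORLD_SIZE, MASTER_ADDR and MASTER_PORT from the environment.
dist.init_process_group(backend="nccl")

# A small all-reduce forces NCCL to build its rings; with NCCL_DEBUG=INFO the
# log then shows whether they go via P2P/IPC or NET/Socket.
x = torch.ones(1, device=f"cuda:{gpu_id}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all-reduce result {x.item()}")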

cheyang commented Apr 14, 2020

In each container, I get the following result:

nvidia-smi topo -m
	GPU0	GPU1	CPU Affinity
GPU0	 X 	NV1	0-81
GPU1	NV1	 X 	0-81

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

@sjeaugey (Member)

Which NCCL version are you using? If running NCCL >= 2.5, can you set NCCL_DEBUG_SUBSYS=GRAPH NCCL_DEBUG=INFO and post the log?

cheyang commented Apr 15, 2020

The NCCL version is 2.4.8+cuda10.1.

@sjeaugey (Member)

Thanks. In NCCL 2.4.8, we were reading files in /proc/self/ns/ to determine whether we were on the same node or not (see getHostHash in src/misc/utils.cc). Depending on your node setup, it might incorrectly consider the two containers to be different nodes.

In recent versions of NCCL, we now read /proc/sys/kernel/random/boot_id, which is more reliable. So you might want to try a newer version of NCCL or change your namespace configuration.
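A rough Python illustration of the two strategies (not NCCL's actual C implementation of getHostHash in src/misc/utils.cc; the exact set of namespace files NCCL consulted is an assumption here). Run in each container: the namespace-derived value only matches if every namespace listed is shared, while the boot_id-derived value matches for any two containers on the same physical node:

# Illustration only: compare a namespace-derived host identifier with a
# boot_id-derived one. Not NCCL code; the namespace list below is a guess.
import hashlib
import os

def ns_based_id():
    # Derive an identifier from the /proc/self/ns/ symlink targets. Two
    # containers only agree if all of these namespaces are shared.
    parts = []
    for ns in ("uts", "ipc", "net", "pid", "mnt"):
        try:
            parts.append(os.readlink(f"/proc/self/ns/{ns}"))
        except OSError:
            parts.append("")
    return hashlib.md5("".join(parts).encode()).hexdigest()

def boot_id_based_id():
    # The kernel boot_id is identical for every container on the same node,
    # regardless of namespace configuration.
    with open("/proc/sys/kernel/random/boot_id") as f:
        return hashlib.md5(f.read().strip().encode()).hexdigest()

print("namespace-based:", ns_based_id())
print("boot_id-based:  ", boot_id_based_id())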

2sin18 commented Apr 15, 2020

What's the suggested way to launch nvidia-docker with NCCL?

Using the latest NCCL, I found that it fails to communicate across two Docker containers running on the same node with different --gpus arguments, with the error below reported:

misc/nvmlwrap.cc:144 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found 

But if I use --gpus all -e CUDA_VISIBLE_DEVICES= to launch nvidia-docker, it works.

I cannot find notes about this in either the NCCL or the nvidia-docker documentation. Is it by design but undocumented, or an unexpected problem? I will create an issue ticket if it's a real problem.

@sjeaugey (Member)

I guess when using --gpus you're effectively hiding the other GPU completely, so we can no longer find it through NVML and determine that the local GPU has an NVLink connection to it.
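A hedged way to observe this from inside a container started with --gpus "device=0", using the pynvml bindings; the PCI bus ID below is a placeholder for the hidden GPU's address:

# Inside a container that only maps GPU 0, asking NVML for the other GPU's
# handle by PCI bus ID fails, matching the "nvmlDeviceGetHandleByPciBusId()
# failed: Not Found" warning NCCL prints from misc/nvmlwrap.cc.
import pynvml

pynvml.nvmlInit()
print("GPUs visible to NVML:", pynvml.nvmlDeviceGetCount())  # 1 here

HIDDEN_GPU_BUS_ID = b"0000:50:00.0"  # placeholder bus ID of the unmapped GPU
try:
    pynvml.nvmlDeviceGetHandleByPciBusId(HIDDEN_GPU_BUS_ID)
except pynvml.NVMLError as err:
    print("NVML lookup failed:", err)
finally:
    pynvml.nvmlShutdown()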

@sjeaugey (Member)

It's actually a bit more complicated than that. We could, in theory, determine that our local GPU has NVLink connectivity to the other GPU, but we would not be able to determine what NVLink connections that other GPU has to other GPUs, which means we would not be able to draw the whole graph and look for rings properly. To solve that, we'd need each rank to share with the others not only its busId/cudaDev/... but pretty much all of its local information.

Using CUDA_VISIBLE_DEVICES is also not ideal, since it would only work on NVLink platforms and would probably be detrimental to PCI systems.

@WencongXiao

@sjeaugey Hi, the case mentioned by @2sin18 seems to happen only in the latest NCCL version. My test with version 2.4.8 works fine: the two local containers can communicate through sockets.
Do you have any idea when this behavior was introduced? Only in the latest version, or perhaps also in a previous release?

@sjeaugey (Member)

Hum, indeed, before 2.5, each GPU was exchanging its connectivity matrix with the others, so that case could work.

The topology detection (including NVLink) was completely rewritten in NCCL 2.5 and is now much more advanced, but indeed, it makes it harder to handle this particular case. Solving it in recent versions would require some work, though, plus additional out-of-band communication between the ranks.

2sin18 commented Apr 16, 2020

> Hum, indeed, before 2.5, each GPU was exchanging its connectivity matrix with the others, so that case could work.
>
> The topology detection (including NVLink) was completely rewritten in NCCL 2.5 and is now much more advanced, but indeed, it makes it harder to handle this particular case. Solving it in recent versions would require some work, though, plus additional out-of-band communication between the ranks.

I found that the problem has been an issue since NCCL 2.6. See #326

2sin18 commented Apr 17, 2020

> Which NCCL version are you using? If running NCCL >= 2.5, can you set NCCL_DEBUG_SUBSYS=GRAPH NCCL_DEBUG=INFO and post the log?

@cheyang I can reproduce this issue using NCCL 2.5. I can give my repro steps later. With NCCL 2.6 and later, it currently crashes (#326).

2sin18 commented Apr 17, 2020

Step 1

Clone nccl-tests (modified to be able to run without MPI) and build the binaries.

Step 2

Launch two Docker containers using the commands below:

docker run \
--gpus "device=0" --ipc host --network host --pid host --uts host --ulimit memlock=-1:-1 \
--rm -it \
-e NCCL_DEBUG=INFO -e NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH \
-v /sys:/sys -v /path/to/my/workspace:/workspace -w /workspace \
my_docker_with_cuda_and_nccl_2507 bash

and

docker run \
--gpus "device=1" --ipc host --network host --pid host --uts host --ulimit memlock=-1:-1 \
--rm -it \
-e NCCL_DEBUG=INFO -e NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH \
-v /sys:/sys -v /path/to/my/workspace:/workspace -w /workspace \
my_docker_with_cuda_and_nccl_2507 bash

Step 3

Run the commands below in the corresponding containers (the image is based on nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04, with the original NCCL replaced by NCCL 2.5.7):

NCCL_COMM_ID=myip:myport WORLD_SIZE=2 RANK=0 ./all_reduce_perf

and

NCCL_COMM_ID=myip:myport WORLD_SIZE=2 RANK=1 ./all_reduce_perf

Results

# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid 413820 on paidist001 device  0 [0x4f] Tesla V100-SXM2-16GB
paidist001:413820:413820 [0] NCCL INFO Bootstrap : Using [0]eth0:192.168.1.120<0>
paidist001:413820:413820 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

paidist001:413820:413820 [0] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed

paidist001:413820:413820 [0] transport/net_ib.cc:120 NCCL WARN NET/IB : Unable to open device mlx5_bond_0
paidist001:413820:413820 [0] NCCL INFO NET/IB : No device found.
paidist001:413820:413820 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.120<0>
NCCL version 2.5.7ali+cuda10.0
paidist001:413820:413826 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff

paidist001:413820:413826 [0] misc/nvmlwrap.cc:144 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found 
paidist001:413820:413826 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
paidist001:413820:413826 [0] NCCL INFO /sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:04.0/0000:1a:00.0/0000:1b:11.0/0000:2d:00.0/virtio1 -> 0/0/0/0
paidist001:413820:413826 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
paidist001:413820:413826 [0] NCCL INFO CPU/0
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/18000
paidist001:413820:413826 [0] NCCL INFO             + PCI[12] - GPU/4F000 (0)
paidist001:413820:413826 [0] NCCL INFO + QPI[ 8] - CPU/FFFFFFFFFFFFFFFF
paidist001:413820:413826 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/17000
paidist001:413820:413826 [0] NCCL INFO             + PCI[12] - PCI/19040
paidist001:413820:413826 [0] NCCL INFO                         + PCI[12] - PCI/1B110
paidist001:413820:413826 [0] NCCL INFO                                     + PCI[12] - NIC/0
paidist001:413820:413826 [0] NCCL INFO                                                 + NET[12] - NET/0 (0)
paidist001:413820:413826 [0] NCCL INFO + QPI[ 8] - CPU/0
paidist001:413820:413826 [0] NCCL INFO ==========================================
paidist001:413820:413826 [0] NCCL INFO GPU/4F000 :GPU/4F000 (0/5000/0) CPU/0 (2/12/2) CPU/FFFFFFFFFFFFFFFF (3/8/3) NET/0 (8/8/3) 
paidist001:413820:413826 [0] NCCL INFO NET/0 :GPU/4F000 (8/8/3) CPU/0 (6/8/3) CPU/FFFFFFFFFFFFFFFF (5/12/2) NET/0 (0/5000/0) 
paidist001:413820:413826 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 12/6, nvlink 1, type 3, sameChannels 1
paidist001:413820:413826 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
paidist001:413820:413826 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 6/6, nvlink 1, type 3, sameChannels 1
paidist001:413820:413826 [0] NCCL INFO  0 : NET/0 GPU/0 NET/0
paidist001:413820:413826 [0] NCCL INFO NCCL_MAX_NRINGS set by environment to 12.
paidist001:413820:413826 [0] NCCL INFO NCCL_MIN_NRINGS set by environment to 4.
paidist001:413820:413826 [0] NCCL INFO Channel 00/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Channel 01/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Channel 02/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Channel 03/04 :    0   1
paidist001:413820:413826 [0] NCCL INFO Threads per block : 512/640/256
paidist001:413820:413826 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
paidist001:413820:413826 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1 [2] 1/-1/-1->0->-1|-1->0->1/-1/-1 [3] -1/-1/-1->0->1|1->0->-1/-1/-1
paidist001:413820:413826 [0] NCCL INFO Ring 00 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 01 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 02 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 03 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO comm 0x7fd2d8001990 rank 0 nranks 2 cudaDev 0 busId 4f000 - Init COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
paidist001:413820:413820 [0] NCCL INFO Launch mode Parallel
    33554432       8388608   float     sum   8783.6    3.82    3.82  0e+00   8779.3    3.82    3.82  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.82105 
#

@sjeaugey (Member)

@2sin18 I think there are two different use cases.

  1. All GPUs are visible in each container, i.e. docker run --gpus=all
    I believe this is what the original bug report is about (@cheyang can confirm). I could verify this works, at least with NCCL 2.6, with full performance.

  2. One GPU is visible in each container, i.e. docker run --gpus="device=N"
    This is your example, which happens to crash with 2.6.4 (see #326, "Topology detection crashes for all_reduce_perf across the docker containers in same node") unless you apply the fix.
    In this situation, we cannot use CUDA IPCs since we don't see the other GPU (see the sketch below), and therefore we cannot use NVLink. I tried injecting the full topology to force NCCL to use NVLink, but the CUDA IPC mapping failed.

@WencongXiao I'm not sure how to understand your comment above about 2.4.8 working better than 2.5 or 2.6 ... it should not be the case, except for the crash on 2.6.4 (which has a fix now). Please let me know if I'm missing a use case.
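To make case 2 above concrete, a quick check (a sketch, not NCCL code) inside a --gpus "device=N" container shows that the peer GPU is not even enumerable by CUDA, so there is nothing for P2P/IPC to attach to:

# Device visibility check. With --gpus=all (case 1) both GPUs are enumerable
# and the peer-access query can succeed; with --gpus "device=N" (case 2) only
# one device exists in this process, so NVLink P2P cannot be set up from here.
import torch

n = torch.cuda.device_count()
print("devices visible to CUDA:", n)  # 1 in a --gpus "device=N" container
if n > 1:
    print("P2P 0<->1 possible:", torch.cuda.can_device_access_peer(0, 1))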

@WencongXiao

My previous comment just wanted to share the information that 2.4.8 does not fail in case 2. Thanks @sjeaugey for your help in quickly identifying the bug.

Do you think it would be possible to support NVLINK communication in the case-2 mode for multiple containers in the future? And, going one step further, in non-host-mode containers?

@sjeaugey (Member)

@WencongXiao That would need to come from CUDA, since even when I try to force CUDA IPCs between containers, we get an error when opening the IPC handle.

zrss commented May 18, 2020

/cc

zrss commented May 23, 2020

By the way, currently I set the same NCCL_HOSTID and mount the same /dev/shm directory for the containers on one node; this makes NCCL (built from the master branch source) communicate via SHM across containers (without using --net=host).
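A sketch of this workaround, under the assumption that both containers also mount the same /dev/shm (e.g. -v /dev/shm:/dev/shm) and use an identical NCCL_HOSTID string:

# Sketch of zrss's workaround: export the same NCCL_HOSTID in every container
# on the node before the NCCL communicator is created, so NCCL treats the
# containers as one host and can use the SHM transport through the shared
# /dev/shm mount. "my-node-id" is an arbitrary placeholder; it just has to be
# identical across the containers.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_HOSTID", "my-node-id")

torch.cuda.set_device(0)
dist.init_process_group(backend="nccl")  # RANK/WORLD_SIZE/MASTER_* from env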

@Xiaoaier-Z-L

> By the way, currently I set the same NCCL_HOSTID and mount the same /dev/shm directory for the containers on one node; this makes NCCL (built from the master branch source) communicate via SHM across containers (without using --net=host).

When I do this with two nodes, it errors: see #879

@freelizhun

Hi, @cheyang. Have you solved this problem?
