Why NVLINK doesn't work across the containers in the same node #324
Comments
In each container, I can get the result.
Which NCCL version are you using? If running NCCL >= 2.5, can you set
The NCCL version is 2.4.8+cuda10.1.
Thanks. In NCCL 2.4.8, we were reading files in /proc/self/ns/ to determine whether we were on the same node or not (see getHostHash in src/misc/utils.cc). Depending on your node setup, it might incorrectly consider the two containers as different nodes. In recent versions of NCCL, we now read /proc/sys/kernel/random/boot_id, which is more reliable. So you might want to try a newer version of NCCL or change your namespace configuration.
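For illustration only, here is a minimal sketch of the boot_id approach; it is not NCCL's actual getHostHash code, and the hash function and error handling are simplified assumptions:

```c
/* Minimal sketch of a boot_id-based host hash (not NCCL's actual code). */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Simple FNV-1a hash; NCCL uses its own hashing, this is only illustrative. */
static uint64_t hashString(const char* s) {
  uint64_t h = 0xcbf29ce484222325ULL;
  for (; *s; s++) { h ^= (uint8_t)*s; h *= 0x100000001b3ULL; }
  return h;
}

static uint64_t getHostHashSketch(void) {
  char buf[256] = {0};
  /* Recent NCCL derives the host hash from boot_id, which is identical for
   * every container running on the same kernel, unlike /proc/self/ns/*. */
  FILE* f = fopen("/proc/sys/kernel/random/boot_id", "r");
  if (f) {
    if (!fgets(buf, sizeof(buf), f)) buf[0] = '\0';
    fclose(f);
  }
  return hashString(buf);
}

int main(void) {
  printf("host hash: %016" PRIx64 "\n", getHostHashSketch());
  return 0;
}
```

Running such a check in each of the two containers should print the same value whenever they share a kernel, which is what lets recent NCCL treat them as a single node.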
What's the suggested way to launch nvidia-docker with NCCL? Using the latest NCCL, I found it fails to use NCCL across 2 docker instances running on the same node with different
But if I use I do not find notes about this in either the NCCL or the nvidia-docker documentation. Is it by design but undocumented, or an unexpected problem? I will create an issue ticket if it's a real problem.
I guess when using
It's actually a bit more complicated than that. We could, in theory, determine that our local GPU has NVLink connectivity to the other GPU, but we would not be able to determine which NVLink connections that other GPU has to other GPUs, which means we would not be able to draw the whole graph and look for rings properly. To solve that, we'd need each rank to share with the others not only its busId/cudaDev/... but pretty much all of its local information. Using CUDA_VISIBLE_DEVICES is also not ideal, since it would only work on NVLink platforms and would probably be detrimental to PCI systems.
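As a purely hypothetical illustration (these are not NCCL's real structures), the gap is roughly between exchanging a rank's identity and exchanging its entire local view of the topology:

```c
/* Hypothetical illustration only; NCCL's real peer-info and topology
 * structures differ. The point is how much each rank would have to publish
 * for the others to rebuild the full NVLink graph across containers. */
#include <stdio.h>

#define MAX_LOCAL_LINKS 16

/* Roughly what a rank advertises about itself: identity and location. */
struct rankIdentity {
  int rank;
  int cudaDev;
  long busId;
  unsigned long hostHash;  /* derived from boot_id in recent NCCL */
};

/* What cross-container graph construction would additionally require:
 * every NVLink the local GPU has, so peers can search for rings. */
struct rankTopology {
  struct rankIdentity id;
  int nLinks;
  long linkPeerBusId[MAX_LOCAL_LINKS];  /* remote end of each NVLink */
  int linkWidthGBs[MAX_LOCAL_LINKS];    /* per-link bandwidth */
};

int main(void) {
  printf("identity: %zu bytes, full local topology: %zu bytes\n",
         sizeof(struct rankIdentity), sizeof(struct rankTopology));
  return 0;
}
```

Exchanging and merging the larger structure from every rank is the additional out-of-band work mentioned later in the thread.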
@sjeaugey Hi, the case mentioned by @2sin18 seems to happen only with the latest NCCL versions. My test with version 2.4.8 goes fine; the two local containers can communicate through sockets.
Hum, indeed, before 2.5, each GPU was exchanging its connectivity matrix with the others, so that case could work. The topology detection (including NVLink) has been completely rewritten in NCCL 2.5 and is now much more advanced, but indeed it makes this precise case harder to manage. Solving that in recent versions would require some work, though, and additional out-of-band communication between the ranks.
I find the problem has existed since NCCL 2.6. See #326
Step 1
Clone nccl-tests (modified to be able to run without MPI) and build the binaries.

Step 2
Launch two docker containers using the commands below:

docker run \
    --gpus "device=0" --ipc host --network host --pid host --uts host --ulimit memlock=-1:-1 \
    --rm -it \
    -e NCCL_DEBUG=INFO -e NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH \
    -v /sys:/sys -v /path/to/my/workspace:/workspace -w /workspace \
    my_docker_with_cuda_and_nccl_2507 bash

and

docker run \
    --gpus "device=1" --ipc host --network host --pid host --uts host --ulimit memlock=-1:-1 \
    --rm -it \
    -e NCCL_DEBUG=INFO -e NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH \
    -v /sys:/sys -v /path/to/my/workspace:/workspace -w /workspace \
    my_docker_with_cuda_and_nccl_2507 bash

Step 3
Run the commands below in the corresponding containers (the container is based on nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04 and replaces the original NCCL with NCCL 2.5.7):

NCCL_COMM_ID=myip:myport WORLD_SIZE=2 RANK=0 ./all_reduce_perf

and

NCCL_COMM_ID=myip:myport WORLD_SIZE=2 RANK=1 ./all_reduce_perf

Results
# nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 413820 on paidist001 device 0 [0x4f] Tesla V100-SXM2-16GB
paidist001:413820:413820 [0] NCCL INFO Bootstrap : Using [0]eth0:192.168.1.120<0>
paidist001:413820:413820 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
paidist001:413820:413820 [0] misc/ibvwrap.cc:212 NCCL WARN Call to ibv_open_device failed
paidist001:413820:413820 [0] transport/net_ib.cc:120 NCCL WARN NET/IB : Unable to open device mlx5_bond_0
paidist001:413820:413820 [0] NCCL INFO NET/IB : No device found.
paidist001:413820:413820 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.1.120<0>
NCCL version 2.5.7ali+cuda10.0
paidist001:413820:413826 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff
paidist001:413820:413826 [0] misc/nvmlwrap.cc:144 NCCL WARN nvmlDeviceGetHandleByPciBusId() failed: Not Found
paidist001:413820:413826 [0] NCCL INFO Intel CPU (PCI 12, InterCpu 8)
paidist001:413820:413826 [0] NCCL INFO /sys/devices/pci0000:17/0000:17:00.0/0000:18:00.0/0000:19:04.0/0000:1a:00.0/0000:1b:11.0/0000:2d:00.0/virtio1 -> 0/0/0/0
paidist001:413820:413826 [0] NCCL INFO === System : maxWidth 12 maxSpeed 12 ===
paidist001:413820:413826 [0] NCCL INFO CPU/0
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/18000
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - GPU/4F000 (0)
paidist001:413820:413826 [0] NCCL INFO + QPI[ 8] - CPU/FFFFFFFFFFFFFFFF
paidist001:413820:413826 [0] NCCL INFO CPU/FFFFFFFFFFFFFFFF
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/17000
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/19040
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - PCI/1B110
paidist001:413820:413826 [0] NCCL INFO + PCI[12] - NIC/0
paidist001:413820:413826 [0] NCCL INFO + NET[12] - NET/0 (0)
paidist001:413820:413826 [0] NCCL INFO + QPI[ 8] - CPU/0
paidist001:413820:413826 [0] NCCL INFO ==========================================
paidist001:413820:413826 [0] NCCL INFO GPU/4F000 :GPU/4F000 (0/5000/0) CPU/0 (2/12/2) CPU/FFFFFFFFFFFFFFFF (3/8/3) NET/0 (8/8/3)
paidist001:413820:413826 [0] NCCL INFO NET/0 :GPU/4F000 (8/8/3) CPU/0 (6/8/3) CPU/FFFFFFFFFFFFFFFF (5/12/2) NET/0 (0/5000/0)
paidist001:413820:413826 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 1, speed 12/6, nvlink 1, type 3, sameChannels 1
paidist001:413820:413826 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
paidist001:413820:413826 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 1, speed 6/6, nvlink 1, type 3, sameChannels 1
paidist001:413820:413826 [0] NCCL INFO 0 : NET/0 GPU/0 NET/0
paidist001:413820:413826 [0] NCCL INFO NCCL_MAX_NRINGS set by environment to 12.
paidist001:413820:413826 [0] NCCL INFO NCCL_MIN_NRINGS set by environment to 4.
paidist001:413820:413826 [0] NCCL INFO Channel 00/04 : 0 1
paidist001:413820:413826 [0] NCCL INFO Channel 01/04 : 0 1
paidist001:413820:413826 [0] NCCL INFO Channel 02/04 : 0 1
paidist001:413820:413826 [0] NCCL INFO Channel 03/04 : 0 1
paidist001:413820:413826 [0] NCCL INFO Threads per block : 512/640/256
paidist001:413820:413826 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64
paidist001:413820:413826 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1 [2] 1/-1/-1->0->-1|-1->0->1/-1/-1 [3] -1/-1/-1->0->1|1->0->-1/-1/-1
paidist001:413820:413826 [0] NCCL INFO Ring 00 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 01 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 02 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO Ring 03 : 0[4f000] -> 1[50000] via direct shared memory
paidist001:413820:413826 [0] NCCL INFO comm 0x7fd2d8001990 rank 0 nranks 2 cudaDev 0 busId 4f000 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
paidist001:413820:413820 [0] NCCL INFO Launch mode Parallel
33554432 8388608 float sum 8783.6 3.82 3.82 0e+00 8779.3 3.82 3.82 0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 3.82105
#
@2sin18 I think there are two different use cases.
@WencongXiao I'm not sure how to understand your comment above about 2.4.8 working better than 2.5 or 2.6 ... it should not be the case, except for the crash on 2.6.4 (which has a fix now). Please let me know if I'm missing a use case.
My previous comment just wanted to share that 2.4.8 won't fail in case 2. Thanks @sjeaugey for your help in quickly identifying the bug. Do you think it is possible to support NVLINK communication in the mode of case 2 for multiple containers in the future? And, one step further, in non-host-mode containers?
@WencongXiao That would need to come from CUDA, since even when I try to force CUDA IPCs between containers, we get an error when opening the IPC handle.
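To check this independently of NCCL, a minimal cross-process CUDA IPC probe could look roughly like the sketch below. The shared handle file path, device index, and error handling are assumptions made for illustration; run the export side in one container and the import side in the other. If cudaIpcOpenMemHandle fails in the importing container, CUDA IPC (and with it direct GPU P2P between the containers) is not available in that configuration.

```c
/* Minimal cross-container CUDA IPC probe (sketch, assumes both containers
 * can see /workspace). Compile with: nvcc ipc_check.cu -o ipc_check */
#include <cuda_runtime.h>
#include <stdio.h>
#include <string.h>

#define HANDLE_FILE "/workspace/ipc_handle.bin"  /* hypothetical shared path */

int main(int argc, char** argv) {
  if (argc < 2) { fprintf(stderr, "usage: %s export|import\n", argv[0]); return 1; }
  cudaSetDevice(0);

  if (strcmp(argv[1], "export") == 0) {
    void* buf = NULL;
    cudaIpcMemHandle_t handle;
    if (cudaMalloc(&buf, 1 << 20) != cudaSuccess) { fprintf(stderr, "cudaMalloc failed\n"); return 1; }
    if (cudaIpcGetMemHandle(&handle, buf) != cudaSuccess) { fprintf(stderr, "get handle failed\n"); return 1; }
    FILE* f = fopen(HANDLE_FILE, "wb");
    if (!f) { fprintf(stderr, "cannot write handle file\n"); return 1; }
    fwrite(&handle, sizeof(handle), 1, f);
    fclose(f);
    printf("handle written; keep this process alive, press Enter to exit\n");
    getchar();  /* keep the allocation alive while the importer runs */
    cudaFree(buf);
  } else {
    cudaIpcMemHandle_t handle;
    FILE* f = fopen(HANDLE_FILE, "rb");
    if (!f || fread(&handle, sizeof(handle), 1, f) != 1) { fprintf(stderr, "cannot read handle\n"); return 1; }
    fclose(f);
    void* remote = NULL;
    cudaError_t err = cudaIpcOpenMemHandle(&remote, handle, cudaIpcMemLazyEnablePeerAccess);
    if (err != cudaSuccess) {
      /* This is the failure mode described above when the containers are isolated. */
      fprintf(stderr, "cudaIpcOpenMemHandle failed: %s\n", cudaGetErrorString(err));
      return 1;
    }
    printf("CUDA IPC works across these processes\n");
    cudaIpcCloseMemHandle(remote);
  }
  return 0;
}
```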
/cc
By the way, currently, I set the same
When I do it with two nodes, it errors; see #879.
Hi, @cheyang. Have you solved this problem? |
1. Issue or feature description
I run two containers which share the ipc, net, uts, and pid namespaces, but they still connect through the network instead of P2P. However, if I run in the same container, they are able to use P2P.
2. Steps to reproduce the issue
And the start commands in the two containers are