nccl connection abort(between kubernetes pods): WARN NET/IB: read failed in ncclIbRoceGetVersionNum: Invalid argument #1573

Closed
limu713 opened this issue Jan 10, 2025 · 9 comments

Comments

@limu713

limu713 commented Jan 10, 2025

We use NCCL in Kubernetes with rdma-device-plugin.
Pods communicate over a macvlan sub-interface of a RoCE HCA, and each pod has its own GID indexes. When we run mpirun between two pods, the connection aborts. Tracing the NCCL code, we found that NCCL tries to read the file /sys/class/infiniband/$device/ports/$port_num/gid_attrs/types/$index, which cannot be read. The corresponding GID is in fact 0000:0000:0000:0000:0000:0000:0000:0000 (cat /sys/class/infiniband/$device/ports/$port_num/gids/$index).

Here is the show_gids output from one pod. Each device has populated GIDs at indexes 4, 5, 6 and 7.

root@test-macvlan-pod-2:/# show_gids
DEV     PORT  INDEX  GID                                      IPv4          VER  DEV
mlx5_0  1     4      fe80:0000:0000:0000:a4ed:ccff:fe8b:1994                v1   net1
mlx5_0  1     5      fe80:0000:0000:0000:a4ed:ccff:fe8b:1994                v2   net1
mlx5_0  1     6      0000:0000:0000:0000:0000:ffff:0a98:000c  10.152.0.12   v1   net1
mlx5_0  1     7      0000:0000:0000:0000:0000:ffff:0a98:000c  10.152.0.12   v2   net1
mlx5_1  1     4      fe80:0000:0000:0000:9459:9bff:fe54:7704                v1   net2
mlx5_1  1     5      fe80:0000:0000:0000:9459:9bff:fe54:7704                v2   net2
mlx5_1  1     6      0000:0000:0000:0000:0000:ffff:0a98:040c  10.152.4.12   v1   net2
mlx5_1  1     7      0000:0000:0000:0000:0000:ffff:0a98:040c  10.152.4.12   v2   net2
mlx5_2  1     4      fe80:0000:0000:0000:90ea:b5ff:fec5:3f24                v1   net3
mlx5_2  1     5      fe80:0000:0000:0000:90ea:b5ff:fec5:3f24                v2   net3
mlx5_2  1     6      0000:0000:0000:0000:0000:ffff:0a98:080c  10.152.8.12   v1   net3
mlx5_2  1     7      0000:0000:0000:0000:0000:ffff:0a98:080c  10.152.8.12   v2   net3
mlx5_3  1     4      fe80:0000:0000:0000:44aa:80ff:fea7:0c99                v1   net4
mlx5_3  1     5      fe80:0000:0000:0000:44aa:80ff:fea7:0c99                v2   net4
mlx5_3  1     6      0000:0000:0000:0000:0000:ffff:0a98:0c0c  10.152.12.12  v1   net4
mlx5_3  1     7      0000:0000:0000:0000:0000:ffff:0a98:0c0c  10.152.12.12  v2   net4
mlx5_4  1     4      fe80:0000:0000:0000:68be:c8ff:feaa:39b3                v1   net5
mlx5_4  1     5      fe80:0000:0000:0000:68be:c8ff:feaa:39b3                v2   net5
mlx5_4  1     6      0000:0000:0000:0000:0000:ffff:0a98:100c  10.152.16.12  v1   net5
mlx5_4  1     7      0000:0000:0000:0000:0000:ffff:0a98:100c  10.152.16.12  v2   net5
mlx5_5  1     4      fe80:0000:0000:0000:b82d:d1ff:fef4:35fe                v1   net6
mlx5_5  1     5      fe80:0000:0000:0000:b82d:d1ff:fef4:35fe                v2   net6
mlx5_5  1     6      0000:0000:0000:0000:0000:ffff:0a98:140c  10.152.20.12  v1   net6
mlx5_5  1     7      0000:0000:0000:0000:0000:ffff:0a98:140c  10.152.20.12  v2   net6
mlx5_6  1     4      fe80:0000:0000:0000:4802:daff:fedf:6783                v1   net7
mlx5_6  1     5      fe80:0000:0000:0000:4802:daff:fedf:6783                v2   net7
mlx5_6  1     6      0000:0000:0000:0000:0000:ffff:0a98:180c  10.152.24.12  v1   net7
mlx5_6  1     7      0000:0000:0000:0000:0000:ffff:0a98:180c  10.152.24.12  v2   net7
mlx5_7  1     4      fe80:0000:0000:0000:5034:7aff:fea5:3dea                v1   net8
mlx5_7  1     5      fe80:0000:0000:0000:5034:7aff:fea5:3dea                v2   net8
mlx5_7  1     6      0000:0000:0000:0000:0000:ffff:0a98:1c0c  10.152.28.12  v1   net8
mlx5_7  1     7      0000:0000:0000:0000:0000:ffff:0a98:1c0c  10.152.28.12  v2   net8
n_gids_found=32

The GID at every other index is 0000:0000:0000:0000:0000:0000:0000:0000. For example, on device mlx5_2:
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/0
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/1
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/2
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/3
0000:0000:0000:0000:0000:0000:0000:0000
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/4
fe80:0000:0000:0000:90ea:b5ff:fec5:3f24
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/5
fe80:0000:0000:0000:90ea:b5ff:fec5:3f24
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/6
0000:0000:0000:0000:0000:ffff:0a98:080c
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/7
0000:0000:0000:0000:0000:ffff:0a98:080c
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gids/8
0000:0000:0000:0000:0000:0000:0000:0000

If the GID is 0000:0000:0000:0000:0000:0000:0000:0000, its gid_attrs file cannot be read, and the read returns 'Invalid argument'.
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/0
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/0: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/1
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/1: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/2
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/2: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/3
cat: /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/3: Invalid argument
root@test-macvlan-pod-2:/# cat /sys/class/infiniband/mlx5_2/ports/1/gid_attrs/types/4
IB/RoCE v1
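
The manual checks above can be scripted. The following is a minimal Python sketch, assuming the sysfs layout shown in this report. `populated_gids` and its `root` parameter are hypothetical names introduced for illustration; they are not part of NCCL or OFED.

```python
from pathlib import Path

# The all-zero GID that sysfs reports for unpopulated table entries.
ZERO_GID = "0000:0000:0000:0000:0000:0000:0000:0000"


def populated_gids(dev, port, root=Path("/sys/class/infiniband")):
    """Return {index: gid} for entries whose sysfs GID is not all zeros.

    Only the gids/ directory is read, so unreadable gid_attrs/types
    files are never touched.
    """
    gid_dir = Path(root) / dev / "ports" / str(port) / "gids"
    table = {}
    for entry in gid_dir.iterdir():
        gid = entry.read_text().strip()
        if gid != ZERO_GID:
            table[int(entry.name)] = gid
    # Sort by index so the result reads like the show_gids listing.
    return dict(sorted(table.items()))
```

Given the `cat` output above, calling `populated_gids("mlx5_2", 1)` inside this pod would be expected to return only indexes 4 through 7.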

mpirun logs: mpirun_logs.txt

@limu713
Author

limu713 commented Jan 10, 2025

NCCL gets the GID with ibv_query_gid and checks whether it is valid (not all zeros and not link-local) before calling ncclIbRoceGetVersionNum. If the GID is invalid, ncclIbRoceGetVersionNum is not executed, which is the expected behavior. However, ibv_query_gid returns the real GID, which is valid. This GID should not be visible in the pod, because it does not belong to the pod's namespace.
Two possible solutions:

  1. Get the GID from sysfs, which returns the expected value.
  2. In ncclIbRoceGetVersionNum, return ncclSuccess when the read fails with the 'Invalid argument' errno.

The second solution is easier, and we have tested it:
dc445aa#diff-9d8bca27b23828e4b2e27640179cf6dea6e774a21317f5b434aaec3a67d8983eR255
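
To illustrate the second option, here is a rough Python analog. NCCL itself is C++, and this is not its code: `parse_roce_version`, `roce_version` and the `root` parameter are hypothetical names. The sketch assumes the type file contains either "IB/RoCE v1" (as seen above) or "RoCE v2", and treats EINVAL as "no version for this index" instead of a hard failure.

```python
import errno
from pathlib import Path

SYSFS_IB = Path("/sys/class/infiniband")


def parse_roce_version(type_str):
    """Map a gid_attrs/types string to 1 or 2; 0 means undetermined."""
    text = type_str.strip()
    if text in ("IB/RoCE v1", "RoCE v1"):
        return 1
    if text == "RoCE v2":
        return 2
    return 0


def roce_version(dev, port, index, root=SYSFS_IB):
    """Read the RoCE version of one GID index from sysfs.

    EINVAL (the index is not mapped into this container's sysfs) is
    reported as version 0 so that a caller scanning the table can move
    on to the next index. Any other OS error is re-raised.
    """
    path = Path(root) / dev / "ports" / str(port) / "gid_attrs" / "types" / str(index)
    try:
        text = path.read_text()
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return 0
        raise
    return parse_roce_version(text)
```

With this shape, a detection loop can skip any index that yields 0 and keep scanning, rather than aborting on the first unreadable entry.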

@gcongiu
Contributor

gcongiu commented Jan 10, 2025

Looks like the same issue as #1538 (comment). We might indeed need to avoid using ibv_query_gid in containers.

@limu713
Author

limu713 commented Jan 13, 2025

> Looks like the same issue as #1538 (comment). We might indeed need to avoid using ibv_query_gid in containers.

Yes, or use __ibv_query_gid_ex.
However, simply returning ncclSuccess on an 'Invalid argument' errno is simpler than using __ibv_query_gid_ex.
OFED has a similar solution, shown below.

Image

@gcongiu
Contributor

gcongiu commented Jan 14, 2025

Thank you @limu713. The following patch adds GID index initialization to your fix and (if working) will be included in the 2.26 release. Could you verify this still works?
containers-gid.patch
The patch is based on the latest GitHub master

@limu713
Author

limu713 commented Jan 15, 2025

> Thank you @limu713. The following patch adds GID index initialization to your fix and (if working) will be included in the 2.26 release. Could you verify this still works? containers-gid.patch The patch is based on the latest GitHub master

Hi, I just tested this patch on the master branch and it works well. The result is below.

Image

Image

@gcongiu
Contributor

gcongiu commented Jan 16, 2025

Great! Thank you @limu713

@jlamanna

I'm not a huge fan of returning success from a function that returns EINVAL. While this may work in the context of the detection loop, it doesn't make sense to call this function on a random GID and have it return success and a RoCE version of 0.

Instead, it makes more sense to call validGid() first before calling RoCE version functions if using NCCLCHECK()

@gcongiu
Contributor

gcongiu commented Jan 29, 2025

@jlamanna validGid() just checks that the GID format is valid. The GID itself is obtained using ibv_query_gid(), which uses the host GID table, not the GID table visible to the container Pod. I agree returning success when there is an EINVAL error is not the most elegant solution but it works for now. Ideally, ibv_query_gid() should be replaced with a function that queries the GID from the container Pod GID table rather than the host system GID table.
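
As a rough illustration of why the source of the GID matters, the sketch below applies a format-only check to GID strings. It approximates the "not all zeros and not link-local" rule described earlier in this thread using Python's `ipaddress` module; it is not NCCL's validGid(), and `gid_is_usable` is a hypothetical name.

```python
import ipaddress


def gid_is_usable(gid_text):
    """Format-only check: reject all-zero and link-local (fe80::/10) GIDs."""
    addr = ipaddress.IPv6Address(gid_text.strip())
    return not (addr.is_unspecified or addr.is_link_local)


# GIDs taken from the mlx5_2 listing in this issue.
print(gid_is_usable("0000:0000:0000:0000:0000:0000:0000:0000"))  # all zeros
print(gid_is_usable("fe80:0000:0000:0000:90ea:b5ff:fec5:3f24"))  # link-local
print(gid_is_usable("0000:0000:0000:0000:0000:ffff:0a98:080c"))  # IPv4-mapped
```

A check like this can only be as good as its input: fed a GID from the host table it may pass, while the same index read from the pod's sysfs would be all zeros and fail.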

kiskra-nvidia added a commit that referenced this issue Mar 13, 2025
Profiler improvements
 * Add events for CUDA kernel start and end.
 * Allow network plugins to generate profiling events
 * Enable profiling on a per-operation basis, rather than per-communicator.
 * Add support for graph capturing.

Add implicit launch order
 * Allow to prevent deadlocks when using multiple NCCL communicators per
   device by implicitly ordering NCCL operations using the host program
   order. Disabled by default, set NCCL_LAUNCH_ORDER_IMPLICIT=1 to enable.
 * Add a complementary mechanism to detect host threads racing to launch
   to the same device. Enabled by default, set NCCL_LAUNCH_RACE_FATAL=0 to
   disable.

Optimize the PAT algorithm
 * Separate the computation and execution of PAT steps on different warps,
   allowing to run up to 16 PAT steps in parallel to significantly
   accelerate PAT and reduce its linear part.

Add support for setting QoS per communicator
 * Add a new trafficClass field to the communicator configuration, to
   allow the application to select a particular traffic class for a
   given communicator. The meaning of the traffic class is
   network-specific and should be set in accordance with the network
   configuration.
 * For the IB/RoCE plugin, existing config variables such as NCCL_IB_SL
   and NCCL_IB_TC take precedence.

Allow to enable GPU Direct RDMA specifically on C2C platforms
 * Disabled by default, set NCCL_NET_GDR_C2C=1 to enable.

Do not disable user buffer registration unless PXN is really used
 * Only disable UB when a communicator has more than one rank per
   node on any node.

RAS subsystem improvements
 * Report operation counts separately for each collective operation type.
 * Provide details about missing communicator ranks and reliably
   distinguish ranks that are no longer a given communicator's members
   (now reported as NOCOMM) from those that failed to respond.

Add support for timestamps to NCCL diagnostic messages
 * On by default for WARN messages; NCCL_DEBUG_TIMESTAMP_LEVELS can be
   used to enable them for other debug levels as well.
 * The format can be changed using the NCCL_DEBUG_TIMESTAMP_FORMAT config
   variable.

Reduce the memory usage with NVLink SHARP (NVLS)
 * Potentially save hundreds of MBs of device memory, considering the
   multicast buffer size granularity separately from the address alignment.

Update performance tuning for recent Intel CPUs
 * Improve algorithm/protocol selection on recent CPUs such as Emerald
   Rapids and Sapphire Rapids.

Improve channel scheduling when mixing LL and Simple operations.
 * Make LL operations account for 4x more traffic to ensure LL and simple
   operations complete at the same time.

Refactor the plugin code
 * Clean up and harmonize the support code across the network, tuner,
   and profiler plugins.

Add support for comment lines (starting with #) in the nccl.conf file
 * Issue #1540.

Make user buffer registration problems print an INFO instead of a WARN.

Drop support for network plugin interface version 5.

Fix a race condition with split-shared communicators
 * NCCL could hang during connection setup if multiple communicators
   were grouped together that share resources.

Fix a performance regression when using NCCL_CROSS_NIC=1
 * NCCL would unnecessarily alternate rings, breaking the GPU-NIC
   associations.

Make GID index detection code more resilient
 * Dynamic GID detection code was giving up too soon if the
   detected index was not available (e.g., wasn't mapped to the
   container's sysfs).
 * Issues #1538, #1573.

Fix a race condition with non-blocking operation
 * Fix issue when creating a non-blocking communicator after a non-
   blocking collective operation on another communicator.

Fix shared memory usage on recent Blackwell GPUs.
 * Issues NVIDIA/nccl-tests#287, NVIDIA/nccl-tests#291, #1637.

Fix an error with NIC fusion and IB SHARP when recreating communicators
 * Disable the unloading of network plugins

Make the auto-merge failures in the NIC fusion non-fatal
 * This could happen when trying to merge IB and RoCE devices.

Fixes to ncclCommAbort
 * Fix hangs due to the progress thread spinning indefinitely on the
   network progress.
 * Reduce the abort time by up to two orders of magnitude.

Fix a crash when libnccl.so was dynamically unloaded
 * The RAS subsystem was missing a clean-up handler.

Fix a hang if the network plugin's test() call returns an error.

Fix a hang on heterogeneous architectures
 * Ensure we harmonize the tuning to avoid different tuning choices,
   causing a hang.

Fix double-free on failed ncclCommInitRank and ncclCommFinalize.

Fix a potential list traversal bug during a group launch of multiple
communicators
 * Issue #1599.

Unify the handling of NCCL configuration variables
 * Under rare circumstances, some variables specified in the config file
   could be ignored.
@kiskra-nvidia
Member

NCCL 2.26.2, which includes the fix for this bug, has been released. Closing.
