Running low_latency test on RoCE get IBGDA error #139

@TheRainstorm

I can run test_internode.py and test_intranode.py successfully, but the test_low_latency.py script fails. The error is mainly related to IBGDA (more complete output is at the end):

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA

I found the related issue #38, but it does not seem relevant to my error.

How should I run low_latency on RoCE? Are any extra settings required? Any help you can provide would be greatly appreciated.

Environment

  • Two nodes, each with 8 H20 GPUs, connected by ConnectX-6 (CX6) NICs (200 Gb/s, ~25 GB/s) in RoCE mode.
  • Using the latest DeepEP code.
  • nvshmem: 3.2.5-1
  • gdrcopy: 2.4.4

I followed the NVSHMEM install guide: I manually compiled and loaded the gdrdrv module and set the IBGDA-related NVIDIA driver options in /etc/modprobe.d (even though I use RoCE, not IB). My command to compile nvshmem is as follows:

CUDA_HOME=/opt/lib/cuda-12.4.1_normal/ \
GDRCOPY_HOME=/repo/gdrcopy-2.4.4 \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/repo/nvshmem_src/install
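
To double-check that the driver options from /etc/modprobe.d were actually picked up by the loaded nvidia module, a small check like the one below could help. It is a minimal sketch under assumptions: the expected options (NVreg_EnableStreamMemOPs=1 and PeerMappingOverride=1 inside NVreg_RegistryDwords) are the ones I understand the NVSHMEM install guide to list for IBGDA, the "Key: value" layout of /proc/driver/nvidia/params is assumed, and the function name check_ibgda_driver_params is hypothetical.

```python
def check_ibgda_driver_params(params_text):
    """Check IBGDA-related settings in /proc/driver/nvidia/params-style text.

    Assumes one "Key: value" pair per line, for example:
        EnableStreamMemOPs: 1
        RegistryDwords: "PeerMappingOverride=1;"
    """
    values = {}
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            # Drop surrounding whitespace and optional double quotes.
            values[key.strip()] = value.strip().strip('"')
    return {
        "EnableStreamMemOPs": values.get("EnableStreamMemOPs") == "1",
        "PeerMappingOverride": "PeerMappingOverride=1" in values.get("RegistryDwords", ""),
    }


if __name__ == "__main__":
    try:
        with open("/proc/driver/nvidia/params") as f:
            print(check_ibgda_driver_params(f.read()))
    except OSError as exc:
        print("could not read driver params:", exc)
```

If either value comes back False, the options file may not have been applied (for example, the module was not reloaded or the initramfs was not regenerated). This is only one possible cause of the error below.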

ibv_devinfo output

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.35.2000
        node_guid:                      946d:ae03:009c:f3cc
        sys_image_guid:                 946d:ae03:009c:f3cc
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         20.35.2000
        node_guid:                      946d:ae03:009c:f454
        sys_image_guid:                 946d:ae03:009c:f454
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

...
(omit mlx5_2 - mlx5_9)
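
In the output above, each device reports transport InfiniBand together with link_layer Ethernet, which is how RoCE ports normally appear. As a rough sketch for confirming this across all ten devices without reading the full dump, the helper below groups link_layer values by hca_id. The function name link_layers is hypothetical, and the parser only assumes the "hca_id:" and "link_layer:" line prefixes visible above.

```python
def link_layers(ibv_devinfo_text):
    """Map each hca_id to the list of link_layer values reported for its ports."""
    result = {}
    current = None
    for line in ibv_devinfo_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("hca_id:"):
            # A new device section starts here.
            current = stripped.split(":", 1)[1].strip()
            result[current] = []
        elif stripped.startswith("link_layer:") and current is not None:
            result[current].append(stripped.split(":", 1)[1].strip())
    return result
```

One way to feed it is `link_layers(subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout)`. Any device whose list contains "Ethernet" is running in RoCE mode.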

Log

$ MASTER_ADDR=xxx MASTER_PORT=8362 WORLD_SIZE=2 RANK=0 python tests/test_low_latency.py

Allocating buffer size: 2116.290944 MB ...
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_0. Skipping...

...


WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_3. Skipping...

WARN: GPU cannot map UAR of device mlx5_8. Skipping...

WARN: GPU cannot map UAR of device mlx5_1. Skipping...

WARN: GPU cannot map UAR of device mlx5_7. Skipping...

WARN: GPU cannot map UAR of device mlx5_6. Skipping...

WARN: GPU cannot map UAR of device mlx5_bond_0. Skipping...

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: GPU cannot map UAR of device mlx5_bond_0. Skipping...

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.



WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.


...

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/repo/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 16 bytes instead of 8

/repo/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 /repo/nvshmem_src/src/host/mem/mem_heap.cpp:222: non-zero status: -3 allgather of heap base for all PE failed

/repo/nvshmem_src/src/host/mem/mem_heap.cpp:588: non-zero status: 7 Failed to allgather PEs peer_base values

/repo/nvshmem_src/src/host/init/init.cu:1011: non-zero status: 7 nvshmem register static heaps failed

/repo/nvshmem_src/src/host/team/team.cu:nvshmem_team_split_strided:63: NVSHMEM API called before NVSHMEM initialization has completed
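
Because the output of several processes is interleaved, it is hard to see from the log which devices were skipped and how often. The sketch below counts the "GPU cannot map UAR" warnings per device. The regular expression is written against the exact wording of the warnings above, and the name skipped_devices is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as: "WARN: GPU cannot map UAR of device mlx5_0. Skipping..."
SKIP_RE = re.compile(r"GPU cannot map UAR of device (\S+)\. Skipping")


def skipped_devices(log_text):
    """Count how many times each device was skipped during IBGDA initialization."""
    return Counter(SKIP_RE.findall(log_text))
```

If every device listed by ibv_devinfo shows up in the result, no NIC could be mapped by the GPU, which would be consistent with the final "init failed for transport: IBGDA" line.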
