Test test_low_latency.py failed on H100 with ROCE #38

@ImbaPlayer

Issue Description

I'm working on an H100 GPU cluster with RoCE drivers installed on the network interface cards.
The test_intranode.py script runs successfully and produces the expected results, but test_low_latency.py consistently fails during NVSHMEM initialization.
Technical Details:

  • NVSHMEM version installed: 3.1.7-1 (following the README instructions)
  • Suspected compatibility issue: Potential mismatch between NVSHMEM version and RoCE configuration

I would greatly appreciate any assistance or insights to resolve this. Below are the specific error messages for reference:

Actual Result

(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py
local_rank:1, ip:127.0.0.1 port:3004
world_size:2, rank:1
local_rank:0, ip:127.0.0.1 port:3004
world_size:2, rank:0
setting....
setting....
setted...
rank:0, num_ranks:2
Allocating buffer size: 2116.292096 MB ...

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed
ibv_modify_qp failed


/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed
ep_connect failed


/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 transport create connect failed
transport create connect failed


/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
connect EPS failed


/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
nvshmem setup connections failed


/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 transport create connect failed

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110
ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7
ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074:
nvshmem initialization failed, exiting
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7
transport create connect failed

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed

/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting

W0303 20:33:45.904000 12712 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 12777 via signal SIGTERM
Traceback (most recent call last):
  File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 255
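
A note on the status codes: `ibv_modify_qp` is documented to return an errno value on failure, so status 110 can be looked up as a Linux errno (110 is ETIMEDOUT on Linux). The snippet below is only a small sketch of that lookup. `describe_errno` is a hypothetical helper name, and it assumes NVSHMEM reports the errno unchanged, which I have not confirmed.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return the symbolic name and message for an errno value on this platform."""
    # errno.errorcode maps numeric codes to names such as "ETIMEDOUT".
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} = {name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 110 is the status printed next to "ibv_modify_qp failed" in the log above.
    # The numeric value is platform-specific; run this on the Linux host itself.
    print(describe_errno(110))
```

The later status 7 values appear to be NVSHMEM's own error codes rather than errno values, so this lookup is only meant for the first failure.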

Env

Ubuntu 22.04

Linux ubuntu 5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:18:00.0 Off |                    0 |
| N/A   22C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:2A:00.0 Off |                    0 |
| N/A   25C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:3A:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   22C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9A:00.0 Off |                    0 |
| N/A   24C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:AB:00.0 Off |                    0 |
| N/A   25C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:BA:00.0 Off |                    0 |
| N/A   23C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DB:00.0 Off |                    0 |
| N/A   22C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

ibv_devinfo

hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0059:910c
        sys_image_guid:                 a088:c203:0059:910c
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0050:a72c
        sys_image_guid:                 a088:c203:0050:a72c
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_2
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007e:1dba
        sys_image_guid:                 a088:c203:007e:1dba
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_3
        transport:                      InfiniBand (0)
        fw_ver:                         16.35.4030
        node_guid:                      e8eb:d303:0055:750a
        sys_image_guid:                 e8eb:d303:0055:750a
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000425
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_4
        transport:                      InfiniBand (0)
        fw_ver:                         16.35.4030
        node_guid:                      e8eb:d303:0055:750b
        sys_image_guid:                 e8eb:d303:0055:750a
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000425
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_5
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0060:27e6
        sys_image_guid:                 a088:c203:0060:27e6
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_6
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007e:1c3a
        sys_image_guid:                 a088:c203:007e:1c3a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_7
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:0060:2b1e
        sys_image_guid:                 a088:c203:0060:2b1e
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_8
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007d:ab62
        sys_image_guid:                 a088:c203:007d:ab62
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_9
        transport:                      InfiniBand (0)
        fw_ver:                         28.42.1000
        node_guid:                      a088:c203:007d:ab9a
        sys_image_guid:                 a088:c203:007d:ab9a
        vendor_id:                      0x02c9
        vendor_part_id:                 4129
        hw_ver:                         0x0
        board_id:                       MT_0000000838
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
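
To make the listing above easier to scan, here is a minimal sketch that extracts the port state, active MTU, and link layer for each device. `summarize_ibv_devinfo` and `SAMPLE` are hypothetical names, the sample text is two entries abbreviated from the output above, and the parser assumes the plain `ibv_devinfo` text layout shown here.

```python
def summarize_ibv_devinfo(text: str) -> list[dict]:
    """Collect hca_id, state, active_mtu and link_layer for each device block."""
    wanted = {"state", "active_mtu", "link_layer"}
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hca_id:"):
            # A new device block starts at every hca_id line.
            current = {"hca_id": line.split(":", 1)[1].strip()}
            devices.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, since GUID values contain colons.
            key, value = (part.strip() for part in line.split(":", 1))
            if key in wanted:
                current[key] = value
    return devices


# Two entries abbreviated from the ibv_devinfo output in this issue.
SAMPLE = """\
hca_id: mlx5_0
        fw_ver:                         28.42.1000
                port:   1
                        state:                  PORT_ACTIVE (4)
                        active_mtu:             4096 (5)
                        link_layer:             Ethernet

hca_id: mlx5_4
        fw_ver:                         16.35.4030
                port:   1
                        state:                  PORT_DOWN (1)
                        active_mtu:             1024 (3)
                        link_layer:             Ethernet
"""

if __name__ == "__main__":
    for dev in summarize_ibv_devinfo(SAMPLE):
        print(dev["hca_id"], dev["state"], dev["active_mtu"], dev["link_layer"])
```

On the full output, this kind of summary highlights that mlx5_4 is PORT_DOWN and that mlx5_3 is active at an MTU of 1024, while the other ports are active at 4096.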
