Issue Description
I'm working on an H100 GPU cluster with RoCE drivers properly installed on the network interface cards.
The test_intranode.py script runs successfully and produces the expected results, but tests/test_low_latency.py consistently fails during NVSHMEM initialization.
Technical Details:
- NVSHMEM version installed: 3.1.7-1 (following the README instructions)
- Suspected compatibility issue: Potential mismatch between NVSHMEM version and RoCE configuration
I would greatly appreciate any assistance or insights to resolve this. Below are the specific error messages for reference:
Actual Result
(base) root@ubuntu:/work/DeepEP-main# python tests/test_low_latency.py
local_rank:1, ip:127.0.0.1 port:3004
world_size:2, rank:1
local_rank:0, ip:127.0.0.1 port:3004
world_size:2, rank:0
setting....
setting....
setted...
rank:0, num_ranks:2
Allocating buffer size: 2116.292096 MB ...
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed
ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed
ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 transport create connect failed
transport create connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7 transport create connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110
ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7
nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7
ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074:
nvshmem initialization failed, exiting
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1499: non-zero status: 7
transport create connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/transport/transport.cpp:400: non-zero status: 7 connect EPS failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
W0303 20:33:45.904000 12712 site-packages/torch/multiprocessing/spawn.py:169] Terminating process 12777 via signal SIGTERM
Traceback (most recent call last):
File "/work/DeepEP-main/tests/test_low_latency.py", line 164, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 204, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 255
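Because both ranks write to the same terminal, the log above repeats and interleaves the same failures. The sketch below collapses it into unique failure sites so the first error stands out. The helper name `summarize_failures` and the `SAMPLE_LOG` excerpt are illustrative only and are not part of DeepEP or NVSHMEM. If the status 110 returned by ibv_modify_qp is a Linux errno value, it corresponds to ETIMEDOUT; the log alone does not confirm that interpretation.

```python
import re
from collections import Counter

# Matches "<path>:<line>: non-zero status: <code>" as printed in the log above.
STATUS_RE = re.compile(r"(\S+?):(\d+): non-zero status: (\d+)")

# A short excerpt of the log above, used only to demonstrate the parser.
SAMPLE_LOG = """\
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 /work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:417: non-zero status: 110 ibv_modify_qp failed
ibv_modify_qp failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1432: non-zero status: 7 ep_connect failed
/work/nvshmem_src_3.1.7-1/nvshmem_src/src/host/init/init.cu:1001: non-zero status: 7 nvshmem setup connections failed
"""


def summarize_failures(log_text):
    """Count each unique (source file, line, status) site, in first-seen order."""
    counts = Counter()
    for path, line, status in STATUS_RE.findall(log_text):
        # Keep only the file name so the summary stays short.
        counts[(path.rsplit("/", 1)[-1], int(line), int(status))] += 1
    return counts


if __name__ == "__main__":
    for (name, line, status), n in summarize_failures(SAMPLE_LOG).items():
        print(f"{name}:{line} status={status} seen {n}x")
```

Run against the full log, the first entry is ibrc.cpp:417 (ibv_modify_qp), and every later status 7 entry is a caller reporting that same failure further up the stack.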
Env
Ubuntu 22.04
Linux ubuntu 5.15.0-25-generic #25-Ubuntu SMP Wed Mar 30 15:54:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 Off | 00000000:18:00.0 Off | 0 |
| N/A 22C P0 68W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 Off | 00000000:2A:00.0 Off | 0 |
| N/A 25C P0 71W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 Off | 00000000:3A:00.0 Off | 0 |
| N/A 24C P0 69W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 Off | 00000000:5D:00.0 Off | 0 |
| N/A 22C P0 70W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 Off | 00000000:9A:00.0 Off | 0 |
| N/A 24C P0 71W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 Off | 00000000:AB:00.0 Off | 0 |
| N/A 25C P0 72W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 Off | 00000000:BA:00.0 Off | 0 |
| N/A 23C P0 70W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 Off | 00000000:DB:00.0 Off | 0 |
| N/A 22C P0 69W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0059:910c
sys_image_guid: a088:c203:0059:910c
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0050:a72c
sys_image_guid: a088:c203:0050:a72c
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_2
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007e:1dba
sys_image_guid: a088:c203:007e:1dba
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_3
transport: InfiniBand (0)
fw_ver: 16.35.4030
node_guid: e8eb:d303:0055:750a
sys_image_guid: e8eb:d303:0055:750a
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000425
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_4
transport: InfiniBand (0)
fw_ver: 16.35.4030
node_guid: e8eb:d303:0055:750b
sys_image_guid: e8eb:d303:0055:750a
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000425
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_5
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0060:27e6
sys_image_guid: a088:c203:0060:27e6
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_6
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007e:1c3a
sys_image_guid: a088:c203:007e:1c3a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_7
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:0060:2b1e
sys_image_guid: a088:c203:0060:2b1e
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_8
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007d:ab62
sys_image_guid: a088:c203:007d:ab62
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_9
transport: InfiniBand (0)
fw_ver: 28.42.1000
node_guid: a088:c203:007d:ab9a
sys_image_guid: a088:c203:007d:ab9a
vendor_id: 0x02c9
vendor_part_id: 4129
hw_ver: 0x0
board_id: MT_0000000838
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
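The ibv_devinfo listing is long, so a small parser makes it easier to see which devices differ from the rest. This is a minimal sketch that assumes the `key: value` layout shown above; `parse_ibv_devinfo`, `flag_outliers`, and the `expected_mtu` parameter are hypothetical names introduced here, and the sample text is a trimmed excerpt of three devices.

```python
# Trimmed excerpt of the ibv_devinfo output above (three of the ten devices).
SAMPLE_DEVINFO = """\
hca_id: mlx5_0
    state: PORT_ACTIVE (4)
    active_mtu: 4096 (5)
    link_layer: Ethernet
hca_id: mlx5_3
    state: PORT_ACTIVE (4)
    active_mtu: 1024 (3)
    link_layer: Ethernet
hca_id: mlx5_4
    state: PORT_DOWN (1)
    active_mtu: 1024 (3)
    link_layer: Ethernet
"""

FIELDS = ("state", "active_mtu", "link_layer", "fw_ver", "vendor_part_id")


def parse_ibv_devinfo(text):
    """Return one dict per device, keeping only the fields listed in FIELDS."""
    devices = []
    current = None
    for raw in text.splitlines():
        # Split on the first colon only, since GUID values contain colons.
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            current = {"hca_id": value}
            devices.append(current)
        elif current is not None and key in FIELDS:
            current[key] = value
    return devices


def flag_outliers(devices, expected_mtu="4096"):
    """List devices whose port is not active or whose active MTU differs."""
    notes = []
    for dev in devices:
        state = dev.get("state", "unknown")
        mtu = dev.get("active_mtu", "unknown")
        if not state.startswith("PORT_ACTIVE"):
            notes.append((dev["hca_id"], f"port not active: {state}"))
        elif not mtu.startswith(expected_mtu):
            notes.append((dev["hca_id"], f"active_mtu is {mtu}"))
    return notes


if __name__ == "__main__":
    for hca, note in flag_outliers(parse_ibv_devinfo(SAMPLE_DEVINFO)):
        print(hca, "->", note)
```

On the full listing this would flag mlx5_3 (active_mtu 1024) and mlx5_4 (PORT_DOWN), the two devices with vendor_part_id 4119 and firmware 16.35.4030. Whether NVSHMEM actually selected either of them is not shown by the log. If they turn out to be involved, NVSHMEM's documented `NVSHMEM_HCA_LIST` and `NVSHMEM_IB_GID_INDEX` environment variables may be worth checking; that is an assumption about the cause, not something confirmed here.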