Open
Description
NVSHMEM fails to initialize with transport errors. According to the log, initialization fails for both the ibrc and IBGDA transports, so NVSHMEM cannot initialize any transport, and topology detection then fails with status 7. However, test_intranode.py passes.
How can I resolve this?
System Information
GPU Model: H100 (8 GPUs, single node)
OS: Ubuntu 22.04
CUDA Version: 12.5
NVSHMEM Version: 3.2.5
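The ibrc and IBGDA transports named in the error log are NVSHMEM's InfiniBand transports, so one thing worth ruling out is whether any RDMA device is visible to the process at all (for example, inside a container). The following is a minimal diagnostic sketch, not part of NVSHMEM or this repository. The function names are hypothetical, and it assumes a Linux host where RDMA devices appear under /sys/class/infiniband and loaded kernel modules are listed in /proc/modules. Whether a missing device or module is the actual cause here is an assumption that the log does not confirm.

```python
from pathlib import Path


def list_rdma_devices(sysfs_root="/sys/class/infiniband"):
    """Return the RDMA device names exposed by the kernel, or [] if none."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())


def loaded_modules(proc_modules="/proc/modules"):
    """Return loaded kernel module names (first column of /proc/modules)."""
    try:
        text = Path(proc_modules).read_text()
    except OSError:
        # File missing or unreadable: report no modules rather than crash.
        return set()
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def report(sysfs_root="/sys/class/infiniband", proc_modules="/proc/modules"):
    """Collect the two facts this sketch checks into one dictionary."""
    return {
        "rdma_devices": list_rdma_devices(sysfs_root),
        # nvidia_peermem is commonly used for GPUDirect RDMA; treating its
        # absence as relevant to this failure is an assumption.
        "nvidia_peermem_loaded": "nvidia_peermem" in loaded_modules(proc_modules),
    }


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

An empty `rdma_devices` list would be consistent with every InfiniBand transport failing to initialize, but it would not prove that this is the cause.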
Error Log
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting 
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting 
W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22985 via signal SIGTERM
W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22987 via signal SIGTERM
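Because several processes write to the same stream, messages in a log like this one can run together on a single line. A small script can tally each message type regardless of line breaks, which helps when comparing runs. This is a sketch only: the message fragments are copied from the log above, while the dictionary keys and the function name `summarize` are hypothetical.

```python
import re

# Message fragments taken verbatim from the log. Counting substrings rather
# than whole lines means fused lines that contain several messages are still
# counted once per message.
PATTERNS = {
    "ibrc_init_failed": "init failed for remote transport: ibrc",
    "ibgda_init_failed": "init failed for transport: IBGDA",
    "no_transport": "Unable to initialize any transports",
    "topo_failed": "nvshmem detect topo failed",
    "init_exit": "nvshmem initialization failed, exiting",
}


def summarize(log_text):
    """Return how many times each known message occurs in log_text."""
    return {
        name: len(re.findall(re.escape(fragment), log_text))
        for name, fragment in PATTERNS.items()
    }


# A short sample with two fused IBGDA messages on one line.
sample = (
    "WARN: init failed for remote transport: ibrc\n"
    "transport.cpp:nvshmemi_transport_init:275: "
    "transport.cpp:nvshmemi_transport_init:275: "
    "init failed for transport: IBGDAinit failed for transport: IBGDA\n"
    "nvshmem detect topo failed \n"
)
print(summarize(sample))
# {'ibrc_init_failed': 1, 'ibgda_init_failed': 2, 'no_transport': 0, 'topo_failed': 1, 'init_exit': 0}
```

To apply it to a real run, pass the captured stderr text to `summarize`. Equal counts for every message type would suggest that all processes fail in the same way.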