test_low_latency failed #55

@hyesung84

NVSHMEM fails to initialize when I run test_low_latency. The log shows that no transport can be initialized (both ibrc and IBGDA fail), after which NVSHMEM reports that topology detection failed and exits. test_intranode.py, however, passes on the same machine.
How can I resolve this?

System Information
GPU Model: H100 (8 GPUs, single node)
OS: Ubuntu 22.04
CUDA Version: 12.5
NVSHMEM Version: 3.2.5
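
Both transports named in the error log (ibrc and IBGDA) are InfiniBand-based, so one piece of system information that may be relevant is whether the node exposes any RDMA devices at all. The following is a minimal sketch for collecting that, assuming a Linux host where the kernel lists such devices under /sys/class/infiniband. The helper name `list_ib_devices` is hypothetical, and an empty result is only a hint, not a confirmed diagnosis.

```python
import os


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return the sorted RDMA device names found under sysfs_root.

    Returns an empty list when the directory does not exist, which is
    what happens on hosts with no InfiniBand/RoCE driver loaded.
    """
    try:
        return sorted(os.listdir(sysfs_root))
    except (FileNotFoundError, NotADirectoryError):
        return []


if __name__ == "__main__":
    devices = list_ib_devices()
    if devices:
        print("RDMA devices:", ", ".join(devices))
    else:
        print("No RDMA devices visible under /sys/class/infiniband")
```

Running this inside the same container or environment as the test matters, because devices visible on the host are not necessarily passed through to a container.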

Error Log

WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA

/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed


WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA

/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed

/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting



/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed

/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting


/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting 
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed 

/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting 

W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22985 via signal SIGTERM
W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22987 via signal SIGTERM
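
Because several processes write to stderr at once, their messages are interleaved and sometimes fused onto a single line, which makes the log hard to read by eye. The sketch below tallies each distinct message by substring count, so fused lines are still counted correctly. The message strings are copied from the log above; the function name `tally_messages` and the log file name in the usage example are hypothetical.

```python
# Distinct messages that appear in the log above.
MESSAGES = [
    "init failed for remote transport: ibrc",
    "init failed for transport: IBGDA",
    "Unable to initialize any transports",
    "nvshmem detect topo failed",
    "nvshmem initialization failed, exiting",
]


def tally_messages(log_text):
    """Count occurrences of each known message in log_text.

    Substring counting is used instead of line-by-line matching so that
    messages from different processes fused onto one line are all counted.
    """
    return {message: log_text.count(message) for message in MESSAGES}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python tally.py error.log
    with open(sys.argv[1], encoding="utf-8") as handle:
        for message, count in tally_messages(handle.read()).items():
            print(f"{count:3d}  {message}")
```

If every rank fails at the same step, the counts should be roughly uniform across message types, which helps distinguish a node-wide configuration problem from a failure on a single GPU or process.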