
[Bug]: Since version 0.9.2 ships with NCCL built in, running over PCIe causes system errors. How can NCCL be disabled in vLLM for versions after 0.9.2? #26607

@tina0852

Description

Your current environment

The output of `python collect_env.py`

[Screenshot of the `collect_env.py` output attached.]

🐛 Describe the bug

sh 06_startVllmAPI.sh
INFO 09-30 10:30:16 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1599676) INFO 09-30 10:30:17 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=1599676) INFO 09-30 10:30:17 [utils.py:328] non-default args: {'port': 6006, 'model': './autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', 'tokenizer': './autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', 'trust_remote_code': True, 'dtype': 'bfloat16', 'served_model_name': ['Qwen2.5-72B-GeoGPT'], 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.5}
(APIServer pid=1599676) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1599676) INFO 09-30 10:30:24 [__init__.py:742] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1599676) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=1599676) INFO 09-30 10:30:24 [__init__.py:1815] Using max model len 131072
(APIServer pid=1599676) INFO 09-30 10:30:24 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 09-30 10:30:29 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=1600151) INFO 09-30 10:30:31 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=1600151) INFO 09-30 10:30:31 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='./autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', speculative_config=None, tokenizer='./autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen2.5-72B-GeoGPT, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=1600151) WARNING 09-30 10:30:31 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=1600151) INFO 09-30 10:30:31 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 16777216, 10, 'psm_7e0498ff'), local_subscribe_addr='ipc:///tmp/33a7ec3b-72b3-4984-9ed3-6fc1fb572c4a', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:40 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1413bf45'), local_subscribe_addr='ipc:///tmp/a417b752-641f-4aae-8394-1b6890f41865', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:40 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_86d0ac7d'), local_subscribe_addr='ipc:///tmp/40c402a5-5da9-4149-9821-be9f6d8ad6b7', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_300b46ad'), local_subscribe_addr='ipc:///tmp/8b89fe13-abba-4f19-a431-d71b4693c686', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f0014612'), local_subscribe_addr='ipc:///tmp/a39f19a5-21d1-468a-b2b4-c0e801ba48d6', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_aa19af51'), local_subscribe_addr='ipc:///tmp/3b678b21-f56e-47ec-bebe-3eacdee7077b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_27518fca'), local_subscribe_addr='ipc:///tmp/2826b2ab-cb62-4f60-9f1f-bcec16a9fc9f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2406d3ac'), local_subscribe_addr='ipc:///tmp/55ce5ab8-a2e6-4a89-b08c-290e36e533fd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_98467391'), local_subscribe_addr='ipc:///tmp/912f36f5-4c5d-49d7-85cc-bd2132307955', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W930 10:30:42.426405084 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:42.623626057 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.658089176 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.664745472 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.673248881 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.715528862 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.767965371 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.773198530 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
ERROR 09-30 10:30:43 [multiproc_executor.py:585] WorkerProc failed to start.
ERROR 09-30 10:30:43 [multiproc_executor.py:585] Traceback (most recent call last):
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main
ERROR 09-30 10:30:43 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs)
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 420, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.worker.init_device()
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 611, in init_device
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.worker.init_device() # type: ignore
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 193, in init_device
ERROR 09-30 10:30:43 [multiproc_executor.py:585] init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 692, in init_worker_distributed_environment
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ensure_model_parallel_initialized(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1185, in ensure_model_parallel_initialized
ERROR 09-30 10:30:43 [multiproc_executor.py:585] initialize_model_parallel(tensor_model_parallel_size,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1109, in initialize_model_parallel
ERROR 09-30 10:30:43 [multiproc_executor.py:585] _TP = init_model_parallel_group(group_ranks,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 883, in init_model_parallel_group
ERROR 09-30 10:30:43 [multiproc_executor.py:585] return GroupCoordinator(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 262, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.device_communicator = device_comm_cls(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 52, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.pynccl_comm = PyNcclCommunicator(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 106, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.all_reduce(data)
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 127, in all_reduce
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.nccl.ncclAllReduce(buffer_type(in_tensor.data_ptr()),
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 314, in ncclAllReduce
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 272, in NCCL_CHECK
ERROR 09-30 10:30:43 [multiproc_executor.py:585] raise RuntimeError(f"NCCL error: {error_str}")
ERROR 09-30 10:30:43 [multiproc_executor.py:585] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
[rank0]:[W930 10:30:44.306044047 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank6]:[W930 10:30:45.435765354 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52424, remote=[localhost]:57511): Connection reset by peer
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:679 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7395c197eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7395a55694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a933 (0x7395a556a933 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x7395a556b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x7395a556619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x739564a3db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x739547edbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7395c2c9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7395c2d29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[W930 10:30:45.443134778 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 6] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Connection reset by peer
[rank5]:[W930 10:30:46.675183583 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52462, remote=[localhost]:57511): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7bebaa37eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7beb8df694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a8cd (0x7beb8df6a8cd in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x7beb8df6b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x7beb8df6619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7beb4d43db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7beb308dbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7bebab69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7bebab729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[W930 10:30:46.682463569 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 5] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank1]:[W930 10:30:46.717334754 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52472, remote=[localhost]:57511): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x704aa297eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x704a8d1694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a8cd (0x704a8d16a8cd in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x704a8d16b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x704a8d16619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x704a4c63db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x704a2fadbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x704aaa69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x704aaa729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W930 10:30:46.724776102 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank2]:[W930 10:30:46.769994353 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52486, remote=[localhost]:57511): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7b47ba6d9eb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7b479e3694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a8cd (0x7b479e36a8cd in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x7b479e36b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x7b479e36619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7b475d83db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7b4740cdbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7b47bb89caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7b47bb929c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[W930 10:30:46.777868633 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank6]:[W930 10:30:46.443382116 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52424, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7395c197eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7395a55694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x7395a5569d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x7395a556b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x7395a556618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x739564a3db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x739547edbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7395c2c9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7395c2d29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[W930 10:30:46.450711936 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 6] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank5]:[W930 10:30:47.682704709 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52462, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7bebaa37eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7beb8df694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x7beb8df69d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x7beb8df6b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x7beb8df6618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7beb4d43db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7beb308dbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7bebab69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7bebab729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[W930 10:30:47.689803927 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 5] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W930 10:30:47.724923988 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52472, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x704aa297eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x704a8d1694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x704a8d169d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x704a8d16b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x704a8d16618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x704a4c63db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x704a2fadbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x704aaa69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x704aaa729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W930 10:30:47.731997748 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank2]:[W930 10:30:47.778032870 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52486, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7b47ba6d9eb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7b479e3694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x7b479e369d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x7b479e36b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x7b479e36618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7b475d83db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7b4740cdbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7b47bb89caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7b47bb929c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[W930 10:30:47.785116037 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] self._init_executor()
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] raise e from None
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=1600151) Process EngineCore_DP0:
(EngineCore_DP0 pid=1600151) Traceback (most recent call last):
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1600151) self.run()
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1600151) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=1600151) raise e
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=1600151) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1600151) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=1600151) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=1600151) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1600151) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=1600151) self._init_executor()
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=1600151) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=1600151) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=1600151) raise e from None
(EngineCore_DP0 pid=1600151) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1599676) Traceback (most recent call last):
(APIServer pid=1599676) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1599676) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 2011, in <module>
(APIServer pid=1599676) uvloop.run(run_server(args))
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1599676) return __asyncio.run(
(APIServer pid=1599676) ^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1599676) return runner.run(main)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1599676) return self._loop.run_until_complete(task)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1599676) return await main
(APIServer pid=1599676) ^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=1599676) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=1599676) async with build_async_engine_client(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1599676) return await anext(self.gen)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=1599676) async with build_async_engine_client_from_engine_args(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1599676) return await anext(self.gen)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1599676) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=1599676) return fn(*args, **kwargs)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=1599676) return cls(
(APIServer pid=1599676) ^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=1599676) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=1599676) return AsyncMPClient(*client_args)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=1599676) super().__init__(
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=1599676) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1599676) next(self.gen)
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=1599676) wait_for_engine_startup(
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=1599676) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1599676) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/home/aigeohub/.conda/envs/vllm00/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
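
The failing line is `RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)`. For reference, the sketch below shows how I can re-run the same launch script with NCCL debug logging enabled, plus the environment switches that are commonly suggested for PCIe-only hosts without NVLink. `NCCL_DEBUG`, `NCCL_P2P_DISABLE`, and `NCCL_SHM_DISABLE` are standard NCCL environment variables; whether any of them actually works around this error, or whether the bundled NCCL can be avoided entirely, is exactly what this issue is asking about, so treat this as a diagnostic sketch rather than a confirmed fix.

```bash
# Diagnostic sketch: re-run the same launch with NCCL debug logging,
# as suggested by the error message, and with P2P transfers disabled
# (a workaround often suggested for PCIe-only hosts). Not a confirmed fix.
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
# export NCCL_SHM_DISABLE=1   # try additionally if disabling P2P alone does not help
# If this vLLM build supports it, pointing vLLM at a system NCCL instead of
# the bundled one is another option (assumption, depends on the build):
# export VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
sh 06_startVllmAPI.sh
```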

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
