
[Bug]: Since version 0.9.2 ships with NCCL built in, running over PCIe causes system errors. How can NCCL be disabled in vLLM for versions after 0.9.2? #26607

@tina0852

Description

Your current environment

The output of `python collect_env.py`

[Screenshot of the `collect_env.py` output attached.]

🐛 Describe the bug

sh 06_startVllmAPI.sh
INFO 09-30 10:30:16 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1599676) INFO 09-30 10:30:17 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=1599676) INFO 09-30 10:30:17 [utils.py:328] non-default args: {'port': 6006, 'model': './autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', 'tokenizer': './autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', 'trust_remote_code': True, 'dtype': 'bfloat16', 'served_model_name': ['Qwen2.5-72B-GeoGPT'], 'tensor_parallel_size': 8, 'gpu_memory_utilization': 0.5}
(APIServer pid=1599676) The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1599676) INFO 09-30 10:30:24 [__init__.py:742] Resolved architecture: Qwen2ForCausalLM
(APIServer pid=1599676) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=1599676) INFO 09-30 10:30:24 [__init__.py:1815] Using max model len 131072
(APIServer pid=1599676) INFO 09-30 10:30:24 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 09-30 10:30:29 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=1600151) INFO 09-30 10:30:31 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=1600151) INFO 09-30 10:30:31 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='./autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', speculative_config=None, tokenizer='./autodl-tmp/modelscope/models/GeoGPT/Qwen2.5-72B-GeoGPT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen2.5-72B-GeoGPT, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=1600151) WARNING 09-30 10:30:31 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=1600151) INFO 09-30 10:30:31 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3, 4, 5, 6, 7], buffer_handle=(8, 16777216, 10, 'psm_7e0498ff'), local_subscribe_addr='ipc:///tmp/33a7ec3b-72b3-4984-9ed3-6fc1fb572c4a', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:35 [__init__.py:216] Automatically detected platform cuda.
INFO 09-30 10:30:40 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1413bf45'), local_subscribe_addr='ipc:///tmp/a417b752-641f-4aae-8394-1b6890f41865', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:40 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_86d0ac7d'), local_subscribe_addr='ipc:///tmp/40c402a5-5da9-4149-9821-be9f6d8ad6b7', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_300b46ad'), local_subscribe_addr='ipc:///tmp/8b89fe13-abba-4f19-a431-d71b4693c686', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f0014612'), local_subscribe_addr='ipc:///tmp/a39f19a5-21d1-468a-b2b4-c0e801ba48d6', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_aa19af51'), local_subscribe_addr='ipc:///tmp/3b678b21-f56e-47ec-bebe-3eacdee7077b', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_27518fca'), local_subscribe_addr='ipc:///tmp/2826b2ab-cb62-4f60-9f1f-bcec16a9fc9f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_2406d3ac'), local_subscribe_addr='ipc:///tmp/55ce5ab8-a2e6-4a89-b08c-290e36e533fd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-30 10:30:41 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_98467391'), local_subscribe_addr='ipc:///tmp/912f36f5-4c5d-49d7-85cc-bd2132307955', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W930 10:30:42.426405084 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:42.623626057 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.658089176 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.664745472 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.673248881 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.715528862 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.767965371 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W930 10:30:43.773198530 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-30 10:30:43 [__init__.py:1433] Found nccl from library libnccl.so.2
INFO 09-30 10:30:43 [pynccl.py:70] vLLM is using nccl==2.27.3
ERROR 09-30 10:30:43 [multiproc_executor.py:585] WorkerProc failed to start.
ERROR 09-30 10:30:43 [multiproc_executor.py:585] Traceback (most recent call last):
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 559, in worker_main
ERROR 09-30 10:30:43 [multiproc_executor.py:585] worker = WorkerProc(*args, **kwargs)
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 420, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.worker.init_device()
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 611, in init_device
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.worker.init_device() # type: ignore
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 193, in init_device
ERROR 09-30 10:30:43 [multiproc_executor.py:585] init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 692, in init_worker_distributed_environment
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ensure_model_parallel_initialized(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1185, in ensure_model_parallel_initialized
ERROR 09-30 10:30:43 [multiproc_executor.py:585] initialize_model_parallel(tensor_model_parallel_size,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 1109, in initialize_model_parallel
ERROR 09-30 10:30:43 [multiproc_executor.py:585] _TP = init_model_parallel_group(group_ranks,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 883, in init_model_parallel_group
ERROR 09-30 10:30:43 [multiproc_executor.py:585] return GroupCoordinator(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 262, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.device_communicator = device_comm_cls(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 52, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.pynccl_comm = PyNcclCommunicator(
ERROR 09-30 10:30:43 [multiproc_executor.py:585] ^^^^^^^^^^^^^^^^^^^
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 106, in __init__
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.all_reduce(data)
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl.py", line 127, in all_reduce
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.nccl.ncclAllReduce(buffer_type(in_tensor.data_ptr()),
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 314, in ncclAllReduce
ERROR 09-30 10:30:43 [multiproc_executor.py:585] self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
ERROR 09-30 10:30:43 [multiproc_executor.py:585] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 272, in NCCL_CHECK
ERROR 09-30 10:30:43 [multiproc_executor.py:585] raise RuntimeError(f"NCCL error: {error_str}")
ERROR 09-30 10:30:43 [multiproc_executor.py:585] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
INFO 09-30 10:30:43 [multiproc_executor.py:546] Parent process exited, terminating worker
[rank0]:[W930 10:30:44.306044047 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank6]:[W930 10:30:45.435765354 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52424, remote=[localhost]:57511): Connection reset by peer
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:679 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7395c197eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7395a55694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a933 (0x7395a556a933 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x7395a556b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x7395a556619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x739564a3db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x739547edbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7395c2c9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7395c2d29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[W930 10:30:45.443134778 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 6] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Connection reset by peer
[rank5]:[W930 10:30:46.675183583 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52462, remote=[localhost]:57511): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7bebaa37eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7beb8df694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a8cd (0x7beb8df6a8cd in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x7beb8df6b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x7beb8df6619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7beb4d43db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7beb308dbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7bebab69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7bebab729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[W930 10:30:46.682463569 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 5] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank1]:[W930 10:30:46.717334754 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52472, remote=[localhost]:57511): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x704aa297eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x704a8d1694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a8cd (0x704a8d16a8cd in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x704a8d16b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x704a8d16619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x704a4c63db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x704a2fadbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x704aaa69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x704aaa729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W930 10:30:46.724776102 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank2]:[W930 10:30:46.769994353 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=119, addr=[localhost]:52486, remote=[localhost]:57511): Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
Exception raised from recvBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:682 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7b47ba6d9eb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7b479e3694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d6a8cd (0x7b479e36a8cd in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b47a (0x7b479e36b47a in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x31e (0x7b479e36619e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7b475d83db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7b4740cdbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7b47bb89caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7b47bb929c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[W930 10:30:46.777868633 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Failed to recv, got 0 bytes. Connection was likely closed. Did the remote server shutdown or crash?
[rank6]:[W930 10:30:46.443382116 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52424, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7395c197eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7395a55694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x7395a5569d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x7395a556b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x7395a556618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x739564a3db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x739547edbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7395c2c9caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7395c2d29c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[W930 10:30:46.450711936 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 6] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank5]:[W930 10:30:47.682704709 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52462, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7bebaa37eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7beb8df694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x7beb8df69d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x7beb8df6b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x7beb8df6618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7beb4d43db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7beb308dbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7bebab69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7bebab729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[W930 10:30:47.689803927 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 5] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank1]:[W930 10:30:47.724923988 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52472, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x704aa297eeb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x704a8d1694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x704a8d169d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x704a8d16b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x704a8d16618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x704a4c63db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x704a2fadbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x704aaa69caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x704aaa729c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[W930 10:30:47.731997748 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank2]:[W930 10:30:47.778032870 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=119, addr=[localhost]:52486, remote=[localhost]:57511): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7b47ba6d9eb0 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: + 0x5d694d1 (0x7b479e3694d1 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: + 0x5d69d62 (0x7b479e369d62 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: + 0x5d6b86e (0x7b479e36b86e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::allocator<std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > const&) + 0x30e (0x7b479e36618e in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::HeartbeatMonitor::runLoop() + 0x398 (0x7b475d83db18 in /home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xdbbf4 (0x7b4740cdbbf4 in /home/aigeohub/.conda/envs/vllm00/bin/../lib/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7b47bb89caa4 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7b47bb929c6c in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[W930 10:30:47.785116037 ProcessGroupNCCL.cpp:1783] [PG ID 0 PG GUID 0 Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] self._init_executor()
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] raise e from None
(EngineCore_DP0 pid=1600151) ERROR 09-30 10:30:47 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=1600151) Process EngineCore_DP0:
(EngineCore_DP0 pid=1600151) Traceback (most recent call last):
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1600151) self.run()
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1600151) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=1600151) raise e
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=1600151) engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1600151) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=1600151) super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=1600151) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1600151) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=1600151) self._init_executor()
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 99, in _init_executor
(EngineCore_DP0 pid=1600151) self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=1600151) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1600151) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 497, in wait_for_ready
(EngineCore_DP0 pid=1600151) raise e from None
(EngineCore_DP0 pid=1600151) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1599676) Traceback (most recent call last):
(APIServer pid=1599676) File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1599676) File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 2011, in <module>
(APIServer pid=1599676) uvloop.run(run_server(args))
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1599676) return __asyncio.run(
(APIServer pid=1599676) ^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1599676) return runner.run(main)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1599676) return self._loop.run_until_complete(task)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1599676) return await main
(APIServer pid=1599676) ^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=1599676) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=1599676) async with build_async_engine_client(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1599676) return await anext(self.gen)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=1599676) async with build_async_engine_client_from_engine_args(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1599676) return await anext(self.gen)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1599676) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1589, in inner
(APIServer pid=1599676) return fn(*args, **kwargs)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 212, in from_vllm_config
(APIServer pid=1599676) return cls(
(APIServer pid=1599676) ^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=1599676) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=1599676) return AsyncMPClient(*client_args)
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=1599676) super().__init__(
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=1599676) with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1599676) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1599676) next(self.gen)
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=1599676) wait_for_engine_startup(
(APIServer pid=1599676) File "/home/aigeohub/.conda/envs/vllm00/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=1599676) raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1599676) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/home/aigeohub/.conda/envs/vllm00/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 4 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
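
The failing line is `RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)`. For reference, the sketch below shows how I can re-run the same launch script with NCCL debug logging enabled, plus the environment switches that are commonly suggested for PCIe-only hosts without NVLink. `NCCL_DEBUG`, `NCCL_P2P_DISABLE`, and `NCCL_SHM_DISABLE` are standard NCCL environment variables; whether any of them actually works around this error, or whether the bundled NCCL can be avoided entirely, is exactly what this issue is asking about, so treat this as a diagnostic sketch rather than a confirmed fix.

```bash
# Diagnostic sketch: re-run the same launch with NCCL debug logging,
# as suggested by the error message, and with P2P transfers disabled
# (a workaround often suggested for PCIe-only hosts). Not a confirmed fix.
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
# export NCCL_SHM_DISABLE=1   # try additionally if disabling P2P alone does not help
# If this vLLM build supports it, pointing vLLM at a system NCCL instead of
# the bundled one is another option (assumption, depends on the build):
# export VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
sh 06_startVllmAPI.sh
```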

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
