Your current environment
```bash
CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 \
  --port 8000 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.8 \
  --served-model-name "glm4" \
  --enable-auto-tool-choice \
  --tool-call-parser pythonic \
  --trust-remote-code
```
This command fails with the error below. Versions: vllm == 0.8.4, transformers == 4.51.3, running on 2× H100 80GB.
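For reference, the same configuration can also be exercised without the OpenAI server through the offline `LLM` API. This is only a minimal sketch mirroring the serve flags above (same model path, two visible GPUs, V0 engine); server-only options such as `--enable-auto-tool-choice`, `--tool-call-parser`, and `--served-model-name` have no offline equivalent and are omitted:

```python
# Minimal offline reproduction sketch mirroring the serve command above.
import os
os.environ["VLLM_USE_V1"] = "0"            # force the V0 engine, as in the serve command
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/home/lc/work/models/GLM-4-32B-0414",
    trust_remote_code=True,
    max_model_len=32768,
    tensor_parallel_size=2,
    gpu_memory_utilization=0.8,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```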
🐛 Describe the bug
(llm) (base) lc@ai-h100:~/work/vllm$ CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 --port 8000 --trust-remote-code --max-model-len 32768 --tensor-parallel-size 2 --gpu_memory_utilization 0.8 --served-model-name "glm4" --enable-auto-tool-choice --tool-call-parser pythonic --trust-remote-code
INFO 04-17 02:04:28 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:28 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:30 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-17 02:04:30 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/home/lc/work/models/GLM-4-32B-0414', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='pythonic', tool_parser_plugin='', model='/home/lc/work/models/GLM-4-32B-0414', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['glm4'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x70a76c095260>)
INFO 04-17 02:04:37 [config.py:689] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-17 02:04:37 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-17 02:04:37 [api_server.py:246] Started engine process with PID 191738
INFO 04-17 02:04:41 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:41 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:43 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/home/lc/work/models/GLM-4-32B-0414', speculative_config=None, tokenizer='/home/lc/work/models/GLM-4-32B-0414', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=glm4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 04-17 02:04:43 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-17 02:04:44 [cuda.py:292] Using Flash Attention backend.
INFO 04-17 02:04:47 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:48 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:49 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:50 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_838735c5'), local_subscribe_addr='ipc:///tmp/f2503ef8-ae25-4ae3-945c-1f520c747109', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-17 02:04:53 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:00<00:12, 1.08it/s]
Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:01<00:11, 1.04it/s]
Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:02<00:07, 1.41it/s]
Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:03<00:08, 1.24it/s]
Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:04<00:07, 1.19it/s]
Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:05<00:06, 1.16it/s]
Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:06<00:06, 1.10it/s]
Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:07<00:05, 1.06it/s]
Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:08<00:04, 1.04it/s]
Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:09<00:03, 1.06it/s]
Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:09<00:02, 1.10it/s]
Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:10<00:01, 1.08it/s]
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:11<00:00, 1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00, 1.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00, 1.09it/s]
INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.85 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.86 seconds
INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.050829 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.056882 seconds
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] self.model_runner.profile_run()
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_states, residual = layer(positions, hidden_states, residual)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] x, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] return F.linear(x, layer.weight, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] Traceback (most recent call last):
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-17 02:05:08 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-17 02:05:08 [engine.py:448] return cls(
ERROR 04-17 02:05:08 [engine.py:448] ^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-17 02:05:08 [engine.py:448] self.engine = LLMEngine(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-17 02:05:08 [engine.py:448] self._initialize_kv_caches()
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-17 02:05:08 [engine.py:448] self.model_executor.determine_num_available_blocks())
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448] results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-17 02:05:08 [engine.py:448] return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-17 02:05:08 [engine.py:448] driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448] self.model_runner.profile_run()
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-17 02:05:08 [engine.py:448] self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-17 02:05:08 [engine.py:448] self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448] return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-17 02:05:08 [engine.py:448] hidden_or_intermediate_states = model_executable(
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
ERROR 04-17 02:05:08 [engine.py:448] hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-17 02:05:08 [engine.py:448] return self.forward(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
ERROR 04-17 02:05:08 [engine.py:448] hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
ERROR 04-17 02:05:08 [engine.py:448] hidden_states = self.mlp(hidden_states)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
ERROR 04-17 02:05:08 [engine.py:448] x, _ = self.gate_up_proj(x)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448] return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448] return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 04-17 02:05:08 [engine.py:448] output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
ERROR 04-17 02:05:08 [engine.py:448] return F.linear(x, layer.weight, bias)
ERROR 04-17 02:05:08 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Traceback (most recent call last):
File "/home/lc/anaconda3/envs/llm/bin/vllm", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 51, in main
args.dispatch_function(args)
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
return runner.run(wrapper())
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
async with build_async_engine_client(args) as engine_client:
File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
(llm) (base) lc@ai-h100:~/work/vllm$
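The traceback shows `gate_up_proj` (a parallel linear layer) receiving a tuple instead of a Tensor inside the GLM-4 decoder layer's MLP call, so something earlier in `glm4.py` appears to be passing a `(hidden_states, residual)` pair where only the hidden states are expected. As a hedged illustration of that failure mode only (not the confirmed root cause or the vLLM fix), the snippet below shows how feeding such a tuple into `F.linear` reproduces exactly this `TypeError`:

```python
# Illustration of the failure mode: a norm layer that returns (hidden_states, residual)
# when given a residual (as vLLM's RMSNorm does) will trigger this exact TypeError if
# its tuple output is passed straight into a linear projection.
import torch
import torch.nn.functional as F

hidden = torch.randn(4, 8)
residual = torch.randn(4, 8)
norm_out = (hidden, residual)      # tuple output, e.g. norm(hidden, residual)
weight = torch.randn(16, 8)

try:
    F.linear(norm_out, weight)     # what gate_up_proj effectively receives here
except TypeError as e:
    print(e)  # linear(): argument 'input' (position 1) must be Tensor, not tuple

F.linear(norm_out[0], weight)      # unpacking the hidden states first works fine
```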