
[Bug]: vllm v0.8.4 serve glm-4-32b-0414 error #16740

@iwaitu

Description


Your current environment

CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414 \
  --port 8000 \
  --trust-remote-code \
  --max-model-len 32768 \
  --tensor-parallel-size 2 \
  --gpu_memory_utilization 0.8 \
  --served-model-name "glm4" \
  --enable-auto-tool-choice \
  --tool-call-parser pythonic

This command fails with the error below. Environment: vllm == 0.8.4, transformers == 4.51.3, 2x H100 80GB.
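For context on the TypeError in the traceback below: vLLM's tensor-parallel linear layers return a `(output, output_bias)` tuple from `forward`, so a call site that forwards the raw return value instead of unpacking it hands a tuple to the next `F.linear` call. The sketch below is a hypothetical, torch-free illustration of that failure pattern (`ParallelLinear` and `naive_linear` are stand-ins, not vLLM's actual classes), not a claim about the exact buggy line in glm4.py.

```python
class ParallelLinear:
    """Stand-in for a vLLM-style parallel linear layer: its call
    returns a (output, bias_or_none) tuple, not a bare tensor."""
    def __call__(self, x):
        return ([v * 2 for v in x], None)

def naive_linear(x):
    """Stand-in for F.linear: rejects anything that is not the
    expected tensor-like input, mimicking the reported error."""
    if isinstance(x, tuple):
        raise TypeError(
            "linear(): argument 'input' (position 1) must be Tensor, not tuple")
    return x

layer = ParallelLinear()

# Correct usage unpacks the tuple first, as `x, _ = self.gate_up_proj(x)` does:
out, _ = layer([1, 2, 3])
assert naive_linear(out) == [2, 4, 6]

# Forwarding the tuple unmodified reproduces the reported TypeError:
try:
    naive_linear(layer([1, 2, 3]))
except TypeError as e:
    print(e)
```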

🐛 Describe the bug

(llm) (base) lc@ai-h100:~/work/vllm$ CUDA_VISIBLE_DEVICES=0,1 VLLM_USE_V1=0 vllm serve /home/lc/work/models/GLM-4-32B-0414   --port 8000   --trust-remote-code   --max-model-len 32768   --tensor-parallel-size 2   --gpu_memory_utilization 0.8   --served-model-name "glm4"   --enable-auto-tool-choice   --tool-call-parser pythonic   --trust-remote-code
INFO 04-17 02:04:28 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:28 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:30 [api_server.py:1034] vLLM API server version 0.8.4
INFO 04-17 02:04:30 [api_server.py:1035] args: Namespace(subparser='serve', model_tag='/home/lc/work/models/GLM-4-32B-0414', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=True, tool_call_parser='pythonic', tool_parser_plugin='', model='/home/lc/work/models/GLM-4-32B-0414', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config=None, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=2, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.8, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, 
quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['glm4'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_chunked_mm_input=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x70a76c095260>)
INFO 04-17 02:04:37 [config.py:689] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 04-17 02:04:37 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-17 02:04:37 [api_server.py:246] Started engine process with PID 191738
INFO 04-17 02:04:41 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:41 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
INFO 04-17 02:04:43 [llm_engine.py:243] Initializing a V0 LLM engine (v0.8.4) with config: model='/home/lc/work/models/GLM-4-32B-0414', speculative_config=None, tokenizer='/home/lc/work/models/GLM-4-32B-0414', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=glm4, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
WARNING 04-17 02:04:43 [multiproc_worker_utils.py:306] Reducing Torch parallelism from 64 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 04-17 02:04:44 [cuda.py:292] Using Flash Attention backend.
INFO 04-17 02:04:47 [__init__.py:239] Automatically detected platform cuda.
WARNING 04-17 02:04:48 [cuda.py:409] Detected different devices in the system: NVIDIA H100 80GB HBM3, NVIDIA H100. Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:49 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:50 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
INFO 04-17 02:04:52 [utils.py:993] Found nccl from library libnccl.so.2
INFO 04-17 02:04:52 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/lc/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-17 02:04:53 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_838735c5'), local_subscribe_addr='ipc:///tmp/f2503ef8-ae25-4ae3-945c-1f520c747109', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-17 02:04:53 [parallel_state.py:959] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [parallel_state.py:959] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
(VllmWorkerProcess pid=191902) INFO 04-17 02:04:53 [model_runner.py:1110] Starting to load model /home/lc/work/models/GLM-4-32B-0414...
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:00<00:12,  1.08it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:01<00:11,  1.04it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:02<00:07,  1.41it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:03<00:08,  1.24it/s]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:04<00:07,  1.19it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:05<00:06,  1.16it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:06<00:06,  1.10it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:07<00:05,  1.06it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:08<00:04,  1.04it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:09<00:03,  1.06it/s]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:09<00:02,  1.10it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:10<00:01,  1.08it/s]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:11<00:00,  1.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00,  1.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00,  1.09it/s]

INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.85 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [loader.py:458] Loading weights took 12.86 seconds
INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.050829 seconds
(VllmWorkerProcess pid=191902) INFO 04-17 02:05:06 [model_runner.py:1146] Model loading took 30.4522 GiB and 13.056882 seconds
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     self.model_runner.profile_run()
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_states, residual = layer(positions, hidden_states, residual)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     x, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]     return F.linear(x, layer.weight, bias)
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=191902) ERROR 04-17 02:05:08 [multiproc_worker_utils.py:238] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] linear(): argument 'input' (position 1) must be Tensor, not tuple
ERROR 04-17 02:05:08 [engine.py:448] Traceback (most recent call last):
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-17 02:05:08 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 04-17 02:05:08 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-17 02:05:08 [engine.py:448]     return cls(
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-17 02:05:08 [engine.py:448]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-17 02:05:08 [engine.py:448]     self._initialize_kv_caches()
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-17 02:05:08 [engine.py:448]     self.model_executor.determine_num_available_blocks())
ERROR 04-17 02:05:08 [engine.py:448]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-17 02:05:08 [engine.py:448]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-17 02:05:08 [engine.py:448]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-17 02:05:08 [engine.py:448]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-17 02:05:08 [engine.py:448]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/utils.py", line 2378, in run_method
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-17 02:05:08 [engine.py:448]     self.model_runner.profile_run()
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-17 02:05:08 [engine.py:448]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-17 02:05:08 [engine.py:448]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-17 02:05:08 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-17 02:05:08 [engine.py:448]     hidden_or_intermediate_states = model_executable(
ERROR 04-17 02:05:08 [engine.py:448]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 285, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-17 02:05:08 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-17 02:05:08 [engine.py:448]     return self.forward(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-17 02:05:08 [engine.py:448]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
ERROR 04-17 02:05:08 [engine.py:448]     hidden_states = self.mlp(hidden_states)
ERROR 04-17 02:05:08 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
ERROR 04-17 02:05:08 [engine.py:448]     x, _ = self.gate_up_proj(x)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-17 02:05:08 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
ERROR 04-17 02:05:08 [engine.py:448]     output_parallel = self.quant_method.apply(self, input_, bias)
ERROR 04-17 02:05:08 [engine.py:448]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448]   File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
ERROR 04-17 02:05:08 [engine.py:448]     return F.linear(x, layer.weight, bias)
ERROR 04-17 02:05:08 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-17 02:05:08 [engine.py:448] TypeError: linear(): argument 'input' (position 1) must be Tensor, not tuple
Traceback (most recent call last):
  File "/home/lc/anaconda3/envs/llm/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/main.py", line 51, in main
    args.dispatch_function(args)
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 105, in run
    return runner.run(wrapper())
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/lc/anaconda3/envs/llm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
/home/lc/anaconda3/envs/llm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
(llm) (base) lc@ai-h100:~/work/vllm$ 

Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom right corner of the documentation page, which can answer many frequently asked questions.
