Your current environment
ERROR 04-15 17:47:45 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-15 17:47:45 [core.py:387] File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-15 17:47:45 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-15 17:47:45 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-15 17:47:45 [core.py:387] File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 04-15 17:47:45 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-15 17:47:45 [core.py:387] File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-15 17:47:45 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 04-15 17:47:45 [core.py:387] File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
ERROR 04-15 17:47:45 [core.py:387] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-15 17:47:45 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-15 17:47:45 [core.py:387] File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-15 17:47:45 [core.py:387] output = self.collective_rpc("determine_available_memory")
ERROR 04-15 17:47:45 [core.py:387] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-15 17:47:45 [core.py:387] File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 04-15 17:47:45 [core.py:387] raise e
ERROR 04-15 17:47:45 [core.py:387] File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 04-15 17:47:45 [core.py:387] raise RuntimeError(
ERROR 04-15 17:47:45 [core.py:387] RuntimeError: ('Worker failed with error %s, please check the stack trace above for the root cause', 'TypeError <built-in function linear>: linear(): argument 'input' (position 1) must be Tensor, not tuple

from user code:
  File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 360, in forward
    hidden_states, residual = layer(positions, hidden_states, residual)
  File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/model_executor/models/glm4.py", line 204, in forward
    hidden_states = self.mlp(hidden_states)
  File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 92, in forward
    x, _ = self.gate_up_proj(x)
  File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 474, in forward
    output_parallel = self.quant_method.apply(self, input_, bias)
  File "/opt/miniconda3/envs/vLLM-3.12/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 191, in apply
    return F.linear(x, layer.weight, bias)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
  import torch._dynamo
  torch._dynamo.config.suppress_errors = True
')
ERROR 04-15 17:47:45 [core.py:387]
CRITICAL 04-15 17:47:45 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
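
The end of the traceback itself points at a stopgap: let TorchDynamo fall back to eager execution instead of raising. A minimal sketch of that suppression, assuming it runs in the Python process before the engine is built; it only hides the compile failure in the GLM-4 MLP path and does not fix the root cause:

import torch._dynamo

# Fall back to eager execution when Dynamo compilation fails,
# as suggested at the end of the error message above.
torch._dynamo.config.suppress_errors = True
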
Environment: Python 3.12 + CUDA 12.6 + NVIDIA driver 560.35.05
(vLLM-3.12) root@vllm-h3c-ubuntu:/data# pip show flashinfer-python
Name: flashinfer-python
Version: 0.2.5
(vLLM-3.12) root@vllm-h3c-ubuntu:/data# pip show vllm
Name: vllm
Version: 0.8.4
python3 -m vllm.entrypoints.openai.api_server \
--model /data/modelsfiles/GLM-Z1-32B-0414 \
--served-model-name GLM-Z1-32B-0414 \
--api-key 5fb39e308f50212515fde13b7d20f14d \
--tensor-parallel-size 4 \
--max-model-len 16384 \
--gpu-memory-utilization 0.95 \
--enable-prefix-caching \
--enable-chunked-prefill \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
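
To isolate the failure from the API server, one quick check is to load the same model through vLLM's offline LLM API with enforce_eager=True, which skips torch.compile altogether. A minimal sketch, assuming the same model path and parallelism as the command above (argument names mirror the CLI flags):

from vllm import LLM, SamplingParams

# Same model and parallelism as the api_server command above.
# enforce_eager=True disables torch.compile / Dynamo, the stage where the
# "linear(): argument 'input' must be Tensor, not tuple" error is raised.
llm = LLM(
    model="/data/modelsfiles/GLM-Z1-32B-0414",
    tensor_parallel_size=4,
    max_model_len=16384,
    gpu_memory_utilization=0.95,
    enforce_eager=True,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Hello, who are you?"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)

If this generates cleanly in eager mode, the problem is confined to the compiled path for the GLM-4 layers.
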
How would you like to use vllm
I want to run inference of GLM-Z1-32B-0414 (launched with the command above), but the engine crashes at startup with the Dynamo error shown in the traceback, and I don't know how to get it working with vllm.
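
Once the server from the command above starts, the model can be queried through the OpenAI-compatible endpoint. A minimal sketch, assuming the official openai Python client and that the client runs on the same host as the server:

from openai import OpenAI

# Point the standard OpenAI client at the vLLM server started above.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="5fb39e308f50212515fde13b7d20f14d",  # the value passed via --api-key
)

resp = client.chat.completions.create(
    model="GLM-Z1-32B-0414",  # must match --served-model-name
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
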