
[Bug][Help]: I used vLLM 0.8.0 to deploy whisper-large-v3-turbo. The model is only 1.6 GB, and I am using an RTX 3060 with 12 GB of VRAM. However, right after the service started, it hit an out-of-memory (OOM) error. #15216

@liuzhipengchd

Description


Your current environment

🐛 Describe the bug

root@8b927132d0a8:~/whisper-large-v3-turbo# pip list |  grep vllm
vllm                              0.8.0
root@8b927132d0a8:~/whisper-large-v3-turbo# nvidia-smi
Thu Mar 20 05:18:02 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:03:00.0 Off |                  N/A |
| 30%   30C    P8              11W / 170W |      4MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@8b927132d0a8:~/whisper-large-v3-turbo# ls -lh
total 1.6G
-rw-r--r-- 1 root root   21K Mar 18 20:26 README.md
-rw-r--r-- 1 root root   34K Mar 18 20:26 added_tokens.json
-rw-r--r-- 1 root root  1.3K Mar 18 20:26 config.json
-rw-r--r-- 1 root root    86 Mar 18 20:26 configuration.json
-rw-r--r-- 1 root root  3.7K Mar 18 20:26 generation_config.json
-rw-r--r-- 1 root root  483K Mar 18 20:26 merges.txt
-rw-r--r-- 1 root root  1.6G Mar 18 20:30 model.safetensors
-rw-r--r-- 1 root root   52K Mar 18 20:26 normalizer.json
-rw-r--r-- 1 root root   340 Mar 18 20:26 preprocessor_config.json
-rw-r--r-- 1 root root  2.2K Mar 18 20:26 special_tokens_map.json
-rw-r--r-- 1 root root  2.6M Mar 18 20:26 tokenizer.json
-rw-r--r-- 1 root root  277K Mar 18 20:26 tokenizer_config.json
-rw-r--r-- 1 root root 1013K Mar 18 20:26 vocab.json

Startup command:

python3 -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.5 --model /root/whisper-large-v3-turbo --served-model-name whisper-large-v3-turbo --task transcription --port 28037
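For context, if the server came up, the OpenAI-compatible transcription endpoint would be exercised roughly like this (a sketch for illustration only; localhost and sample.wav are placeholders, not taken from this report):

curl http://localhost:28037/v1/audio/transcriptions -F model=whisper-large-v3-turbo -F file=@sample.wav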

Error log:

WARNING 03-20 05:26:39 [arg_utils.py:1754] --task transcription is not supported by the V1 Engine. Falling back to V0. 
INFO 03-20 05:26:39 [api_server.py:241] Started engine process with PID 1454
INFO 03-20 05:26:43 [__init__.py:256] Automatically detected platform cuda.
INFO 03-20 05:26:44 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.0) with config: model='/root/whisper-large-v3-turbo', speculative_config=None, tokenizer='/root/whisper-large-v3-turbo', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=whisper-large-v3-turbo, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 03-20 05:26:46 [cuda.py:285] Using Flash Attention backend.
INFO 03-20 05:26:47 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-20 05:26:47 [model_runner.py:1110] Starting to load model /root/whisper-large-v3-turbo...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.01it/s]

INFO 03-20 05:26:48 [loader.py:429] Loading weights took 0.55 seconds
INFO 03-20 05:26:48 [model_runner.py:1146] Model loading took 1.5077 GB and 0.716281 seconds
INFO 03-20 05:26:49 [enc_dec_model_runner.py:278] Starting profile run for multi-modal models.
ERROR 03-20 05:26:59 [engine.py:448] CUDA out of memory. Tried to allocate 3.66 GiB. GPU 0 has a total capacity of 11.76 GiB of which 2.98 GiB is free. Process 32641 has 8.78 GiB memory in use. Of the allocated memory 8.49 GiB is allocated by PyTorch, and 161.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 03-20 05:26:59 [engine.py:448] Traceback (most recent call last):
ERROR 03-20 05:26:59 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 03-20 05:26:59 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 03-20 05:26:59 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
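
The allocator hint in the error message can be tried by exporting the suggested variable before launch. A minimal sketch of the same startup command with it set (whether this actually avoids the OOM here is not verified):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.5 --model /root/whisper-large-v3-turbo --served-model-name whisper-large-v3-turbo --task transcription --port 28037

Since the failure happens during the multi-modal profile run, lowering the scheduler batch with --max-num-seqs is another knob sometimes reduced on small GPUs (an assumption on my side, not something suggested in this report).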

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
