
[Bug][Help]: I used vLLM 0.8.0 to deploy whisper-large-v3-turbo. The model is only 1.6 GB, and I am using an RTX 3060 with 12 GB of VRAM. However, right after the service started, it hit an out-of-memory (OOM) error. #15216

@liuzhipengchd

Description


Your current environment

🐛 Describe the bug

root@8b927132d0a8:~/whisper-large-v3-turbo# pip list |  grep vllm
vllm                              0.8.0
root@8b927132d0a8:~/whisper-large-v3-turbo# nvidia-smi
Thu Mar 20 05:18:02 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:03:00.0 Off |                  N/A |
| 30%   30C    P8              11W / 170W |      4MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@8b927132d0a8:~/whisper-large-v3-turbo# ls -lh
total 1.6G
-rw-r--r-- 1 root root   21K Mar 18 20:26 README.md
-rw-r--r-- 1 root root   34K Mar 18 20:26 added_tokens.json
-rw-r--r-- 1 root root  1.3K Mar 18 20:26 config.json
-rw-r--r-- 1 root root    86 Mar 18 20:26 configuration.json
-rw-r--r-- 1 root root  3.7K Mar 18 20:26 generation_config.json
-rw-r--r-- 1 root root  483K Mar 18 20:26 merges.txt
-rw-r--r-- 1 root root  1.6G Mar 18 20:30 model.safetensors
-rw-r--r-- 1 root root   52K Mar 18 20:26 normalizer.json
-rw-r--r-- 1 root root   340 Mar 18 20:26 preprocessor_config.json
-rw-r--r-- 1 root root  2.2K Mar 18 20:26 special_tokens_map.json
-rw-r--r-- 1 root root  2.6M Mar 18 20:26 tokenizer.json
-rw-r--r-- 1 root root  277K Mar 18 20:26 tokenizer_config.json
-rw-r--r-- 1 root root 1013K Mar 18 20:26 vocab.json

Startup command:

python3 -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.5 --model /root/whisper-large-v3-turbo --served-model-name whisper-large-v3-turbo --task transcription --port 28037
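For context, if the server came up, the OpenAI-compatible transcription endpoint would be exercised roughly like this (a sketch for illustration only; localhost and sample.wav are placeholders, not taken from this report):

curl http://localhost:28037/v1/audio/transcriptions -F model=whisper-large-v3-turbo -F file=@sample.wav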

Error log:

WARNING 03-20 05:26:39 [arg_utils.py:1754] --task transcription is not supported by the V1 Engine. Falling back to V0. 
INFO 03-20 05:26:39 [api_server.py:241] Started engine process with PID 1454
INFO 03-20 05:26:43 [__init__.py:256] Automatically detected platform cuda.
INFO 03-20 05:26:44 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.0) with config: model='/root/whisper-large-v3-turbo', speculative_config=None, tokenizer='/root/whisper-large-v3-turbo', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=whisper-large-v3-turbo, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 
INFO 03-20 05:26:46 [cuda.py:285] Using Flash Attention backend.
INFO 03-20 05:26:47 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-20 05:26:47 [model_runner.py:1110] Starting to load model /root/whisper-large-v3-turbo...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  2.01it/s]

INFO 03-20 05:26:48 [loader.py:429] Loading weights took 0.55 seconds
INFO 03-20 05:26:48 [model_runner.py:1146] Model loading took 1.5077 GB and 0.716281 seconds
INFO 03-20 05:26:49 [enc_dec_model_runner.py:278] Starting profile run for multi-modal models.
ERROR 03-20 05:26:59 [engine.py:448] CUDA out of memory. Tried to allocate 3.66 GiB. GPU 0 has a total capacity of 11.76 GiB of which 2.98 GiB is free. Process 32641 has 8.78 GiB memory in use. Of the allocated memory 8.49 GiB is allocated by PyTorch, and 161.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 03-20 05:26:59 [engine.py:448] Traceback (most recent call last):
ERROR 03-20 05:26:59 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 03-20 05:26:59 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 03-20 05:26:59 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
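
The allocator hint in the error message can be tried by exporting the suggested variable before launch. A minimal sketch of the same startup command with it set (whether this actually avoids the OOM here is not verified):

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python3 -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.5 --model /root/whisper-large-v3-turbo --served-model-name whisper-large-v3-turbo --task transcription --port 28037

Since the failure happens during the multi-modal profile run, lowering the scheduler batch with --max-num-seqs is another knob sometimes reduced on small GPUs (an assumption on my side, not something suggested in this report).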

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
