Labels: bug (Something isn't working)
Description
Your current environment
🐛 Describe the bug
root@8b927132d0a8:~/whisper-large-v3-turbo# pip list | grep vllm
vllm 0.8.0
root@8b927132d0a8:~/whisper-large-v3-turbo# nvidia-smi
Thu Mar 20 05:18:02 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:03:00.0 Off | N/A |
| 30% 30C P8 11W / 170W | 4MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@8b927132d0a8:~/whisper-large-v3-turbo# ls -lh
total 1.6G
-rw-r--r-- 1 root root 21K Mar 18 20:26 README.md
-rw-r--r-- 1 root root 34K Mar 18 20:26 added_tokens.json
-rw-r--r-- 1 root root 1.3K Mar 18 20:26 config.json
-rw-r--r-- 1 root root 86 Mar 18 20:26 configuration.json
-rw-r--r-- 1 root root 3.7K Mar 18 20:26 generation_config.json
-rw-r--r-- 1 root root 483K Mar 18 20:26 merges.txt
-rw-r--r-- 1 root root 1.6G Mar 18 20:30 model.safetensors
-rw-r--r-- 1 root root 52K Mar 18 20:26 normalizer.json
-rw-r--r-- 1 root root 340 Mar 18 20:26 preprocessor_config.json
-rw-r--r-- 1 root root 2.2K Mar 18 20:26 special_tokens_map.json
-rw-r--r-- 1 root root 2.6M Mar 18 20:26 tokenizer.json
-rw-r--r-- 1 root root 277K Mar 18 20:26 tokenizer_config.json
-rw-r--r-- 1 root root 1013K Mar 18 20:26 vocab.json
Startup command:
python3 -m vllm.entrypoints.openai.api_server --gpu-memory-utilization 0.5 --model /root/whisper-large-v3-turbo --served-model-name whisper-large-v3-turbo --task transcription --port 28037
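For context, a minimal sketch of how the transcription endpoint would be exercised once the server comes up, assuming the OpenAI-compatible /v1/audio/transcriptions route on the port above; `sample.wav` is a hypothetical local audio file used only for illustration:

```python
# Minimal sketch (assumption: the server started with the command above is
# listening on port 28037 and exposes the OpenAI-compatible audio API).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:28037/v1",  # port from the startup command
    api_key="EMPTY",                       # vLLM does not require a real key by default
)

# "sample.wav" is a hypothetical file used only to illustrate the request shape.
with open("sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",    # matches --served-model-name
        file=audio,
    )

print(result.text)
```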
Error log:
WARNING 03-20 05:26:39 [arg_utils.py:1754] --task transcription is not supported by the V1 Engine. Falling back to V0.
INFO 03-20 05:26:39 [api_server.py:241] Started engine process with PID 1454
INFO 03-20 05:26:43 [__init__.py:256] Automatically detected platform cuda.
INFO 03-20 05:26:44 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.0) with config: model='/root/whisper-large-v3-turbo', speculative_config=None, tokenizer='/root/whisper-large-v3-turbo', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=448, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=whisper-large-v3-turbo, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 03-20 05:26:46 [cuda.py:285] Using Flash Attention backend.
INFO 03-20 05:26:47 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-20 05:26:47 [model_runner.py:1110] Starting to load model /root/whisper-large-v3-turbo...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.01it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.01it/s]
INFO 03-20 05:26:48 [loader.py:429] Loading weights took 0.55 seconds
INFO 03-20 05:26:48 [model_runner.py:1146] Model loading took 1.5077 GB and 0.716281 seconds
INFO 03-20 05:26:49 [enc_dec_model_runner.py:278] Starting profile run for multi-modal models.
ERROR 03-20 05:26:59 [engine.py:448] CUDA out of memory. Tried to allocate 3.66 GiB. GPU 0 has a total capacity of 11.76 GiB of which 2.98 GiB is free. Process 32641 has 8.78 GiB memory in use. Of the allocated memory 8.49 GiB is allocated by PyTorch, and 161.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 03-20 05:26:59 [engine.py:448] Traceback (most recent call last):
ERROR 03-20 05:26:59 [engine.py:448] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 03-20 05:26:59 [engine.py:448] engine = MQLLMEngine.from_vllm_config(
ERROR 03-20 05:26:59 [engine.py:448] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
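As a side note: the earlier nvidia-smi snapshot showed only 4 MiB in use, while the OOM message reports 8.78 GiB held by process 32641 during the profile run. A small diagnostic sketch (assumptions: run on the same GPU 0 right before launching the server, with PyTorch installed) can compare the memory that is actually free against what `--gpu-memory-utilization 0.5` budgets on this 12 GiB card:

```python
# Diagnostic sketch: check free vs. total GPU memory before starting the server.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info(0)  # (free, total) in bytes on GPU 0
gib = 1024 ** 3
budget = 0.5 * total_bytes  # what --gpu-memory-utilization 0.5 lets vLLM use

print(f"free:   {free_bytes / gib:.2f} GiB")
print(f"total:  {total_bytes / gib:.2f} GiB")
print(f"budget: {budget / gib:.2f} GiB")
# Per the log, the profile run then tried to allocate a further 3.66 GiB,
# which does not fit in what remains free on a 12 GiB card.
```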
Before submitting a new issue...
- Make sure you have already searched for relevant issues and asked the chatbot at the bottom right corner of the documentation page, which can answer many frequently asked questions.