[Whisper] Fix audio feature device placement in encoder forward #22296
shenxiul wants to merge 1 commit into sgl-project:main
Conversation
`features.to(dtype)` only converts the dtype without moving the tensor to GPU, causing `RuntimeError: Expected all tensors to be on the same device` when `conv1` weights are on CUDA but `input_features` remain on CPU. This happens by default because `keep_mm_feature_on_device=False` moves features to CPU after preprocessing. Similarly, `encoder_position_ids` is created on `features.device` (CPU) via `torch.arange(...).to(features.device)`, so it also needs explicit device placement.

Fix: use `features.to(device=..., dtype=...)` and `encoder_position_ids.to(device)` to explicitly move both tensors to the model's device before calling the encoder.

## Bug introduced in sgl-project#21190 (feat: enable CUDA graph and timestamp for whisper model)

## Repro

```bash
# Launch server (default config, no special flags needed)
python -m sglang.launch_server --model-path openai/whisper-large-v3 --port 30000

# Send a transcription request — server crashes with:
#   RuntimeError: Expected all tensors to be on the same device,
#   but got weight is on cuda:0, different from other tensors on cpu
curl http://localhost:30000/v1/audio/transcriptions \
  -F file=@test.wav -F model=openai/whisper-large-v3
```

The crash occurs on the first request because the multimodal processor returns `input_features` as a CPU tensor (via `return_tensors="pt"` from HuggingFace's feature extractor), and the default `keep_mm_feature_on_device=False` keeps them on CPU. The encoder's `conv1` weights are on CUDA, so `F.conv1d` fails with a device mismatch.

## Benchmark (after fix)

Tested on NVIDIA GB300, `openai/whisper-large-v3`, `D4nt3/esb-datasets-earnings22-validation-tiny-filtered` (511 samples), concurrency=1:

|                     | CUDA graph | No CUDA graph |
|---------------------|------------|---------------|
| default             | 4.74 req/s | 1.01 req/s    |
| keep_mm_on_device   | 2.29 req/s | 1.19 req/s    |

- WER: 12.78% across all configs (matches sgl-project#21190)
- CUDA graph gives 4.7x throughput improvement
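Both halves of the bug described above can be demonstrated in isolation. The following is a minimal PyTorch sketch that runs on CPU; the tensor shapes and the stand-in conv layer are illustrative, not the actual `whisper.py` code:

```python
import torch

# input_features as the feature extractor returns them: a CPU tensor.
features = torch.randn(1, 128, 3000)

# .to(dtype) converts only the dtype; the tensor stays on its current device.
only_dtype = features.to(torch.float16)
assert only_dtype.device == features.device  # still on CPU — the bug

# torch.arange(...).to(features.device) likewise pins the ids to CPU.
position_ids = torch.arange(features.shape[-1] // 2).to(features.device)
assert position_ids.device.type == "cpu"

# The fix: take the target device from a model weight (a stand-in for
# encoder.conv1 here) and pass device and dtype together in one call.
conv1 = torch.nn.Conv1d(128, 1280, kernel_size=3, padding=1)
device, dtype = conv1.weight.device, conv1.weight.dtype
features = features.to(device=device, dtype=dtype)
position_ids = position_ids.to(device)
assert features.device == device and features.dtype == dtype
```

On a machine with a GPU, `conv1.weight.device` would be `cuda:0`, and the final two `.to(...)` calls are exactly what moves the tensors off the CPU.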
Duplicated with #22293
**Verified & Root Cause Analysis**

Tested on B200 — the fix is correct. The bug is 100% reproducible on current main.

**Root Cause**

The device placement issue was introduced by #22038 ([VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer).

- Before #22038: multimodal features were moved to the model's device by a global transfer step.
- After #22038: that global transfer was removed (-15 lines), with device transfer deferred to the VLM embedding path.

Problem: Whisper does not go through the VLM embedding path — it directly accesses the raw `input_features`, so nothing ever moves them to the GPU.

**Verification**

The fix is the right approach — models should not assume the caller has already moved tensors to the correct device.
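That principle (a module deriving the target device from its own weights instead of trusting the caller) can be sketched as follows. This is an illustrative PyTorch toy, not the actual SGLang encoder:

```python
import torch
import torch.nn as nn

class DeviceRobustEncoder(nn.Module):
    """Toy encoder front-end that never assumes the input's device/dtype."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv1d(128, 64, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Mirror of the fix: move the input to wherever conv1's weights
        # live, converting dtype in the same call.
        w = self.conv1.weight
        features = features.to(device=w.device, dtype=w.dtype)
        return self.conv1(features)

enc = DeviceRobustEncoder()
# A float64 input with mismatched dtype still works after the internal .to().
out = enc(torch.randn(1, 128, 100, dtype=torch.float64))
assert out.dtype == enc.conv1.weight.dtype
assert out.shape == (1, 64, 100)
```

With this pattern the forward pass is correct regardless of whether the caller kept features on CPU (`keep_mm_feature_on_device=False`) or moved them already.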
My bad, and Whisper will be added to CI afterwards.
QQ: for the 511-sample test runs, I'm getting the following (all numbers are req/s). Looks quite different from what you're getting; mind sharing the precise command so that I can double-check?
The benchmark commands used during #21190 development:

Server launch:

```bash
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path openai/whisper-large-v3 --port 30000
```

No extra flags — all defaults.

Benchmark:

```bash
# High concurrency (511 samples)
python benchmark/asr/bench_sglang.py \
  --base-url http://localhost:30000 \
  --model openai/whisper-large-v3 \
  --api-type transcription \
  --language en \
  --concurrency 64

# Single request (50 samples)
python benchmark/asr/bench_sglang.py \
  --base-url http://localhost:30000 \
  --model openai/whisper-large-v3 \
  --api-type transcription \
  --language en \
  --concurrency 1 \
  --n-examples 50
```

Hardware: single NVIDIA B200 (183 GB). Our c=64 numbers were ~48 req/s on B200 — the difference from your ~25 req/s on GB300 is likely hardware-dependent (memory bandwidth, clock speeds, etc.).
Verified on GB300 as well. Here are the numbers side by side: SGLang (main + this fix), `openai/whisper-large-v3`, 511 samples.

WER stays at ~12.77% across all configs on both GPUs.
Solved in #22293.
## Motivation

After #21190 enabled CUDA graph for Whisper, the server crashes on the first transcription request with `RuntimeError: Expected all tensors to be on the same device`.
This happens because `features.to(dtype)` at line 462 of `whisper.py` only converts the dtype without moving the tensor to GPU. The multimodal processor returns `input_features` as a CPU tensor (via HuggingFace's feature extractor with `return_tensors="pt"`), and the default `keep_mm_feature_on_device=False` keeps them on CPU. When the encoder's `conv1` (on CUDA) receives a CPU input, `F.conv1d` fails.

Similarly, `encoder_position_ids` is created on `features.device` (CPU) via `torch.arange(...).to(features.device)`, so it also lands on CPU.

## Modifications
- Change `features.to(dtype)` to `features.to(device=device, dtype=dtype)`, where `device` is obtained from `self.encoder.conv1.weight.device`
- Add `encoder_position_ids.to(device)` to move the position IDs to the same device

## Repro
Server crashes on the first transcription request with `RuntimeError: Expected all tensors to be on the same device`.
## Benchmark (after fix)
Tested on NVIDIA GB300, `openai/whisper-large-v3`, dataset `D4nt3/esb-datasets-earnings22-validation-tiny-filtered` (511 samples), concurrency=1.

## Checklist