Whisper model support & `/v1/audio/transcriptions` endpoint & benchmark #16983

Kangyan-Zhou merged 36 commits into sgl-project:main

Conversation
…gration

- Updated ModelConfig to handle unique layer ID scheme for Whisper architecture.
- Modified TokenizerManager to accommodate audio-only requests by using empty placeholders for input_ids.
- Improved WhisperAttention to manage cross-attention with encoder outputs and added masking for batched requests.
- Enhanced WhisperForConditionalGeneration to cache encoder outputs per request and manage input IDs for transcription.
- Added support for Whisper-specific conversation templates in the parser.

These changes optimize the handling of audio inputs and improve the integration of Whisper within the multimodal framework.

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
/tag-and-rerun-ci
/gemini review
Code Review
This pull request adds comprehensive support for the Whisper model, including a new OpenAI-compatible transcription endpoint, a benchmark script, and the necessary model and processor implementations. The changes are well-structured and cover all the required aspects from the server entrypoint to the model-specific logic.
I've identified a few areas for improvement:
- A potential regression in the `is_encoder_decoder_model` check that could affect existing models.
- Minor improvements in the new benchmark script for accuracy in metric calculation and code style.
- A small code simplification in the new transcription serving handler.
Overall, this is a great contribution that significantly extends the capabilities of the server. My detailed comments are below.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
```python
logger.warning(
    "Cuda graph is disabled for encoder-decoder models (e.g., Whisper)"
)
self.disable_cuda_graph = True
```
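The gating above can be sketched as a small helper. This is a hypothetical reconstruction: `ModelRunnerConfig`, `is_encoder_decoder`, and `apply_cuda_graph_policy` are illustrative names, not the actual sglang classes or functions.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class ModelRunnerConfig:
    # Illustrative stand-in for the real runner config; names are assumptions.
    is_encoder_decoder: bool = False
    disable_cuda_graph: bool = False


def apply_cuda_graph_policy(cfg: ModelRunnerConfig) -> ModelRunnerConfig:
    """Force CUDA graph off for encoder-decoder models (e.g., Whisper)."""
    if cfg.is_encoder_decoder and not cfg.disable_cuda_graph:
        logger.warning(
            "Cuda graph is disabled for encoder-decoder models (e.g., Whisper)"
        )
        cfg.disable_cuda_graph = True
    return cfg
```

Decoder-only models keep whatever setting the user passed; only encoder-decoder architectures are forced to eager mode.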
Just curious, why is CUDA graph disabled for encoder-decoder models?
There are some compatibility issues with the current code path.
/tag-and-rerun-ci

/rerun-failed-ci
```python
    kv, _ = self.kv_proj(cross_hidden_states)
    k, v = kv.split([self.kv_size, self.kv_size], dim=-1)
else:
    k = torch.zeros_like(q)
```
Why are we using zero k and v here? We will get junk output in this case, since k and v do not come from the encoder cache output.
With the flashinfer backend, are you getting any useful transcript output?
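To see why zeroed k/v would produce degenerate output: with all-zero keys, every attention score is zero, softmax becomes uniform, and the output is just the mean of the values (all zeros, if v is zeroed too). A pure-Python sketch of single-query dot-product attention, ignoring masking, illustrates this:

```python
import math


def attention(q, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Output is the weight-averaged value vector.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return weights, out


# With zeroed keys, all scores are equal -> uniform attention weights;
# with zeroed values as well, the attention output is all zeros.
q = [1.0, 2.0]
zero_k = [[0.0, 0.0]] * 3
zero_v = [[0.0, 0.0]] * 3
weights, out = attention(q, zero_k, zero_v)
```

So a zero k/v fallback only makes sense on code paths (e.g., warmup or dummy runs) where the output is discarded; on a real decode step it would indeed yield junk.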
/tag-and-rerun-ci

/rerun-failed-ci
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
`/v1/audio/transcriptions` endpoint (currently for Whisper only).

Modifications
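The new endpoint follows the OpenAI transcription API shape (multipart form with `file` and `model` fields). A minimal client sketch is below; the host, port, and model name are assumptions for illustration, not values from this PR, and the request is built but not sent:

```python
import urllib.request
import uuid


def build_transcription_request(audio_bytes: bytes, filename: str, model: str,
                                base_url: str = "http://localhost:30000"):
    """Build (but do not send) a multipart POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = [
        # "model" form field.
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="model"\r\n\r\n{model}\r\n'.encode(),
        # "file" form field carrying the raw audio payload.
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
        + audio_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` (or an OpenAI-compatible client pointed at the server) should return a JSON body with the transcribed `text`.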
Further TODOs:
- `--attention-backend flashinfer`

Accuracy Tests
Launch Server:
Run benchmark script:
```shell
cd sglang/benchmark/asr
python bench_sglang.py --api-type transcription --concurrency 128 --show-predictions
```

Output:
Benchmarking and Profiling
As above.
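ASR benchmarks like this one typically report word error rate (WER); whether `bench_sglang.py` computes exactly this metric is an assumption, but a minimal reference implementation via word-level Levenshtein distance is:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

In practice both strings are usually normalized (lowercased, punctuation stripped) before scoring, which this sketch omits.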
Checklist
Review Process

`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`