
Whisper model support & /v1/audio/transcriptions endpoint & benchmark#16983

Merged
Kangyan-Zhou merged 36 commits into sgl-project:main from JustinTong0323:whisper-model-support
Feb 24, 2026
Conversation


@JustinTong0323 (Collaborator) commented on Jan 13, 2026

Motivation

  • Follow-up to support OpenAI Whisper #8064. Thanks @MahmoudAshraf97 for the brilliant work.
  • This PR builds on the original PR and adds a workaround for the encoder-decoder code path.
  • Adds an implementation of the /v1/audio/transcriptions endpoint (currently for Whisper only).
  • Adds a benchmark script for the ASR task.

Modifications

  1. Support the Whisper model
  2. Add an ASR benchmark

Further TODOs:

  • Adapt to sglang's native encoder-decoder code path
  • Fix the precision issue when using --attention-backend flashinfer

Accuracy Tests

Launch Server:

sglang serve --model-path openai/whisper-large-v3
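With the server up, the new endpoint accepts OpenAI-style multipart requests. A minimal client sketch (a hypothetical helper, not part of this PR; it assumes the server listens on localhost:30000, sglang's default port, and that the field names follow the OpenAI transcription API):

```python
import requests

def build_transcription_request(audio_bytes: bytes,
                                model: str = "openai/whisper-large-v3",
                                base_url: str = "http://localhost:30000"):
    """Build (without sending) a multipart POST for /v1/audio/transcriptions."""
    req = requests.Request(
        "POST",
        f"{base_url}/v1/audio/transcriptions",
        # "file" carries the audio payload, "model" names the served model,
        # mirroring the OpenAI transcription API's form fields.
        files={"file": ("audio.wav", audio_bytes, "audio/wav")},
        data={"model": model},
    )
    return req.prepare()

# To actually send it:
#   text = requests.Session().send(build_transcription_request(data)).json()["text"]
```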

Run benchmark script:

cd sglang/benchmark/asr
python bench_sglang.py --api-type transcription --concurrency 128 --show-predictions

Output:

Loading dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered...
Using API type: transcription
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Performing warmup...
Processing 511 samples...
------------------------------
Results for openai/whisper-large-v3:
Total Requests: 511
WER: 12.7690
Average Latency: 1.3602s
Median Latency: 1.2090s
95th Latency: 2.9986s
Throughput: 19.02 req/s
Token Throughput: 354.19 tok/s
Total Test Time: 26.8726s
------------------------------

==================== Sample Predictions ====================
Sample 1:
  REF: on the use of taxonomy i you know i think it is it is early days for us to to make any clear indications to the market about the proportion that would fall under that requirement
  PRED: on the eu taxonomy i think it is early days for us to make any clear indications to the market about the proportion that would fall under that requirement
----------------------------------------
Sample 2:
  REF: so within fiscal year 2021 say 120 a 100 depending on what the micro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
  PRED: so within fiscal year 2021 say $120000 $100000 depending on what the macro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
----------------------------------------
Sample 3:
  REF: we talked about 4.7 gigawatts
  PRED: we talked about 4.7 gigawatts
----------------------------------------
Sample 4:
  REF: and you know depending on that working capital build we will we will see what that yields
  PRED: and depending on that working capital build we will see what that yields what
----------------------------------------
Sample 5:
  REF: so on on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30 70%
  PRED: so on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexes are distributed out 30% 70%
----------------------------------------
============================================================
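The WER figure above is a word-level edit distance between reference and prediction. A minimal sketch of the metric (not the benchmark script's exact implementation, which may also normalize text before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming table over hypothesis prefixes.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(row[j] + 1,        # drop the reference word
                      row[j - 1] + 1,    # insert the hypothesis word
                      prev + (r != h))   # substitution (or match, cost 0)
            prev, row[j] = row[j], cur
    return row[-1] / max(len(ref), 1)

# Sample 3 above matches exactly, so its per-sample WER is 0.0.
```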

Benchmarking and Profiling

See the benchmark results above.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

MahmoudAshraf97 and others added 19 commits September 8, 2025 18:06
…gration

- Updated ModelConfig to handle unique layer ID scheme for Whisper architecture.
- Modified TokenizerManager to accommodate audio-only requests by using empty placeholders for input_ids.
- Improved WhisperAttention to manage cross-attention with encoder outputs and added masking for batched requests.
- Enhanced WhisperForConditionalGeneration to cache encoder outputs per request and manage input IDs for transcription.
- Added support for Whisper-specific conversation templates in the parser.

These changes optimize the handling of audio inputs and improve the integration of Whisper within the multimodal framework.
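The per-request encoder-output caching described above can be pictured with a small sketch (names such as EncoderCache and rid are illustrative, not sglang's actual internals):

```python
class EncoderCache:
    """Run the expensive audio encoder once per request id and reuse the
    result for every subsequent decode step of that request."""

    def __init__(self, encoder):
        self.encoder = encoder  # callable: audio features -> encoder states
        self._cache = {}        # request id -> cached encoder output

    def get(self, rid, audio_features):
        # Encode only on the first decode step; later steps hit the cache.
        if rid not in self._cache:
            self._cache[rid] = self.encoder(audio_features)
        return self._cache[rid]

    def release(self, rid):
        # Drop the cached states when the request finishes.
        self._cache.pop(rid, None)
```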

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 13, 2026
@yuan-luo yuan-luo self-requested a review January 13, 2026 08:15
JustinTong0323 and others added 3 commits January 14, 2026 20:23
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@JustinTong0323 JustinTong0323 added high priority Multi-modal multi-modal language model labels Jan 17, 2026
@JustinTong0323
Collaborator Author

/tag-and-rerun-ci

@JustinTong0323
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds comprehensive support for the Whisper model, including a new OpenAI-compatible transcription endpoint, a benchmark script, and the necessary model and processor implementations. The changes are well-structured and cover all the required aspects from the server entrypoint to the model-specific logic.

I've identified a few areas for improvement:

  • A potential regression in the is_encoder_decoder_model check that could affect existing models.
  • Minor improvements in the new benchmark script for accuracy in metric calculation and code style.
  • A small code simplification in the new transcription serving handler.

Overall, this is a great contribution that significantly extends the capabilities of the server. My detailed comments are below.

JustinTong0323 and others added 3 commits January 17, 2026 18:18
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
logger.warning(
"Cuda graph is disabled for encoder-decoder models (e.g., Whisper)"
)
self.disable_cuda_graph = True
Collaborator


Just curious: why is CUDA graph disabled for encoder-decoder models?

Collaborator Author


There are some compatibility issues with the current code path.

@JustinTong0323 changed the title from "Whisper model support" to "Whisper model support & /v1/audio/transcriptions endpoint & benchmark" on Jan 26, 2026
@JustinTong0323
Collaborator Author

/tag-and-rerun-ci

@JustinTong0323
Collaborator Author

/rerun-failed-ci

@JustinTong0323
Collaborator Author

/rerun-failed-ci

    kv, _ = self.kv_proj(cross_hidden_states)
    k, v = kv.split([self.kv_size, self.kv_size], dim=-1)
else:
    k = torch.zeros_like(q)
Contributor


Why are we using zero k and v here? We will get junk output in this case, since k and v do not come from the cached encoder output.

With the flashinfer backend, are you getting any useful transcript output?

@JustinTong0323
Collaborator Author

/tag-and-rerun-ci

@JustinTong0323
Collaborator Author

/rerun-failed-ci

@Kangyan-Zhou Kangyan-Zhou merged commit 581bf53 into sgl-project:main Feb 24, 2026
282 of 302 checks passed
zhuxinjie-nz pushed a commit to zhuxinjie-nz/sglang that referenced this pull request Feb 24, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xiaobaicxy added a commit to xiaobaicxy/sglang that referenced this pull request Feb 24, 2026
…o xverse_moe

* 'xverse_moe' of https://github.com/xiaobaicxy/sglang: (275 commits)
  fix: add missing blank line after docstring in serving_transcription.py (sgl-project#19206)
  Whisper model support & `/v1/audio/transcriptions` endpoint & benchmark (sgl-project#16983)
  fix: patch docker image fixes (sgl-project#19100)
  [PD-Disagg] Unify prefill info data transition flow, all with `PrefillServerInfo` (sgl-project#19195)
  [CI] Tiny enhance the dp attention load blance benchmark (sgl-project#19194)
  add new ci user (sgl-project#19133)
  [CI] fix the teardown output of disaggregation test (sgl-project#19193)
  [PD-Disagg] Support query dp rank from bootstrap server. (sgl-project#19168)
  [Kernel Slimming] Migrate AWQ marlin repack kernel to JIT (sgl-project#18949)
  [Diffusion] Match rotary_embedding module name style (sgl-project#19179)
  [Refactor] Split rotary_embedding.py into a modular package (sgl-project#19144)
  [NPU] bump sgl-kernel-npu to 2026.02.01.post2 (sgl-project#19178)
  Use single mma warp group for short q_len in FA to optimize decoding performance (sgl-project#18985)
  Reorganize topk logic to clean up code and expose logical experts (sgl-project#16945)
  [ROCm] Use unreg path for custom all-reduce during CUDA graph capture (sgl-project#19162)
  [diffusion] feat: detect Flux2 custom VAE path from component_paths (sgl-project#19170)
  [AMD] ENV flags tuning and cleanup (sgl-project#19176)
  Fix bench_one_batch_server by moving the print statements (sgl-project#19175)
  Update rocm7.2 Dockerfile to install amdsmi for QuickReduce Initialization (sgl-project#19091)
  Revert "Refactor graph input buffers (sgl-project#18991)" (sgl-project#19173)
  ...
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
KHAEntertainment pushed a commit to Clarit-AI/Engram that referenced this pull request Mar 31, 2026
…nscriptions (sgl-project#16983)

Resolved conflict in fastapi import — kept HEAD's Query + upstream's File/Form/UploadFile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JustinTong0323 added a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

documentation (Improvements or additions to documentation), high priority, Multi-modal (multi-modal language model), run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants