
Whisper model support & /v1/audio/transcriptions endpoint & benchmark#16983

Merged
Kangyan-Zhou merged 36 commits into sgl-project:main from JustinTong0323:whisper-model-support
Feb 24, 2026
Conversation


@JustinTong0323 (Collaborator) commented on Jan 13, 2026

Motivation

  • Follow-up to support OpenAI Whisper #8064. Thanks @MahmoudAshraf97 for the brilliant work.
  • This PR builds on the original PR and adds a workaround for the encoder-decoder code path.
  • Adds an implementation of the /v1/audio/transcriptions endpoint (currently for Whisper only).
  • Adds a benchmark script for the ASR task.

Modifications

  1. Support the Whisper model
  2. Add an ASR benchmark

Further TODOs:

  • Adapt to sglang's native encoder-decoder code path
  • Fix the precision issue when using --attention-backend flashinfer

Accuracy Tests

Launch Server:

sglang serve --model-path openai/whisper-large-v3
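With the server up, the new endpoint accepts OpenAI-style multipart requests. A minimal client sketch (a hypothetical helper, not part of this PR; it assumes the server listens on localhost:30000, sglang's default port, and that the field names follow the OpenAI transcription API):

```python
import requests

def build_transcription_request(audio_bytes: bytes,
                                model: str = "openai/whisper-large-v3",
                                base_url: str = "http://localhost:30000"):
    """Build (without sending) a multipart POST for /v1/audio/transcriptions."""
    req = requests.Request(
        "POST",
        f"{base_url}/v1/audio/transcriptions",
        # "file" carries the audio payload, "model" names the served model,
        # mirroring the OpenAI transcription API's form fields.
        files={"file": ("audio.wav", audio_bytes, "audio/wav")},
        data={"model": model},
    )
    return req.prepare()

# To actually send it:
#   text = requests.Session().send(build_transcription_request(data)).json()["text"]
```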

Run benchmark script:

cd sglang/benchmark/asr
python bench_sglang.py --api-type transcription --concurrency 128 --show-predictions

Output:

Loading dataset: D4nt3/esb-datasets-earnings22-validation-tiny-filtered...
Using API type: transcription
Repo card metadata block was not found. Setting CardData to empty.
WARNING:huggingface_hub.repocard:Repo card metadata block was not found. Setting CardData to empty.
Performing warmup...
Processing 511 samples...
------------------------------
Results for openai/whisper-large-v3:
Total Requests: 511
WER: 12.7690
Average Latency: 1.3602s
Median Latency: 1.2090s
95th Latency: 2.9986s
Throughput: 19.02 req/s
Token Throughput: 354.19 tok/s
Total Test Time: 26.8726s
------------------------------

==================== Sample Predictions ====================
Sample 1:
  REF: on the use of taxonomy i you know i think it is it is early days for us to to make any clear indications to the market about the proportion that would fall under that requirement
  PRED: on the eu taxonomy i think it is early days for us to make any clear indications to the market about the proportion that would fall under that requirement
----------------------------------------
Sample 2:
  REF: so within fiscal year 2021 say 120 a 100 depending on what the micro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
  PRED: so within fiscal year 2021 say $120000 $100000 depending on what the macro will do and next year it is not necessarily payable in q one is we will look at what the cash flows for 2022 look like
----------------------------------------
Sample 3:
  REF: we talked about 4.7 gigawatts
  PRED: we talked about 4.7 gigawatts
----------------------------------------
Sample 4:
  REF: and you know depending on that working capital build we will we will see what that yields
  PRED: and depending on that working capital build we will see what that yields what
----------------------------------------
Sample 5:
  REF: so on on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexs are distributed out 30 70%
  PRED: so on sinopec what we have agreed with sinopec way back then is that free cash flows after paying all capexes are distributed out 30% 70%
----------------------------------------
============================================================
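The WER figure above is a word-level edit distance between reference and prediction. A minimal sketch of the metric (not the benchmark script's exact implementation, which may also normalize text before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming table over hypothesis prefixes.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(row[j] + 1,        # drop the reference word
                      row[j - 1] + 1,    # insert the hypothesis word
                      prev + (r != h))   # substitution (or match, cost 0)
            prev, row[j] = row[j], cur
    return row[-1] / max(len(ref), 1)

# Sample 3 above matches exactly, so its per-sample WER is 0.0.
```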

Benchmarking and Profiling

See the benchmark results above.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

MahmoudAshraf97 and others added 19 commits September 8, 2025 18:06
…gration

- Updated ModelConfig to handle unique layer ID scheme for Whisper architecture.
- Modified TokenizerManager to accommodate audio-only requests by using empty placeholders for input_ids.
- Improved WhisperAttention to manage cross-attention with encoder outputs and added masking for batched requests.
- Enhanced WhisperForConditionalGeneration to cache encoder outputs per request and manage input IDs for transcription.
- Added support for Whisper-specific conversation templates in the parser.

These changes optimize the handling of audio inputs and improve the integration of Whisper within the multimodal framework.
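The per-request encoder-output caching described above can be pictured with a small sketch (names such as EncoderCache and rid are illustrative, not sglang's actual internals):

```python
class EncoderCache:
    """Run the expensive audio encoder once per request id and reuse the
    result for every subsequent decode step of that request."""

    def __init__(self, encoder):
        self.encoder = encoder  # callable: audio features -> encoder states
        self._cache = {}        # request id -> cached encoder output

    def get(self, rid, audio_features):
        # Encode only on the first decode step; later steps hit the cache.
        if rid not in self._cache:
            self._cache[rid] = self.encoder(audio_features)
        return self._cache[rid]

    def release(self, rid):
        # Drop the cached states when the request finishes.
        self._cache.pop(rid, None)
```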

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 13, 2026
@yuan-luo yuan-luo self-requested a review January 13, 2026 08:15
JustinTong0323 and others added 3 commits January 14, 2026 20:23
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
@JustinTong0323 JustinTong0323 added high priority Multi-modal multi-modal language model labels Jan 17, 2026
@JustinTong0323
Collaborator Author

/tag-and-rerun-ci

@JustinTong0323
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds comprehensive support for the Whisper model, including a new OpenAI-compatible transcription endpoint, a benchmark script, and the necessary model and processor implementations. The changes are well-structured and cover all the required aspects from the server entrypoint to the model-specific logic.

I've identified a few areas for improvement:

  • A potential regression in the is_encoder_decoder_model check that could affect existing models.
  • Minor improvements in the new benchmark script for accuracy in metric calculation and code style.
  • A small code simplification in the new transcription serving handler.

Overall, this is a great contribution that significantly extends the capabilities of the server. My detailed comments are below.

JustinTong0323 and others added 3 commits January 17, 2026 18:18
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
logger.warning(
"Cuda graph is disabled for encoder-decoder models (e.g., Whisper)"
)
self.disable_cuda_graph = True
Collaborator


Just curious: why is CUDA graph disabled for encoder-decoder models?

Collaborator Author


There are some compatibility issues with the current code path.

@JustinTong0323 changed the title from "Whisper model support" to "Whisper model support & /v1/audio/transcriptions endpoint & benchmark" on Jan 26, 2026
@JustinTong0323
Collaborator Author

/tag-and-rerun-ci

@JustinTong0323
Collaborator Author

/rerun-failed-ci

@JustinTong0323
Collaborator Author

/rerun-failed-ci

    kv, _ = self.kv_proj(cross_hidden_states)
    k, v = kv.split([self.kv_size, self.kv_size], dim=-1)
else:
    k = torch.zeros_like(q)
Contributor


Why are we using zero k and v here? We will get junk output in this case, since k and v do not come from the cached encoder output.

With the flashinfer backend, are you getting any useful transcript output?

@JustinTong0323
Collaborator Author

/tag-and-rerun-ci

@JustinTong0323
Collaborator Author

/rerun-failed-ci

@Kangyan-Zhou Kangyan-Zhou merged commit 581bf53 into sgl-project:main Feb 24, 2026
282 of 302 checks passed
zhuxinjie-nz pushed a commit to zhuxinjie-nz/sglang that referenced this pull request Feb 24, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xiaobaicxy added a commit to xiaobaicxy/sglang that referenced this pull request Feb 24, 2026
…o xverse_moe

* 'xverse_moe' of https://github.com/xiaobaicxy/sglang: (275 commits)
  fix: add missing blank line after docstring in serving_transcription.py (sgl-project#19206)
  Whisper model support & `/v1/audio/transcriptions` endpoint & benchmark (sgl-project#16983)
  fix: patch docker image fixes (sgl-project#19100)
  [PD-Disagg] Unify prefill info data transition flow, all with `PrefillServerInfo` (sgl-project#19195)
  [CI] Tiny enhance the dp attention load blance benchmark (sgl-project#19194)
  add new ci user (sgl-project#19133)
  [CI] fix the teardown output of disaggregation test (sgl-project#19193)
  [PD-Disagg] Support query dp rank from bootstrap server. (sgl-project#19168)
  [Kernel Slimming] Migrate AWQ marlin repack kernel to JIT (sgl-project#18949)
  [Diffusion] Match rotary_embedding module name style (sgl-project#19179)
  [Refactor] Split rotary_embedding.py into a modular package (sgl-project#19144)
  [NPU] bump sgl-kernel-npu to 2026.02.01.post2 (sgl-project#19178)
  Use single mma warp group for short q_len in FA to optimize decoding performance (sgl-project#18985)
  Reorganize topk logic to clean up code and expose logical experts (sgl-project#16945)
  [ROCm] Use unreg path for custom all-reduce during CUDA graph capture (sgl-project#19162)
  [diffusion] feat: detect Flux2 custom VAE path from component_paths (sgl-project#19170)
  [AMD] ENV flags tuning and cleanup (sgl-project#19176)
  Fix bench_one_batch_server by moving the print statements (sgl-project#19175)
  Update rocm7.2 Dockerfile to install amdsmi for QuickReduce Initialization (sgl-project#19091)
  Revert "Refactor graph input buffers (sgl-project#18991)" (sgl-project#19173)
  ...
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
KHAEntertainment pushed a commit to Clarit-AI/Engram that referenced this pull request Mar 31, 2026
…nscriptions (sgl-project#16983)

Resolved conflict in fastapi import — kept HEAD's Query + upstream's File/Form/UploadFile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JustinTong0323 added a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…rk (sgl-project#16983)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Labels

documentation (Improvements or additions to documentation), high priority, Multi-modal (multi-modal language model), run-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants