[diffusion] quant: Add flag for runtime quantization by avjves · Pull Request #23373 · sgl-project/sglang

avjves · 2026-04-21T12:09:57Z

Motivation

Currently there is no way to quantize the model weights (linear layers) at runtime; only pre-quantized transformers are supported. This PR adds a new --quantization flag that quantizes the weights on the fly.

For now it supports only FP8, because among the currently supported techniques FP8 is the only one suited to runtime quantization.
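
To illustrate why FP8 can be applied at load time without calibration data, here is a minimal sketch of per-tensor dynamic scaling to the FP8 E4M3 format, simulated in pure Python. The PR text does not say which scaling scheme or kernels are used, so per-tensor scaling is an assumption, and `round_to_e4m3`, `quantize_per_tensor` and `dequantize` are hypothetical names rather than SGLang functions.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 (fn) format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so the binade exponent is e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, -6)     # values below 2**-6 share the subnormal spacing
    step = 2.0 ** (exp - 3)  # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(weights: list[float]) -> tuple[list[float], float]:
    """Return (q, scale) such that each weight is approximately q * scale."""
    amax = max(abs(w) for w in weights)
    # Map the largest magnitude onto the largest FP8 value (assumed scheme).
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(q: list[float], scale: float) -> list[float]:
    """Recover approximate original weights from FP8 values and the scale."""
    return [v * scale for v in q]
```

The only statistic needed is the maximum absolute weight, which is available as soon as the checkpoint is loaded. A real implementation would operate on tensors and store the result in an FP8 dtype instead of Python floats.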

Modifications

Adds a new flag --quantization for quantizing the model during runtime.
Fixes the quantization config fetching so that NVFP4 configs don't take precedence over CLI arguments.
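
The precedence rule described above can be sketched as follows. This is an illustration only: `build_parser` and `resolve_quant_method` are hypothetical helpers, the `"nvfp4"` string stands in for whatever the checkpoint metadata reports, and the actual resolution logic in the PR (the review refers to `_resolve_quant_config`) is not shown in this conversation.

```python
import argparse
from typing import Optional

# Assumption: FP8 is the only value accepted, as stated in the PR description.
SUPPORTED_RUNTIME_QUANT = ("fp8",)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing only the --quantization flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--quantization",
        choices=SUPPORTED_RUNTIME_QUANT,
        default=None,
        help="Quantize linear-layer weights while loading the model.",
    )
    return parser


def resolve_quant_method(
    cli_quantization: Optional[str],
    checkpoint_quant_method: Optional[str],
) -> Optional[str]:
    """Pick the quantization method, letting the CLI flag win."""
    if cli_quantization is not None:
        return cli_quantization
    # No flag given: fall back to what the checkpoint declares (may be None).
    return checkpoint_quant_method


args = build_parser().parse_args(["--quantization", "fp8"])
chosen = resolve_quant_method(args.quantization, "nvfp4")
```

When the flag is omitted, the checkpoint-derived method must still be used, which is the case the automated review below warns about with respect to early returns.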

Accuracy Tests

No quantization (bf16):

Offline quantization before running (fp8):

[image: Add_a_cool_hat_to_the_cat_20260421-113220_32a8d58d_fp8_pre]

Runtime quantization (fp8):

This PR doesn't add any new quantization techniques, but the perf report below shows that quantization is being applied.

1. High-level Summary

Metric	Baseline	New	Diff	Status
E2E Latency	6482.07 ms	4994.77 ms	-1487.30 ms (-22.9%)	✅
Throughput	0.15 req/s	0.20 req/s	-	-

2. Stage Breakdown

Stage Name	Baseline (ms)	New (ms)	Diff (ms)	Diff (%)	Status
InputValidationStage	15.14	15.29	+0.15	+1.0%	⚪️
TextEncodingStage	17.76	17.78	+0.02	+0.1%	⚪️
ImageVAEEncodingStage	131.87	132.05	+0.18	+0.1%	⚪️
LatentPreparationStage	0.17	0.21	+0.04	+21.6%	⚪️
TimestepPreparationStage	0.27	0.29	+0.02	+6.7%	⚪️
DenoisingStage	6290.57	4796.93	-1493.65	-23.7%	🟢
DecodingStage	5.24	7.71	+2.47	+47.1%	⚪️
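
The percentages in the two tables can be recomputed from the reported latencies. The sketch below does that arithmetic; the throughput line assumes a single request is processed at a time, so that throughput is simply the inverse of end-to-end latency, which matches the 0.15 and 0.20 req/s figures but is not stated in the report.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change from baseline to new (negative means faster)."""
    return (new - baseline) / baseline * 100.0


# Values copied from the perf report above, in milliseconds.
e2e_baseline_ms, e2e_new_ms = 6482.07, 4994.77
denoise_baseline_ms, denoise_new_ms = 6290.57, 4796.93

e2e_pct = relative_change(e2e_baseline_ms, e2e_new_ms)
denoise_pct = relative_change(denoise_baseline_ms, denoise_new_ms)

# Assumption: one request in flight, so req/s = 1000 / latency in ms.
throughput_baseline = 1000.0 / e2e_baseline_ms
throughput_new = 1000.0 / e2e_new_ms
speedup = e2e_baseline_ms / e2e_new_ms

print(f"E2E: {e2e_pct:.1f}%  Denoising: {denoise_pct:.1f}%")
print(f"Throughput: {throughput_baseline:.2f} -> {throughput_new:.2f} req/s")
print(f"Speedup: {speedup:.2f}x")
```

Almost all of the end-to-end saving comes from DenoisingStage, which is where the transformer's linear layers run; the other stages change by at most a few milliseconds.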

Speed Tests and Profiling

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request introduces runtime FP8 quantization support by adding a --quantization CLI argument and updating the quantization configuration resolution process. Feedback indicates that the refactoring of _resolve_quant_config contains regressions, including the use of stale configuration names and incorrect early returns that skip essential NVFP4 inference and metadata probing.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

mickqian · 2026-04-21T14:39:39Z

/tag-and-rerun-ci

mickqian

could you update the doc as a follow up

avjves · 2026-04-24T14:07:05Z

@mickqian I updated the docs now :)

mickqian · 2026-04-24T16:08:07Z

/rerun-failed-ci

avjves · 2026-05-11T07:24:58Z

Closing, as #20922 was merged that already adds this support.

[diffusion] quant: Add flag for runtime quantization

8f3ae00

avjves requested review from mickqian, ping1jing2 and yhyang201 as code owners April 21, 2026 12:09

github-actions Bot added the quant (LLM Quantization) and diffusion (SGLang Diffusion) labels Apr 21, 2026

gemini-code-assist Bot reviewed Apr 21, 2026

Comment thread python/sglang/multimodal_gen/runtime/loader/transformer_load_utils.py

avjves mentioned this pull request Apr 21, 2026

[Feature] diffusion: dynamic quantization #22884

Open

2 tasks

avjves and others added 2 commits April 21, 2026 15:17

Apply suggestion from @gemini-code-assist[bot]

ade2684

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Remove comment

e66de4a

mickqian approved these changes Apr 21, 2026

github-actions Bot added the run-ci label Apr 21, 2026

BowenBao mentioned this pull request Apr 22, 2026

[Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation #21431

Merged

Merge branch 'main' into feature/runtime_quant

b73a4fe

mickqian reviewed Apr 23, 2026

avjves added 3 commits April 24, 2026 08:57

Update docs

7ff9157

Rename runtime quant to online quant

aa69fc7

Merge branch 'feature/runtime_quant' of github.com:avjves/sglang into feature/runtime_quant

eb8fade

avjves requested a review from wisclmy0611 as a code owner April 24, 2026 14:06

github-actions Bot added the documentation (Improvements or additions to documentation) label Apr 24, 2026

Merge branch 'main' into feature/runtime_quant

76b1594

avjves and others added 3 commits April 27, 2026 14:23

Merge branch 'main' into feature/runtime_quant

7878e8f

Merge branch 'main' into feature/runtime_quant

a05e35e

Merge branch 'main' into feature/runtime_quant

6c3a518

avjves closed this May 11, 2026

avjves wants to merge 11 commits into sgl-project:main from avjves:feature/runtime_quant
