[diffusion] quant: Add flag for runtime quantization#23373

Closed
avjves wants to merge 11 commits into sgl-project:main from avjves:feature/runtime_quant

@avjves avjves commented Apr 21, 2026

Motivation

Currently there is no way to quantize the model weights (linear layers) at runtime; only pre-quantized transformers are supported. This PR adds a new flag, --quantization, that quantizes the model on the fly.

For now it only supports FP8, because FP8 is the only one of the currently supported techniques that is suited to runtime quantization.
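
To illustrate what quantizing a linear layer's weights to FP8 on the fly can look like, here is a minimal pure-Python sketch. It assumes per-tensor dynamic scaling into the E4M3 range; the PR description does not say which scaling scheme or FP8 variant is used, and the helper names (`round_to_e4m3`, `quantize_per_tensor`, `dequantize`) are hypothetical, not SGLang functions.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x (assumed to lie within +/-448) to the nearest E4M3 value."""
    if x == 0.0:
        return 0.0
    _, exp = math.frexp(abs(x))      # abs(x) == m * 2**exp with 0.5 <= m < 1
    exponent = max(exp - 1, -6)      # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)     # 3 mantissa bits give 8 steps per binade
    return round(x / step) * step    # round() breaks ties to even


def quantize_per_tensor(weights):
    """Assumed scheme: one scale per tensor, derived from the max magnitude."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    quantized = [
        round_to_e4m3(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)))
        for w in weights
    ]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_per_tensor(weights)
restored = dequantize(quantized, scale)
```

The scale depends only on the weights themselves, so under this assumed scheme nothing beyond the loaded tensor is needed to quantize it.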

Modifications

  1. Adds a new flag --quantization for quantizing the model during runtime.
  2. Fixes the quantization config fetching so that NVFP4 configs don't take precedence over CLI arguments.
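
The precedence described in the second item can be sketched as follows. This is an illustration only: `build_parser`, `resolve_quant_method`, and the `"nvfp4"` string are hypothetical stand-ins, and the PR's actual resolution logic is not shown in this conversation. Only the `--quantization` flag name and the FP8-only restriction come from the description above.

```python
import argparse
from typing import Optional

# Per the PR description, FP8 is the only runtime technique for now.
SUPPORTED_RUNTIME_QUANT = ("fp8",)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the flag named in the PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--quantization",
        choices=SUPPORTED_RUNTIME_QUANT,
        default=None,
        help="Quantize the model weights at runtime.",
    )
    return parser


def resolve_quant_method(
    cli_value: Optional[str], checkpoint_value: Optional[str]
) -> Optional[str]:
    """An explicit CLI choice wins; otherwise keep what the checkpoint implies."""
    if cli_value is not None:
        return cli_value
    return checkpoint_value


args = build_parser().parse_args(["--quantization", "fp8"])
chosen = resolve_quant_method(args.quantization, "nvfp4")

no_flag = build_parser().parse_args([])
fallback = resolve_quant_method(no_flag.quantization, "nvfp4")
```

In this sketch the checkpoint-derived value is still used whenever the flag is absent, so a pre-quantized model keeps working without the flag.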

Accuracy Tests

No quantization (bf16):

[Image: Add_a_cool_hat_to_the_cat_bf16]

Offline quantization before running (fp8):

[Image: Add_a_cool_hat_to_the_cat_20260421-113220_32a8d58d_fp8_pre]

Runtime quantization (fp8):

[Image: Add_a_cool_hat_to_the_cat_runtime]

This PR doesn't add any new quantization techniques, but the perf report below shows that runtime quantization takes effect: the denoising stage is about 24% faster.

1. High-level Summary

| Metric | Baseline | New | Diff | Status |
| --- | --- | --- | --- | --- |
| E2E Latency | 6482.07 ms | 4994.77 ms | -1487.30 ms (-22.9%) | |
| Throughput | 0.15 req/s | 0.20 req/s | - | - |
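
The Diff and Throughput figures in the summary can be reproduced from the two latencies. The sketch below assumes throughput is simply the inverse of end-to-end latency, meaning one request processed at a time; the report does not state how throughput was measured, and the function names are illustrative.

```python
def relative_diff_percent(baseline_ms: float, new_ms: float) -> float:
    """Signed change relative to the baseline, in percent."""
    return (new_ms - baseline_ms) / baseline_ms * 100.0


def throughput_req_per_s(latency_ms: float) -> float:
    """Assumption: one request in flight, so throughput is 1 / latency."""
    return 1000.0 / latency_ms


baseline_ms, new_ms = 6482.07, 4994.77
diff_ms = round(new_ms - baseline_ms, 2)
diff_pct = round(relative_diff_percent(baseline_ms, new_ms), 1)
baseline_tput = round(throughput_req_per_s(baseline_ms), 2)
new_tput = round(throughput_req_per_s(new_ms), 2)
```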

2. Stage Breakdown

| Stage Name | Baseline (ms) | New (ms) | Diff (ms) | Diff (%) | Status |
| --- | --- | --- | --- | --- | --- |
| InputValidationStage | 15.14 | 15.29 | +0.15 | +1.0% | ⚪️ |
| TextEncodingStage | 17.76 | 17.78 | +0.02 | +0.1% | ⚪️ |
| ImageVAEEncodingStage | 131.87 | 132.05 | +0.18 | +0.1% | ⚪️ |
| LatentPreparationStage | 0.17 | 0.21 | +0.04 | +21.6% | ⚪️ |
| TimestepPreparationStage | 0.27 | 0.29 | +0.02 | +6.7% | ⚪️ |
| DenoisingStage | 6290.57 | 4796.93 | -1493.65 | -23.7% | 🟢 |
| DecodingStage | 5.24 | 7.71 | +2.47 | +47.1% | ⚪️ |

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@github-actions (bot) added the quant and diffusion labels Apr 21, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces runtime FP8 quantization support by adding a --quantization CLI argument and updating the quantization configuration resolution process. Feedback indicates that the refactoring of _resolve_quant_config contains regressions, including the use of stale configuration names and incorrect early returns that skip essential NVFP4 inference and metadata probing.

avjves and others added 2 commits April 21, 2026 15:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@mickqian (Collaborator) commented:

/tag-and-rerun-ci

@mickqian mickqian left a comment

Could you update the docs as a follow-up?

@avjves avjves requested a review from wisclmy0611 as a code owner April 24, 2026 14:06
@github-actions (bot) added the documentation label Apr 24, 2026

avjves commented Apr 24, 2026

@mickqian I updated the docs now :)

@mickqian (Collaborator) commented:

/rerun-failed-ci

avjves commented May 11, 2026

Closing, because #20922, which already adds this support, has been merged.

@avjves avjves closed this May 11, 2026

Labels: diffusion, documentation, quant, run-ci
