[Fix] Handle online FP8 quantization config for unquantized diffusion models by yichiche · Pull Request #26261 · sgl-project/sglang

yichiche · 2026-05-25T06:50:14Z

Motivation

When running diffusion models with --quantization fp8 on an unquantized model (online FP8 quantization), the server crashes during model loading with:

ValueError: Cannot find any of ['quant_method'] in the model's quantization config

This happens because _resolve_quant_config() calls Fp8Config.from_config({}) with an empty dict — unquantized models have no embedded quantization metadata, so the empty config lacks the required quant_method key.
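The failure mode can be reproduced in isolation. The sketch below uses `StandInFp8Config`, a hypothetical stand-in written for illustration (it is not sglang's `Fp8Config`, and its constructor parameter is assumed). It only mimics the reported behavior: `from_config` requires a `quant_method` key and raises the `ValueError` quoted above when given an empty dict.

```python
from typing import Any, Dict, List, Optional


class StandInFp8Config:
    """Hypothetical stand-in for Fp8Config; NOT the sglang class."""

    def __init__(self, ignored_layers: Optional[List[str]] = None) -> None:
        self.ignored_layers = list(ignored_layers or [])

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "StandInFp8Config":
        # Assumption: the loader looks up required keys and raises
        # ValueError when none is present, as in the reported traceback.
        keys = ["quant_method"]
        if not any(key in config for key in keys):
            raise ValueError(
                f"Cannot find any of {keys} in the model's quantization config"
            )
        return cls(ignored_layers=config.get("ignored_layers"))


def try_from_config(config: Dict[str, Any]) -> str:
    """Return 'ok' or the error text, so the failure is easy to inspect."""
    try:
        StandInFp8Config.from_config(config)
        return "ok"
    except ValueError as exc:
        return str(exc)


# A pre-quantized checkpoint carries metadata; an unquantized one does not.
print(try_from_config({"quant_method": "fp8"}))
print(try_from_config({}))
```

The first call succeeds because the metadata dict names a quantization method; the second reproduces the error message because online quantization starts from an empty dict.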

Modifications

File: python/sglang/multimodal_gen/runtime/loader/transformer_load_utils.py

Add a try/except fallback around quant_cls.from_config({}). When it raises ValueError or KeyError (no quant metadata in config), fall back to direct Fp8Config(ignored_layers=...) construction with default parameters — the correct path for online FP8 quantization of unquantized model weights.
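The fallback described above can be sketched as follows. Everything here is illustrative: `StandInFp8Config`, `resolve_quant_config_sketch`, the `proj_out` layer name, and the `SimpleNamespace` used as `server_args` are hypothetical stand-ins. Only the `try`/`except (ValueError, KeyError)` shape and the `quantization_ignored_layers` attribute name are taken from this PR's diff. The sketch also takes the metadata dict as a parameter so both paths can be exercised, whereas the PR calls `from_config({})` directly.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


class StandInFp8Config:
    """Hypothetical stand-in for Fp8Config; NOT the sglang class."""

    def __init__(self, ignored_layers: Optional[List[str]] = None) -> None:
        self.ignored_layers = list(ignored_layers or [])

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "StandInFp8Config":
        # Mimics the reported behavior: the key is mandatory.
        if "quant_method" not in config:
            raise ValueError(
                "Cannot find any of ['quant_method'] in the model's quantization config"
            )
        return cls(ignored_layers=config.get("ignored_layers"))


def resolve_quant_config_sketch(quant_cls, quant_metadata: Dict[str, Any], server_args: Any):
    """Try embedded metadata first; fall back to defaults for online quantization."""
    try:
        return quant_cls.from_config(quant_metadata)
    except (ValueError, KeyError):
        # No quant metadata: build the config directly with default parameters.
        ignored = getattr(server_args, "quantization_ignored_layers", None)
        return quant_cls(ignored_layers=ignored)


# Hypothetical server arguments; "proj_out" is an invented layer name.
args = SimpleNamespace(quantization_ignored_layers=["proj_out"])
online = resolve_quant_config_sketch(StandInFp8Config, {}, args)
embedded = resolve_quant_config_sketch(StandInFp8Config, {"quant_method": "fp8"}, args)
print(online.ignored_layers)
print(embedded.ignored_layers)
```

With an empty dict the fallback path picks up the ignored layers from `server_args`; with embedded metadata the `from_config()` path is used unchanged, which is the backwards-compatibility property listed in the checklist.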

Benchmarking

Model: Wan2.2-T2V-A14B | GPU: MI355X (gfx950) | Config: 1 GPU, --quantization fp8, layerwise offload

Attention Kernel Comparison

Metric	Before (BF16 attn)	After (FP8 attn)	Delta
Kernel	`aiter::fmha_fwd_hd128_bf16`	`aiter::mla_pfl_qh192_vh128_m32x8_n128x1_causal0`	—
Avg per call (us)	66,417	44,166	-33.5%
Total attention time (us)	85,014,040	56,532,228	-33.5%
% of total kernel time	70.1%	55.2%	-14.9pp

End-to-End Pipeline Timing

Stage	Before (s)	After (s)	Delta
DenoisingStage (avg/step)	13.72	11.48	-16.3%
DenoisingStage (total)	109.83	91.87	-16.4%
DecodingStage	5.61	5.87	+4.6%
E2E (warmup excluded)	166.76	164.60	-1.3%

Total Kernel Time

Metric	Before (us)	After (us)	Delta
Total kernel time	121,247,239	102,388,825	-15.6%
Attention category	85,530,218 (70.5%)	57,053,099 (55.7%)	-33.3%
GEMM (unchanged)	14,567,915 (12.0%)	14,570,532 (14.2%)	+0.0%

Note: The FP8 attention path introduces additional FP8 quantization/dequantization elementwise kernels (per-tensor quant, abs, max-reduce, fp8 copy), which add ~6.6M us of overhead. Net kernel time savings remain 15.6% after accounting for this overhead.
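The Delta columns follow directly from the raw numbers. As a quick arithmetic check, the snippet below recomputes the relative change and the share of total kernel time from the values in the Total Kernel Time table. The helper names `pct_delta` and `share` are introduced here for illustration only.

```python
def pct_delta(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def share(part: float, total: float) -> float:
    """Part as a percentage of total."""
    return part / total * 100.0


# Values (in microseconds) copied from the Total Kernel Time table.
total_before, total_after = 121_247_239, 102_388_825
attn_before, attn_after = 85_530_218, 57_053_099
gemm_before, gemm_after = 14_567_915, 14_570_532

print(f"total kernel time: {pct_delta(total_before, total_after):+.1f}%")
print(f"attention:         {pct_delta(attn_before, attn_after):+.1f}%")
print(f"GEMM:              {pct_delta(gemm_before, gemm_after):+.1f}%")
print(
    f"attention share:   {share(attn_before, total_before):.1f}%"
    f" -> {share(attn_after, total_after):.1f}%"
)
```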

Accuracy Tests

Validated with Wan2.2-T2V-A14B model on MI355X (gfx950):

Video generation completed successfully with --quantization fp8
Generated video quality matches baseline

Checklist

Format: pre-commit run --all-files
Tested on AMD ROCm MI355X
Backwards compatible: models with embedded quant config still use from_config() path
No changes to public API

… models When --quantization fp8 is used on a model that has no embedded quantization metadata (online quantization), Fp8Config.from_config({}) crashes with ValueError because the empty dict lacks the required quant_method key. Fall back to direct Fp8Config construction with default parameters, which is the correct path for online FP8 quantization of unquantized model weights.

ColinZ22 · 2026-05-26T19:31:52Z

+        try:
+            return quant_cls.from_config({})
+        except (ValueError, KeyError):
+            ignored = getattr(server_args, "quantization_ignored_layers", None)


All default values for online quantization are already set for every supported online quantization method (mxfp4 and fp8), so there is no need to add a try/except. A 1-line fix is sufficient, as detailed here: #26415

yichiche · 2026-05-28

Hi @ColinZ22, thanks for the cleaner fix. Agreed: quant_cls() is the right abstraction and generalizes to mxfp4, so it's strictly better than the FP8-only try/except here. I'll close this PR in favor of #26415.
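The reviewer's point can be illustrated with a small sketch: if every online-quantization config class has usable constructor defaults, calling `quant_cls()` works for any of them without a method-specific fallback. The two classes, their fields, and the `ONLINE_QUANT_CLASSES` registry below are hypothetical stand-ins with assumed names, not sglang's real config classes, and the actual change in #26415 is not shown in this thread.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StandInFp8Config:
    """Hypothetical stand-in; field names are assumptions."""

    activation_scheme: str = "dynamic"
    ignored_layers: List[str] = field(default_factory=list)


@dataclass
class StandInMxfp4Config:
    """Hypothetical stand-in; field names are assumptions."""

    ignored_layers: List[str] = field(default_factory=list)


# Assumed registry mapping the --quantization value to a config class.
ONLINE_QUANT_CLASSES = {"fp8": StandInFp8Config, "mxfp4": StandInMxfp4Config}


def online_quant_config(method: str):
    """Build a config from constructor defaults; no metadata dict is needed."""
    quant_cls = ONLINE_QUANT_CLASSES[method]
    return quant_cls()


for method in ("fp8", "mxfp4"):
    print(method, online_quant_config(method))
```

Because no exception handling is involved, the same call site serves every method in the registry, which is the generalization the reply above refers to.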

github-actions Bot and others added 4 commits May 7, 2026 14:27

docs: sync LMSYS SGLang blog cards

472b94e

Merge branch 'sgl-project:main' into main

09c8c9e

Merge branch 'sgl-project:main' into main

25688fc

yichiche requested review from BBuf, mickqian, ping1jing2, yhyang201 and yingluosanqian as code owners May 25, 2026 06:50

github-actions Bot added the diffusion SGLang Diffusion label May 25, 2026

yichiche added run-ci amd labels May 25, 2026

yichiche mentioned this pull request May 25, 2026

[Diffusion] [AMD] Online MXFP4 and FP8 Quantization for Multimodal Generation #21431

Merged

yichiche force-pushed the fix/diffusion-fp8-aiter-attention branch from e75b492 to 93e01ee Compare May 25, 2026 08:44

yichiche changed the title ~~[Fix] Auto-enable FP8 attention in AITER backend with --quantization fp8~~ [Fix] Enable FP8 attention in AITER backend with --quantization fp8 May 25, 2026

yichiche changed the title ~~[Fix] Enable FP8 attention in AITER backend with --quantization fp8~~ [Fix] Handle online FP8 quantization config for unquantized diffusion models May 25, 2026

HaiShaw approved these changes May 26, 2026

ColinZ22 reviewed May 26, 2026

yichiche wants to merge 4 commits into sgl-project:main from yichiche:fix/diffusion-fp8-aiter-attention
