[Fix] Handle online FP8 quantization config for unquantized diffusion models#26261

Open
yichiche wants to merge 4 commits into sgl-project:main from yichiche:fix/diffusion-fp8-aiter-attention

@yichiche yichiche commented May 25, 2026

Motivation

When running diffusion models with --quantization fp8 on an unquantized model (online FP8 quantization), the server crashes during model loading with:

ValueError: Cannot find any of ['quant_method'] in the model's quantization config

This happens because _resolve_quant_config() calls Fp8Config.from_config({}) with an empty dict — unquantized models have no embedded quantization metadata, so the empty config lacks the required quant_method key.

Modifications

File: python/sglang/multimodal_gen/runtime/loader/transformer_load_utils.py

Add a try/except fallback around quant_cls.from_config({}). When it raises ValueError or KeyError (no quant metadata in config), fall back to direct Fp8Config(ignored_layers=...) construction with default parameters — the correct path for online FP8 quantization of unquantized model weights.
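The failure and the fallback described above can be sketched in isolation. This is a minimal illustration, not the code in transformer_load_utils.py: `ToyFp8Config` is a hypothetical stand-in for the real config class, `resolve_quant_config` is a hypothetical helper name, and only the `from_config({})` call, the caught exception types, and the `quantization_ignored_layers` lookup are taken from this PR.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


class ToyFp8Config:
    """Hypothetical stand-in for a quantization config class (not sglang's Fp8Config)."""

    def __init__(self, ignored_layers: Optional[List[str]] = None) -> None:
        self.ignored_layers = ignored_layers or []

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "ToyFp8Config":
        # Mimics a loader that requires serialized quantization metadata.
        if "quant_method" not in config:
            raise ValueError(
                "Cannot find any of ['quant_method'] in the model's quantization config"
            )
        return cls(ignored_layers=config.get("ignored_layers"))


def resolve_quant_config(quant_cls, server_args):
    """Try the metadata path first; fall back to default construction."""
    try:
        return quant_cls.from_config({})
    except (ValueError, KeyError):
        # No embedded quant metadata: treat this as online quantization.
        ignored = getattr(server_args, "quantization_ignored_layers", None)
        return quant_cls(ignored_layers=ignored)


# An unquantized model supplies no metadata, so the fallback path is taken.
args = SimpleNamespace(quantization_ignored_layers=["proj_out"])
cfg = resolve_quant_config(ToyFp8Config, args)
print(cfg.ignored_layers)
```

The `getattr` default keeps the sketch working when the server arguments do not define `quantization_ignored_layers` at all, in which case the toy class falls back to an empty list.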

Benchmarking

Model: Wan2.2-T2V-A14B | GPU: MI355X (gfx950) | Config: 1 GPU, --quantization fp8, layerwise offload

Attention Kernel Comparison

| Metric | Before (BF16 attn) | After (FP8 attn) | Delta |
| --- | --- | --- | --- |
| Kernel | aiter::fmha_fwd_hd128_bf16 | aiter::mla_pfl_qh192_vh128_m32x8_n128x1_causal0 | |
| Avg per call (us) | 66,417 | 44,166 | -33.5% |
| Total attention time (us) | 85,014,040 | 56,532,228 | -33.5% |
| % of total kernel time | 70.1% | 55.2% | -14.9pp |

End-to-End Pipeline Timing

| Stage | Before (s) | After (s) | Delta |
| --- | --- | --- | --- |
| DenoisingStage (avg/step) | 13.72 | 11.48 | -16.3% |
| DenoisingStage (total) | 109.83 | 91.87 | -16.4% |
| DecodingStage | 5.61 | 5.87 | +4.6% |
| E2E (warmup excluded) | 166.76 | 164.60 | -1.3% |

Total Kernel Time

| Metric | Before (us) | After (us) | Delta |
| --- | --- | --- | --- |
| Total kernel time | 121,247,239 | 102,388,825 | -15.6% |
| Attention category | 85,530,218 (70.5%) | 57,053,099 (55.7%) | -33.3% |
| GEMM (unchanged) | 14,567,915 (12.0%) | 14,570,532 (14.2%) | +0.0% |

Note: The FP8 attention path introduces additional FP8 quantization/dequantization elementwise kernels (per-tensor quant, abs, max-reduce, fp8 copy), which add ~6.6M us of overhead. Net kernel time savings remain 15.6% after accounting for this overhead.
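The reported deltas can be re-derived from the raw before/after values in the Total Kernel Time table. A small sketch, using only numbers quoted in this description (`pct_delta` is a hypothetical helper name):

```python
def pct_delta(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values in microseconds, copied from the Total Kernel Time table.
total_delta = pct_delta(121_247_239, 102_388_825)
attention_delta = pct_delta(85_530_218, 57_053_099)

print(f"total kernel time: {total_delta:.1f}%")
print(f"attention category: {attention_delta:.1f}%")
```

Both values round to the figures in the table (-15.6% and -33.3%).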

Accuracy Tests

Validated with Wan2.2-T2V-A14B model on MI355X (gfx950):

  • Video generation completed successfully with --quantization fp8
  • Generated video quality matches baseline

Checklist

  • Format: pre-commit run --all-files
  • Tested on AMD ROCm MI355X
  • Backwards compatible: models with embedded quant config still use from_config() path
  • No changes to public API

github-actions Bot and others added 4 commits May 7, 2026 14:27
… models

When --quantization fp8 is used on a model that has no embedded
quantization metadata (online quantization), Fp8Config.from_config({})
crashes with ValueError because the empty dict lacks the required
quant_method key.  Fall back to direct Fp8Config construction with
default parameters, which is the correct path for online FP8
quantization of unquantized model weights.

@github-actions github-actions Bot added the diffusion SGLang Diffusion label May 25, 2026
@yichiche yichiche force-pushed the fix/diffusion-fp8-aiter-attention branch from e75b492 to 93e01ee Compare May 25, 2026 08:44
@yichiche yichiche changed the title from "[Fix] Auto-enable FP8 attention in AITER backend with --quantization fp8" to "[Fix] Enable FP8 attention in AITER backend with --quantization fp8" May 25, 2026
@yichiche yichiche changed the title from "[Fix] Enable FP8 attention in AITER backend with --quantization fp8" to "[Fix] Handle online FP8 quantization config for unquantized diffusion models" May 25, 2026
Comment on lines +500 to +503
    try:
        return quant_cls.from_config({})
    except (ValueError, KeyError):
        ignored = getattr(server_args, "quantization_ignored_layers", None)

Default values for online quantization are already set for all supported online quantization methods (mxfp4 and fp8), so there is no need to add a try/except. A one-line fix is sufficient, as detailed in #26415.

Hi @ColinZ22, thanks for the cleaner fix — agreed, quant_cls() is the right abstraction and generalizes to mxfp4, so it's strictly better than the FP8-only try/except here. I'll close this PR in favor of #26415.
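
A minimal sketch of the no-argument construction discussed in this thread, assuming each online-quantization config class supplies defaults for every constructor parameter. `ToyFp8Config`, `ToyMxfp4Config`, `QUANT_CLASSES`, `resolve_online_quant_config`, and the field names are hypothetical stand-ins, not sglang's classes and not the actual change in #26415.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyFp8Config:
    # Hypothetical defaults; the real class's parameters may differ.
    activation_scheme: str = "dynamic"
    ignored_layers: List[str] = field(default_factory=list)


@dataclass
class ToyMxfp4Config:
    # Hypothetical defaults for a second online-quantization method.
    ignored_layers: List[str] = field(default_factory=list)


QUANT_CLASSES = {"fp8": ToyFp8Config, "mxfp4": ToyMxfp4Config}


def resolve_online_quant_config(method: str):
    """Build a config from class defaults; no serialized metadata is needed."""
    quant_cls = QUANT_CLASSES[method]
    return quant_cls()


for name in ("fp8", "mxfp4"):
    print(name, resolve_online_quant_config(name))
```

Because construction relies only on each class's own defaults, the same call covers every registered method without a per-method exception handler.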
