[Fix] Handle online FP8 quantization config for unquantized diffusion models #26261
Open
yichiche wants to merge 4 commits into
… models
When --quantization fp8 is used on a model that has no embedded
quantization metadata (online quantization), Fp8Config.from_config({})
crashes with ValueError because the empty dict lacks the required
quant_method key. Fall back to direct Fp8Config construction with
default parameters, which is the correct path for online FP8
quantization of unquantized model weights.
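The failure and the proposed fallback can be sketched in isolation. This is a minimal illustration, not the sglang code: the Fp8Config class below is a hypothetical stand-in whose from_config() is assumed to raise ValueError when quant_method is missing, and resolve_quant_config() is an illustrative helper name, not the real _resolve_quant_config() signature.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fp8Config:
    """Hypothetical stand-in for the real config class (assumption)."""

    ignored_layers: Optional[List[str]] = None

    @classmethod
    def from_config(cls, config: dict) -> "Fp8Config":
        # Assumed behaviour: checkpoint metadata must name its quant method.
        if "quant_method" not in config:
            raise ValueError("quant_method missing from quantization config")
        return cls(ignored_layers=config.get("ignored_layers"))


def resolve_quant_config(quant_cls, ignored_layers=None):
    """Illustrative helper: try embedded metadata, else build defaults."""
    try:
        # Unquantized checkpoints carry no metadata, so the dict is empty.
        return quant_cls.from_config({})
    except (ValueError, KeyError):
        # Online quantization: construct the config directly with defaults.
        return quant_cls(ignored_layers=ignored_layers)


cfg = resolve_quant_config(Fp8Config, ignored_layers=["proj_out"])
print(cfg)
```

With the stand-in class, the empty dict triggers the except branch and the config is built directly, carrying only the ignored layers passed in by the caller.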
e75b492 to 93e01ee
HaiShaw approved these changes May 26, 2026
ColinZ22 reviewed May 26, 2026
Comment on lines +500 to +503
    try:
        return quant_cls.from_config({})
    except (ValueError, KeyError):
        ignored = getattr(server_args, "quantization_ignored_layers", None)
Contributor
Default values for online quantization are already set for all supported online quantization methods (mxfp4 and fp8), so there is no need to add a try/except. A one-line fix is sufficient, as detailed in #26415.
Motivation
When running diffusion models with --quantization fp8 on an unquantized model (online FP8 quantization), the server crashes during model loading with a ValueError.
This happens because _resolve_quant_config() calls Fp8Config.from_config({}) with an empty dict: unquantized models have no embedded quantization metadata, so the empty config lacks the required quant_method key.
Modifications
File: python/sglang/multimodal_gen/runtime/loader/transformer_load_utils.py
Add a try/except fallback around quant_cls.from_config({}). When it raises ValueError or KeyError (no quant metadata in config), fall back to direct Fp8Config(ignored_layers=...) construction with default parameters, which is the correct path for online FP8 quantization of unquantized model weights.
Benchmarking
Model: Wan2.2-T2V-A14B | GPU: MI355X (gfx950) | Config: 1 GPU, --quantization fp8, layerwise offload
Attention Kernel Comparison
aiter::fmha_fwd_hd128_bf16
aiter::mla_pfl_qh192_vh128_m32x8_n128x1_causal0
End-to-End Pipeline Timing
Total Kernel Time
Accuracy Tests
Validated with Wan2.2-T2V-A14B model on MI355X (gfx950):
--quantization fp8
Checklist
pre-commit run --all-files
from_config() path