[Bugfix] Fix broken fp8 quantisation on Z-Image-Turbo, Qwen-Image, FLUX.1-dev#2795
Conversation
…r FP8 (vllm-project#2728) FP8 online quantization on Z-Image-Turbo produced pure pixel noise (LPIPS 0.74 vs BF16). Root cause: small precision-sensitive layers — TimestepEmbedder MLP, x_embedder, cap_embedder, per-block adaLN_modulation, and FinalLayer's output/modulation — were being FP8-quantized. Errors on these layers feed the scale chain that multiplies the residual stream every block, so small per-layer drift turns into catastrophic magnitude blow-up by layer 30. Mirrors the earlier OmniGen2 FP8 fix (dbf8b7c). Swap these 6 layers from FP8-quantized linears to plain BF16 linears. Main-path matmuls (to_qkv, to_out, feed_forward.w13, feed_forward.w2) stay FP8, so the memory win is preserved. After fix: LPIPS 0.0659 (PASS, threshold 0.1). Signed-off-by: Zhang <jianmusings@gmail.com>
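As a toy illustration of the blow-up mechanism this commit describes (the 5% per-block figure below is hypothetical, not a measurement): a small relative error on each block's adaLN scale compounds multiplicatively through the residual stream.

```python
# Toy illustration only: multiplicative drift through a 30-block scale chain.
# The 5% per-block scale error is a hypothetical figure, not a measurement.
per_block_error = 1.05
blocks = 30
print(f"magnitude blow-up after {blocks} blocks: {per_block_error ** blocks:.2f}x")
# -> magnitude blow-up after 30 blocks: 4.32x
```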
…under FP8 (vllm-project#2728) FP8 online quantization on Qwen-Image produced pure pixel noise (LPIPS 0.95 vs BF16). Same root cause as the Z-Image fix: precision-sensitive small layers (time embedder, img_in/txt_in entry, per-block img_mod/txt_mod modulation, norm_out.linear, proj_out) feed the shift/scale/gate chain that multiplies the residual stream every block, so small per-layer drift blows up into noise. After fix: LPIPS 0.32 (PASS, Qwen-Image threshold 0.35). Main-path matmuls (to_qkv, to_out, add_kv_proj, to_add_out, img_mlp, txt_mlp) remain FP8 for memory savings — peak ~41 GB vs ~59 GB BF16. Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
…er FP8 (vllm-project#2728) FP8 online quantization on FLUX.1-dev produced pure pixel noise (LPIPS 0.93 vs BF16). Unlike Z-Image/Qwen where the modulation/embedder pattern was enough, FLUX's dual-stream blocks (19 FluxTransformerBlock) run joint attention over concatenated [text, image] tokens — the mixed-distribution activations don't tolerate FP8 per-token quant, and neither the attn nor ff sub-layers can individually take FP8. Keep dual blocks fully BF16 and keep per-block modulation and final norm_out unquantized. Single blocks (38 of them, ~2× more params than dual) remain FP8, preserving most of the memory saving. After fix: LPIPS 0.1201 (PASS, FLUX threshold 0.20). Peak 33.2 GB vs BF16 36.7 GB (saves ~3.5 GB; less than Z-Image/Qwen because the bulk of dual-block params stays BF16). Co-Authored-By: pjh4993 <pjh4993@naver.com> Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
For Qwen-Image, try adding `"ignored_layers": "img_mlp"` as in #1034 and test LPIPS again, thx.
Hey @david6666666, thanks for the review comment! TLDR: For Qwen-Image, image generation quality for this PR is higher than #1034 (LPIPS 0.17 vs. 0.39; lower is better), but the memory saving dropped from 27% to 16%. Across both PRs we have 9 candidate layers to skip in total; I guess we just need to skip enough layers to get LPIPS below the 0.35 threshold.
Update: Due to the precision-critical nature of the layers identified in this PR, I decided to hard-code these layers as skipped (following `hunyuan_image_3_transformer.py:1487`) instead of allowing users to optionally skip them via `ignored_layers`. cc: @lishunyang12

Qwen-Image Results

LPIPS score (quality)

peak GPU + runtime on H100
| variant | peak GPU | mem Δ vs BF16 | inference dt |
|---|---|---|---|
| BF16 baseline | 55.81 GiB | — | 4.05 s |
| pr1034 | 41.03 GiB | −26.5% | 4.06 s |
| pr2795 | 46.77 GiB | −16.2% | 3.85 s |
| union | 47.40 GiB | −15.1% | 4.07 s |
Details for each setting
I ran an LPIPS test on both Qwen/Qwen-Image and Qwen/Qwen-Image-2512 comparing three variants. Same prompt, seed 142, 1024×1024, 20 steps, H100, threshold 0.35.
- `pr1034` refers to the setting in [Feature]: FP8 Quantization Support for DiT #1034: `ignored_layers = ["img_mlp"]`
- `pr2795` refers to the setting in this PR: `ignored_layers = ["timestep_embedder.linear_1", "timestep_embedder.linear_2", "img_mod.1", "txt_mod.1", "img_in", "txt_in", "norm_out.linear", "proj_out"]`
- `union` means the union set from both settings:

```python
ignored_layers = [
    "timestep_embedder.linear_1",
    "timestep_embedder.linear_2",
    "img_mod.1",
    "txt_mod.1",
    "img_in",
    "txt_in",
    "norm_out.linear",
    "proj_out",
    "img_mlp",
]
```
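For context, a hedged sketch of how the comparison settings above (seed 142, 1024×1024, 20 steps) map onto a plain diffusers call; the table numbers come from vllm-omni's own harness, and the prompt string here is a placeholder.

```python
import torch
from diffusers import DiffusionPipeline

# Sketch only: the comparison settings (seed 142, 1024x1024, 20 steps)
# expressed as a plain diffusers call. The benchmark itself runs via
# vllm-omni, not raw diffusers.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
    prompt="<same prompt across all variants>",  # placeholder
    width=1024,
    height=1024,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(142),
).images[0]
image.save("qwen_image_reference.png")
```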
@lishunyang12 ptal thx
lishunyang12
left a comment
Thanks for the thorough investigation — the LPIPS + visual comparison matrix across all three models makes verification trivial, and the per-model table of which layers stay FP8 is super helpful for future maintainers. The quant_config=None pattern on modulation / entry / final-projection layers mirrors the Hunyuan-Image-3 precedent cleanly, and referencing #2728 in every code comment keeps the rationale traceable.
LGTM.
LGTM. Thank you for your contribution.
…UX.1-dev (vllm-project#2795) Signed-off-by: Zhang <jianmusings@gmail.com> Co-authored-by: pjh4993 <pjh4993@naver.com>
…eo-1.5 examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920). vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter). Signed-off-by: lishunyang <lishunyang12@163.com>
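A hedged sketch of the ModelOpt calibration pattern that commit describes; the wildcard patterns and the stand-in module are illustrative, not the actual helper script.

```python
import copy

import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Sketch only: a stand-in module plus ModelOpt's FP8 flow with
# precision-sensitive name patterns disabled (patterns are illustrative,
# mirroring the #2728 / #2795 skip list).
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
for pattern in ("*modulation*", "*embed*", "*proj_out*", "*token_refiner*"):
    quant_cfg["quant_cfg"][pattern] = {"enable": False}

def forward_loop(m):
    # Calibration pass so activation ranges are observed; the real helper
    # runs 8 video prompts x 10 denoising steps through the pipeline.
    for _ in range(8):
        m(torch.randn(2, 64))

model = mtq.quantize(model, quant_cfg, forward_loop)
```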
Purpose
Closes #2728.
Key changes: for precision-sensitive layers, set `quant_config=None`. This fix aligns with what was done for Hunyuan-Image-3 (`hunyuan_image_3_transformer.py:1487`).
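A minimal sketch of the pattern, assuming vLLM-style quantizable linear layers (the exact module paths in vllm-omni may differ; this is not the PR's diff):

```python
import torch.nn as nn
from vllm.model_executor.layers.linear import ColumnParallelLinear

# Sketch, not the PR's diff: main-path matmuls receive the user's
# quant_config (FP8 when enabled), while precision-sensitive layers are
# pinned to plain BF16 weights by passing quant_config=None.
class BlockSketch(nn.Module):
    def __init__(self, dim: int, quant_config):
        super().__init__()
        # Main path: stays FP8 under online quantization.
        self.to_qkv = ColumnParallelLinear(dim, 3 * dim, quant_config=quant_config)
        # Modulation path: always unquantized, regardless of user setting.
        self.adaln_modulation = ColumnParallelLinear(dim, 6 * dim, quant_config=None)
```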
Comparison

| model | BF16 baseline | FP8 before fix | FP8 after fix |
|---|---|---|---|
| Z-Image-Turbo | peak 24.5 GB | peak 19.0 GB · LPIPS 0.74 — noise | peak 19.3 GB · LPIPS 0.07 (PASS, threshold 0.10) |
| Qwen-Image | peak 58.6 GB | peak 41.4 GB · LPIPS 0.95 — noise | peak ~41 GB · LPIPS 0.32 (PASS, threshold 0.35) |
| FLUX.1-dev | peak 36.7 GB | peak 26.8 GB · LPIPS 0.93 — noise | peak 33.2 GB · LPIPS 0.12 (PASS, threshold 0.20) |
Model layers that remain FP8
Z-Image-Turbo

- `ZImageAttention.to_qkv` · `to_out[0]`
- `FeedForward.w13` · `w2`

Qwen-Image

- `QwenImageCrossAttention.to_qkv` · `to_out` · `add_kv_proj` · `to_add_out`
- `FeedForward.net[0].proj` (GELU) · `net[2]` (`img_mlp`, `txt_mlp`)

FLUX.1-dev

- `FluxSingleTransformerBlock.proj_mlp`
- `FluxSingleTransformerBlock.proj_out` (the `dim + 4·dim → dim` recombiner; sketched below)
- `FluxSingleTransformerBlock.attn.to_qkv`

FLUX.1-dev takes the prompt liberally at this step count and guidance. The BF16 baseline composes a brunch scene (pasta, bread, sauce) with the cup of coffee sitting in the top-right corner, rather than the coffee being the main subject. This is FLUX's own behaviour and unrelated to quantisation.
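A shape-level sketch of the single-block recombiner mentioned above, simplified from diffusers' `FluxSingleTransformerBlock` (attention is stubbed out; only the concat-and-project structure is shown):

```python
import torch
import torch.nn as nn

class SingleBlockRecombiner(nn.Module):
    """Simplified sketch of FLUX's single-stream concat-and-project path."""
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.proj_mlp = nn.Linear(dim, mlp_ratio * dim)        # stays FP8
        self.act = nn.GELU(approximate="tanh")
        # Recombiner: concat([attn_out (dim), mlp_hidden (4*dim)]) -> dim
        self.proj_out = nn.Linear(dim + mlp_ratio * dim, dim)  # stays FP8

    def forward(self, hidden: torch.Tensor, attn_out: torch.Tensor):
        mlp_hidden = self.act(self.proj_mlp(hidden))
        return self.proj_out(torch.cat([attn_out, mlp_hidden], dim=-1))

block = SingleBlockRecombiner(3072)  # FLUX hidden size is 3072
x = torch.randn(1, 16, 3072)
print(block(x, x).shape)             # torch.Size([1, 16, 3072])
```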
Test Plan
You can reproduce the images listed above using the commands below.
Z-Image-Turbo
Expected: LPIPS ≈ 0.07, PASS.
Qwen-Image
Expected: LPIPS ≈ 0.32, PASS. Add `--ignored-layers img_mlp,txt_mlp` for ~0.003 LPIPS at +9 GB.

FLUX.1-dev
Expected: LPIPS ≈ 0.12, PASS.
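For reference, a minimal sketch of an LPIPS check like the ones above, using the `lpips` package; the file paths are placeholders and the repo's test harness may compute it differently.

```python
import lpips
import numpy as np
import torch
from PIL import Image

def to_lpips_tensor(path: str) -> torch.Tensor:
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, a common default
score = loss_fn(to_lpips_tensor("bf16.png"), to_lpips_tensor("fp8.png")).item()
print(f"LPIPS: {score:.4f} ({'PASS' if score < 0.20 else 'FAIL'} at FLUX threshold 0.20)")
```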
Re-running all pytests also works. For these tests:

- `tests/diffusion/models/flux/test_flux_prefix_propagation.py`
- `tests/diffusion/models/qwen_image/test_qwen_image_size_utils.py`
- `tests/diffusion/models/z_image/test_zimage_tp_constraints.py`

The result is
Test Result
Tested on an H100 80 GB.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`. — `docs/user_guide/diffusion/quantization/fp8.md` should be updated to drop the "all layers can be quantized" claim for Z-Image / FLUX and to note the new in-source defaults; left out of this PR to keep it focused.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md