
[Bugfix] Fix broken fp8 quantisation on Z-Image-Turbo, Qwen-Image, FLUX.1-dev#2795

Merged
david6666666 merged 8 commits into vllm-project:main from zhangj1an:jian/fp8_zimage

Apr 15, 2026

Conversation

@zhangj1an
Contributor

@zhangj1an zhangj1an commented Apr 14, 2026

Purpose

Closes #2728.

Key change: for precision-sensitive layers, set quant_config=None so they stay in BF16. This fix aligns with what was done for Hunyuan-Image-3 (hunyuan_image_3_transformer.py:1487).
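As an illustration only, a minimal sketch of the pattern (the factory and layer names below are simplified placeholders, not the actual vllm-omni constructors):

```python
# Illustrative sketch only — not the exact vllm-omni API.
# Precision-sensitive layers are built with quant_config=None so they stay in
# high precision (BF16), while main-path matmuls keep the FP8 quant_config.
import torch.nn as nn

def make_linear(in_dim: int, out_dim: int, quant_config=None) -> nn.Module:
    # Stand-in for the real linear factory: with a quant_config the layer
    # would be FP8-quantized; with None it is a plain BF16 linear.
    return nn.Linear(in_dim, out_dim)

class Block(nn.Module):
    def __init__(self, dim: int, quant_config):
        super().__init__()
        # Main attention / FFN path: keep FP8 (pass the quant_config through).
        self.to_qkv = make_linear(dim, 3 * dim, quant_config=quant_config)
        self.w13 = make_linear(dim, 8 * dim, quant_config=quant_config)
        # Precision-sensitive modulation: force full precision.
        self.adaln_modulation = make_linear(dim, 6 * dim, quant_config=None)
```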

Comparison

| Model | BF16 baseline | FP8 (before fix) | FP8 (after fix) | Mem reduction (after fix) |
|---|---|---|---|---|
| Z-Image-Turbo | zimage_bf16 · peak 24.5 GB | zimage_fp8_before · peak 19.0 GB · LPIPS 0.74 — noise | zimage_fp8_after · peak 19.3 GB · LPIPS 0.07 (PASS, threshold 0.10) | −21.4 % (24.5 → 19.3 GB) |
| Qwen-Image | qwen_bf16 · peak 58.6 GB | qwen_fp8_before · peak 41.4 GB · LPIPS 0.95 — noise | qwen_fp8_after · peak ~41 GB · LPIPS 0.32 (PASS, threshold 0.35) | −29.4 % (58.6 → 41.4 GB) |
| FLUX.1-dev | flux_bf16 · peak 36.7 GB | flux_fp8_before · peak 26.8 GB · LPIPS 0.93 — noise | flux_fp8_after · peak 33.2 GB · LPIPS 0.12 (PASS, threshold 0.20) | −9.5 % (36.7 → 33.2 GB) |

Model layers that remain FP8

Z-Image-Turbo

| Layer | Role |
|---|---|
| ZImageAttention.to_qkv · to_out[0] | main attention path |
| FeedForward.w13 · w2 | per-block FFN |

Qwen-Image

| Layer | Role |
|---|---|
| QwenImageCrossAttention.to_qkv · to_out · add_kv_proj · to_add_out | main joint-attention matmuls |
| FeedForward.net[0].proj (GELU) · net[2] (img_mlp, txt_mlp) | per-block FFN on both streams |

FLUX.1-dev

| Layer | Role |
|---|---|
| FluxSingleTransformerBlock.proj_mlp | per-block dim → 4·dim GELU input projection |
| FluxSingleTransformerBlock.proj_out | per-block dim + 4·dim → dim recombiner |
| FluxSingleTransformerBlock.attn.to_qkv | per-block attention QKV |

FLUX.1-dev interprets the prompt liberally at this step count and guidance scale. The BF16 baseline composes a brunch scene (pasta, bread, sauce) with the cup of coffee sitting in the top-right corner rather than as the main subject. This is FLUX's own behaviour and unrelated to quantisation.

Test Plan

You can reproduce the images above with the commands below.

Z-Image-Turbo
# BF16 baseline
python examples/offline_inference/text_to_image/text_to_image.py \
    --model Tongyi-MAI/Z-Image-Turbo \
    --prompt "a cup of coffee on the table" \
    --seed 42 --num-inference-steps 50 \
    --height 1024 --width 1024 --cfg-scale 4.0 \
    --output out/zimage_bf16.png

# FP8 (with this fix)
python examples/offline_inference/text_to_image/text_to_image.py \
    --model Tongyi-MAI/Z-Image-Turbo \
    --prompt "a cup of coffee on the table" \
    --seed 42 --num-inference-steps 50 \
    --height 1024 --width 1024 --cfg-scale 4.0 \
    --quantization fp8 \
    --output out/zimage_fp8.png

python tests/e2e/offline_inference/compute_lpips.py \
    --image-dir out --threshold 0.10

Expected: LPIPS ≈ 0.07, PASS.
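For reference, a rough sketch of what an LPIPS pass/fail check does, written against the standalone lpips package (file names below are placeholders; this is not the project's compute_lpips.py):

```python
# Rough LPIPS pass/fail sketch using the `lpips` package; the paths and the
# 0.10 threshold mirror the Z-Image example above, but this is not the
# project's actual script.
import lpips
import numpy as np
import torch
from PIL import Image

def load(path: str) -> torch.Tensor:
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0

loss_fn = lpips.LPIPS(net="alex")  # AlexNet-backbone LPIPS
score = loss_fn(load("out/zimage_bf16.png"), load("out/zimage_fp8.png")).item()
print(f"LPIPS = {score:.4f}", "PASS" if score <= 0.10 else "FAIL")
```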

Qwen-Image
# BF16 baseline
python examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --seed 142 --num-inference-steps 50 \
    --height 1024 --width 1024 --cfg-scale 4.0 \
    --output out/qwen_bf16.png

# FP8 (with this fix)
python examples/offline_inference/text_to_image/text_to_image.py \
    --model Qwen/Qwen-Image \
    --prompt "a cup of coffee on the table" \
    --seed 142 --num-inference-steps 50 \
    --height 1024 --width 1024 --cfg-scale 4.0 \
    --quantization fp8 \
    --output out/qwen_fp8.png

python tests/e2e/offline_inference/compute_lpips.py \
    --image-dir out --threshold 0.35

Expected: LPIPS ≈ 0.32, PASS. Adding --ignored-layers img_mlp,txt_mlp brings LPIPS down to ~0.003 at the cost of ~9 GB extra peak memory.

FLUX.1-dev
# BF16 baseline
python examples/offline_inference/text_to_image/text_to_image.py \
    --model black-forest-labs/FLUX.1-dev \
    --prompt "a cup of coffee on the table" \
    --seed 42 --num-inference-steps 20 \
    --height 1024 --width 1024 \
    --guidance-scale 3.5 --cfg-scale 1.0 \
    --output out/flux_bf16.png

# FP8 (with this fix)
python examples/offline_inference/text_to_image/text_to_image.py \
    --model black-forest-labs/FLUX.1-dev \
    --prompt "a cup of coffee on the table" \
    --seed 42 --num-inference-steps 20 \
    --height 1024 --width 1024 \
    --guidance-scale 3.5 --cfg-scale 1.0 \
    --quantization fp8 \
    --output out/flux_fp8.png

python tests/e2e/offline_inference/compute_lpips.py \
    --image-dir out --threshold 0.20

Expected: LPIPS ≈ 0.12, PASS.

Re-running the existing pytests also passes. The relevant tests:
  • tests/diffusion/models/flux/test_flux_prefix_propagation.py
  • tests/diffusion/models/qwen_image/test_qwen_image_size_utils.py
  • tests/diffusion/models/z_image/test_zimage_tp_constraints.py

The result is

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
../../usr/local/lib/python3.11/dist-packages/torch/jit/_script.py:362: 14 warnings
  /usr/local/lib/python3.11/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
12 passed, 16 warnings in 2.36s

Test Result

Tested on an H100 80 GB.

| Model | LPIPS before | LPIPS after | Threshold | BF16 peak | FP8 peak (after) | Mem Δ |
|---|---|---|---|---|---|---|
| Z-Image-Turbo | 0.7443 | 0.0659 | 0.10 | 24.5 GB | 19.3 GB | −21.4 % |
| Qwen-Image | 0.9470 | 0.3193 | 0.35 | 58.6 GB | ~41 GB | −29.4 % |
| FLUX.1-dev | 0.9257 | 0.1201 | 0.20 | 36.7 GB | 33.2 GB | −9.5 % |

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs. Note: docs/user_guide/diffusion/quantization/fp8.md should be updated to drop the "all layers can be quantized" claim for Z-Image / FLUX and to note the new in-source defaults; this is left out of this PR to keep it focused.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md

zhangj1an and others added 4 commits April 14, 2026 08:37
…r FP8

  (vllm-project#2728)

  FP8 online quantization on Z-Image-Turbo produced pure pixel noise
  (LPIPS 0.74 vs BF16). Root cause: small precision-sensitive layers —
  TimestepEmbedder MLP, x_embedder, cap_embedder, per-block
  adaLN_modulation, and FinalLayer's output/modulation — were being
  FP8-quantized. Errors on these layers feed the shift/scale/gate
  chain that multiplies the residual stream every block, so small
  per-layer drift turns into catastrophic magnitude blow-up by layer 30.

  Mirrors the earlier OmniGen2 FP8 fix (dbf8b7c). Swap these 6 layers
  from FP8-quantized linears to plain BF16 linears (quant_config=None).
  Main-path matmuls (to_qkv, to_out, feed_forward.w13,
  feed_forward.w2) stay FP8, so the memory win is preserved.

  After fix: LPIPS 0.0659 (PASS, threshold 0.1).

Signed-off-by: Zhang <jianmusings@gmail.com>
…under FP8

  (vllm-project#2728)

  FP8 online quantization on Qwen-Image produced pure pixel noise
  (LPIPS 0.95 vs BF16). Same root cause as the Z-Image fix: precision-
  sensitive small layers (time embedder, img_in/txt_in entry, per-block
  img_mod/txt_mod modulation, norm_out.linear, proj_out) feed the
  shift/scale/gate chain that multiplies the residual stream every
  block, so small per-layer drift blows up into noise.

  After fix: LPIPS 0.32 (PASS, Qwen-Image threshold 0.35). Main-path
  matmuls (to_qkv, to_out, add_kv_proj, to_add_out, img_mlp, txt_mlp)
  remain FP8 for memory savings — peak ~41 GB vs ~59 GB BF16.

Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
…er FP8

  (vllm-project#2728)

  FP8 online quantization on FLUX.1-dev produced pure pixel noise (LPIPS
  0.93 vs BF16). Unlike Z-Image/Qwen where the modulation/embedder
  pattern was enough, FLUX's dual-stream blocks (19 FluxTransformerBlock)
  run joint attention over concatenated [text, image] tokens — the
  mixed-distribution activations don't tolerate FP8 per-token quant,
  and neither the attn nor ff sub-layers can individually take FP8.

  Keep dual blocks fully BF16 and keep per-block modulation and final
  norm_out unquantized. Single blocks (38 of them, ~2× more params than
  the dual blocks) remain FP8, preserving most of the memory saving.

  After fix: LPIPS 0.1201 (PASS, FLUX threshold 0.20). Peak 33.2 GB vs
  BF16 36.7 GB (saves ~3.5 GB; less than Z-Image/Qwen because the bulk
  of dual-block params stays BF16).

Co-Authored-By: pjh4993 <pjh4993@naver.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 14, 2026
Signed-off-by: Zhang <jianmusings@gmail.com>
@zhangj1an zhangj1an marked this pull request as ready for review April 14, 2026 12:45
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@david6666666
Collaborator

Try Qwen-Image with ignored layers as in #1034, e.g. "ignored_layers": "img_mlp", and test LPIPS again, thx.

@zhangj1an
Contributor Author

zhangj1an commented Apr 14, 2026

Hey @david6666666, thanks for the review comment!

TL;DR: For Qwen-Image, the image generation quality from this PR is higher than #1034's (LPIPS 0.17 vs. 0.39), but the memory saving dropped from 27% to 16%. Across both PRs we have 9 candidate layers to skip in total; I guess we just need to skip enough layers to bring LPIPS below the 0.35 threshold.

Following #1034, I will refactor this PR tomorrow to use ignored_layers for the 3 models instead of hardcoding the skipped layers. The updated PR would then mainly change docs rather than code files.

Update: due to the precision-critical nature of the layers identified in this PR, I decided to hard-code these layers as skipped, following hunyuan_image_3_transformer.py:1487, instead of allowing users to optionally skip them via ignored_layers. Since they are already hardcoded, there is no need to update the recommended layers to optionally skip in vllm-omni/docs/contributing/model/adding_quantization_model.md. So this PR is good as it is.

cc: @lishunyang12


Qwen-Image Results

LPIPS score (quality)

  • I just realised #1034 ([Feature]: FP8 Quantization Support for DiT) was testing on Qwen/Qwen-Image-2512 (the most recent version, Dec 2025), while this PR tested on Qwen/Qwen-Image (an older, deprecated version). The original statement in #1034 is true for the most recent checkpoint; it is just not applicable to the deprecated version (0.39 vs 0.95 with only img_mlp skipped). I will update the docs to drop support for the deprecated version.
  • The combination of 8 embedder/modulation/input/output linears carries more sensitivity on Qwen/Qwen-Image-2512 than img_mlp does.
| variant | Qwen/Qwen-Image | Qwen/Qwen-Image-2512 | threshold 0.35 |
|---|---|---|---|
| pr1034 | 0.9537 | 0.3910 | ❌ fails both |
| pr2795 | 0.1650 | 0.1739 | ✅ passes both |
| union | 0.0145 | 0.0239 | ✅ passes both, large margin |

peak GPU + runtime on Qwen/Qwen-Image-2512

| variant | peak GPU mem | Δ vs BF16 | inference dt |
|---|---|---|---|
| BF16 baseline | 55.81 GiB | — | 4.05 s |
| pr1034 | 41.03 GiB | −26.5% | 4.06 s |
| pr2795 | 46.77 GiB | −16.2% | 3.85 s |
| union | 47.40 GiB | −15.1% | 4.07 s |

Details for each setting

I ran an LPIPS test on both Qwen/Qwen-Image and Qwen/Qwen-Image-2512 comparing three variants. Same prompt, seed 142, 1024×1024, 20 steps, H100, threshold 0.35.

  • pr1034 refers to the setting in #1034 ([Feature]: FP8 Quantization Support for DiT).
    ignored_layers = ["img_mlp"]
    
  • pr2795 refers to the setting in this PR.
    ignored_layers = [
        "timestep_embedder.linear_1",
        "timestep_embedder.linear_2",
        "img_mod.1",
        "txt_mod.1",
        "img_in",
        "txt_in",
        "norm_out.linear",
        "proj_out",
    ]
    
    
  • union means the union set from both settings.
  ignored_layers = [
      "timestep_embedder.linear_1",
      "timestep_embedder.linear_2",
      "img_mod.1",
      "txt_mod.1",
      "img_in",
      "txt_in",
      "norm_out.linear",
      "proj_out",
      "img_mlp",
  ]
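For context, ignored-layer filters like these are typically applied by substring matching against each module's fully qualified name; a minimal sketch of that assumed behaviour (not the exact vllm-omni matcher):

```python
# Assumed semantics, not the exact vllm-omni matcher: a layer keeps full
# precision if any ignored_layers entry occurs in its fully qualified name.
def is_ignored(module_name: str, ignored_layers: list[str]) -> bool:
    return any(pattern in module_name for pattern in ignored_layers)

ignored = ["img_mod.1", "txt_mod.1", "img_in", "txt_in", "norm_out.linear"]
assert is_ignored("transformer_blocks.3.img_mod.1", ignored)
assert not is_ignored("transformer_blocks.3.attn.to_qkv", ignored)
```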

@david6666666
Collaborator

@lishunyang12 ptal thx

@lishunyang12 lishunyang12 added ready label to trigger buildkite CI and removed ready label to trigger buildkite CI labels Apr 15, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment


Thanks for the thorough investigation — the LPIPS + visual comparison matrix across all three models makes verification trivial, and the per-model table of which layers stay FP8 is super helpful for future maintainers. The quant_config=None pattern on modulation / entry / final-projection layers mirrors the Hunyuan-Image-3 precedent cleanly, and referencing #2728 in every code comment keeps the rationale traceable.

LGTM.

@lishunyang12 lishunyang12 added ready label to trigger buildkite CI quantization Code related to quantization labels Apr 15, 2026
@david6666666
Collaborator

LGTM. Thank you for your contribution

@david6666666 david6666666 enabled auto-merge (squash) April 15, 2026 13:56
@david6666666 david6666666 merged commit c6d76d0 into vllm-project:main Apr 15, 2026
7 of 8 checks passed
y123456y78 pushed a commit to y123456y78/vllm-omni that referenced this pull request Apr 15, 2026
…UX.1-dev (vllm-project#2795)

Signed-off-by: Zhang <jianmusings@gmail.com>
Co-authored-by: pjh4993 <pjh4993@naver.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…eo-1.5

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps,
  skips precision-sensitive layers (modulation, embeddings, output proj,
  token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers
  by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920).

vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects
  ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter).

Signed-off-by: lishunyang <lishunyang12@163.com>
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 20, 2026
…UX.1-dev (vllm-project#2795)

Signed-off-by: Zhang <jianmusings@gmail.com>
Co-authored-by: pjh4993 <pjh4993@naver.com>

Labels

quantization (Code related to quantization) · ready (label to trigger buildkite CI)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: fp8 online quantization produces catastrophic LPIPS regression across diffusion transformers (Z-Image, FLUX.1-dev, Qwen-Image)

4 participants