[Bugfix] Fix broken fp8 quantisation on Z-Image-Turbo, Qwen-Image, FLUX.1-dev#2795
Conversation
…r FP8 (vllm-project#2728) FP8 online quantization on Z-Image-Turbo produced pure pixel noise (LPIPS 0.74 vs BF16). Root cause: small precision-sensitive layers — TimestepEmbedder MLP, x_embedder, cap_embedder, per-block adaLN_modulation, and FinalLayer's output/modulation — were being FP8-quantized. Errors on these layers feed the scale chain that multiplies the residual stream every block, so small per-layer drift turns into catastrophic magnitude blow-up by layer 30. Mirrors the earlier OmniGen2 FP8 fix (dbf8b7c). Swap these 6 layers from FP8-quantized linears to plain BF16 linears. Main-path matmuls (to_qkv, to_out, feed_forward.w13, feed_forward.w2) stay FP8, so the memory win is preserved. After fix: LPIPS 0.0659 (PASS, threshold 0.1). Signed-off-by: Zhang <jianmusings@gmail.com>
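As a toy illustration of the blow-up mechanism this commit describes (the 5% per-block figure below is hypothetical, not a measurement): a small relative error on each block's adaLN scale compounds multiplicatively through the residual stream.

```python
# Toy illustration only: multiplicative drift through a 30-block scale chain.
# The 5% per-block scale error is a hypothetical figure, not a measurement.
per_block_error = 1.05
blocks = 30
print(f"magnitude blow-up after {blocks} blocks: {per_block_error ** blocks:.2f}x")
# -> magnitude blow-up after 30 blocks: 4.32x
```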
…under FP8 (vllm-project#2728) FP8 online quantization on Qwen-Image produced pure pixel noise (LPIPS 0.95 vs BF16). Same root cause as the Z-Image fix: precision-sensitive small layers (time embedder, img_in/txt_in entry, per-block img_mod/txt_mod modulation, norm_out.linear, proj_out) feed the shift/scale/gate chain that multiplies the residual stream every block, so small per-layer drift blows up into noise. After fix: LPIPS 0.32 (PASS, Qwen-Image threshold 0.35). Main-path matmuls (to_qkv, to_out, add_kv_proj, to_add_out, img_mlp, txt_mlp) remain FP8 for memory savings — peak ~41 GB vs ~59 GB BF16. Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
…er FP8 (vllm-project#2728) FP8 online quantization on FLUX.1-dev produced pure pixel noise (LPIPS 0.93 vs BF16). Unlike Z-Image/Qwen where the modulation/embedder pattern was enough, FLUX's dual-stream blocks (19 FluxTransformerBlock) run joint attention over concatenated [text, image] tokens — the mixed-distribution activations don't tolerate FP8 per-token quant, and neither the attn nor ff sub-layers can individually take FP8. Keep dual blocks fully BF16 and keep per-block modulation and final norm_out unquantized. Single blocks (38 of them, ~2× more params than dual) remain FP8, preserving most of the memory saving. After fix: LPIPS 0.1201 (PASS, FLUX threshold 0.20). Peak 33.2 GB vs BF16 36.7 GB (saves ~3.5 GB; less than Z-Image/Qwen because the bulk of dual-block params stays BF16). Co-Authored-By: pjh4993 <pjh4993@naver.com> Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
For Qwen-Image, try adding `"ignored_layers": "img_mlp"` as in #1034 and test LPIPS again, thx.
Hey @david6666666, thanks for the review comment! TLDR: For Qwen-Image, image generation quality for this PR is higher than #1034 (LPIPS 0.17 vs. 0.39; lower is better), but the memory saving dropped from 27% to 16%. Across both PRs we have 9 candidate layers to skip in total; I guess we just need to skip enough layers to get LPIPS below the 0.35 threshold.
Update: Due to the precision-critical nature of the layers identified in this PR, I decided to hard-code these layers as skipped (following `hunyuan_image_3_transformer.py:1487`) instead of allowing users to optionally skip them via `ignored_layers`. cc: @lishunyang12

Qwen-Image Results

LPIPS score (quality)

peak GPU + runtime on H100
| variant | peak GPU | mem Δ vs BF16 | inference dt |
|---|---|---|---|
| BF16 baseline | 55.81 GiB | — | 4.05 s |
| pr1034 | 41.03 GiB | −26.5% | 4.06 s |
| pr2795 | 46.77 GiB | −16.2% | 3.85 s |
| union | 47.40 GiB | −15.1% | 4.07 s |
Details for each setting
I ran an LPIPS test on both Qwen/Qwen-Image and Qwen/Qwen-Image-2512 comparing three variants. Same prompt, seed 142, 1024×1024, 20 steps, H100, threshold 0.35.
- `pr1034` refers to the setting in [Feature]: FP8 Quantization Support for DiT #1034: `ignored_layers = ["img_mlp"]`
- `pr2795` refers to the setting in this PR: `ignored_layers = ["timestep_embedder.linear_1", "timestep_embedder.linear_2", "img_mod.1", "txt_mod.1", "img_in", "txt_in", "norm_out.linear", "proj_out"]`
- `union` means the union set from both settings:

```python
ignored_layers = [
    "timestep_embedder.linear_1",
    "timestep_embedder.linear_2",
    "img_mod.1",
    "txt_mod.1",
    "img_in",
    "txt_in",
    "norm_out.linear",
    "proj_out",
    "img_mlp",
]
```
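For context, a hedged sketch of how the comparison settings above (seed 142, 1024×1024, 20 steps) map onto a plain diffusers call; the table numbers come from vllm-omni's own harness, and the prompt string here is a placeholder.

```python
import torch
from diffusers import DiffusionPipeline

# Sketch only: the comparison settings (seed 142, 1024x1024, 20 steps)
# expressed as a plain diffusers call. The benchmark itself runs via
# vllm-omni, not raw diffusers.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
    prompt="<same prompt across all variants>",  # placeholder
    width=1024,
    height=1024,
    num_inference_steps=20,
    generator=torch.Generator("cuda").manual_seed(142),
).images[0]
image.save("qwen_image_reference.png")
```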
@lishunyang12 ptal thx
lishunyang12
left a comment
Thanks for the thorough investigation — the LPIPS + visual comparison matrix across all three models makes verification trivial, and the per-model table of which layers stay FP8 is super helpful for future maintainers. The quant_config=None pattern on modulation / entry / final-projection layers mirrors the Hunyuan-Image-3 precedent cleanly, and referencing #2728 in every code comment keeps the rationale traceable.
LGTM.
LGTM. Thank you for your contribution.
…UX.1-dev (vllm-project#2795) Signed-off-by: Zhang <jianmusings@gmail.com> Co-authored-by: pjh4993 <pjh4993@naver.com>
…eo-1.5 examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920). vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter). Signed-off-by: lishunyang <lishunyang12@163.com>
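A hedged sketch of the ModelOpt calibration pattern that commit describes; the wildcard patterns and the stand-in module are illustrative, not the actual helper script.

```python
import copy

import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Sketch only: a stand-in module plus ModelOpt's FP8 flow with
# precision-sensitive name patterns disabled (patterns are illustrative,
# mirroring the #2728 / #2795 skip list).
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))

quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
for pattern in ("*modulation*", "*embed*", "*proj_out*", "*token_refiner*"):
    quant_cfg["quant_cfg"][pattern] = {"enable": False}

def forward_loop(m):
    # Calibration pass so activation ranges are observed; the real helper
    # runs 8 video prompts x 10 denoising steps through the pipeline.
    for _ in range(8):
        m(torch.randn(2, 64))

model = mtq.quantize(model, quant_cfg, forward_loop)
```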
Purpose
Closes #2728.
Key changes: for precision-sensitive layers, set `quant_config=None`. This fix aligns with what was done for Hunyuan-Image-3 (`hunyuan_image_3_transformer.py:1487`).
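A minimal sketch of the pattern, assuming vLLM-style quantizable linear layers (the exact module paths in vllm-omni may differ; this is not the PR's diff):

```python
import torch.nn as nn
from vllm.model_executor.layers.linear import ColumnParallelLinear

# Sketch, not the PR's diff: main-path matmuls receive the user's
# quant_config (FP8 when enabled), while precision-sensitive layers are
# pinned to plain BF16 weights by passing quant_config=None.
class BlockSketch(nn.Module):
    def __init__(self, dim: int, quant_config):
        super().__init__()
        # Main path: stays FP8 under online quantization.
        self.to_qkv = ColumnParallelLinear(dim, 3 * dim, quant_config=quant_config)
        # Modulation path: always unquantized, regardless of user setting.
        self.adaln_modulation = ColumnParallelLinear(dim, 6 * dim, quant_config=None)
```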
Comparison

| model | BF16 baseline | FP8 before fix | FP8 after fix |
|---|---|---|---|
| Z-Image-Turbo | peak 24.5 GB | peak 19.0 GB · LPIPS 0.74 — noise | peak 19.3 GB · LPIPS 0.07 (PASS, threshold 0.10) |
| Qwen-Image | peak 58.6 GB | peak 41.4 GB · LPIPS 0.95 — noise | peak ~41 GB · LPIPS 0.32 (PASS, threshold 0.35) |
| FLUX.1-dev | peak 36.7 GB | peak 26.8 GB · LPIPS 0.93 — noise | peak 33.2 GB · LPIPS 0.12 (PASS, threshold 0.20) |
Model layers that remain FP8
Z-Image-Turbo

- `ZImageAttention.to_qkv` · `to_out[0]`
- `FeedForward.w13` · `w2`

Qwen-Image

- `QwenImageCrossAttention.to_qkv` · `to_out` · `add_kv_proj` · `to_add_out`
- `FeedForward.net[0].proj` (GELU) · `net[2]` (`img_mlp`, `txt_mlp`)

FLUX.1-dev

- `FluxSingleTransformerBlock.proj_mlp`
- `FluxSingleTransformerBlock.proj_out` (the `dim + 4·dim → dim` recombiner; sketched below)
- `FluxSingleTransformerBlock.attn.to_qkv`

FLUX.1-dev takes the prompt liberally at this step count and guidance. The BF16 baseline composes a brunch scene (pasta, bread, sauce) with the cup of coffee sitting in the top-right corner, rather than the coffee being the main subject. This is FLUX's own behaviour and unrelated to quantisation.
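A shape-level sketch of the single-block recombiner mentioned above, simplified from diffusers' `FluxSingleTransformerBlock` (attention is stubbed out; only the concat-and-project structure is shown):

```python
import torch
import torch.nn as nn

class SingleBlockRecombiner(nn.Module):
    """Simplified sketch of FLUX's single-stream concat-and-project path."""
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.proj_mlp = nn.Linear(dim, mlp_ratio * dim)        # stays FP8
        self.act = nn.GELU(approximate="tanh")
        # Recombiner: concat([attn_out (dim), mlp_hidden (4*dim)]) -> dim
        self.proj_out = nn.Linear(dim + mlp_ratio * dim, dim)  # stays FP8

    def forward(self, hidden: torch.Tensor, attn_out: torch.Tensor):
        mlp_hidden = self.act(self.proj_mlp(hidden))
        return self.proj_out(torch.cat([attn_out, mlp_hidden], dim=-1))

block = SingleBlockRecombiner(3072)  # FLUX hidden size is 3072
x = torch.randn(1, 16, 3072)
print(block(x, x).shape)             # torch.Size([1, 16, 3072])
```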
Test Plan
You can reproduce the images listed above using the commands below.
Z-Image-Turbo
Expected: LPIPS ≈ 0.07, PASS.
Qwen-Image
Expected: LPIPS ≈ 0.32, PASS. Add `--ignored-layers img_mlp,txt_mlp` for ~0.003 LPIPS at +9 GB.

FLUX.1-dev
Expected: LPIPS ≈ 0.12, PASS.
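For reference, a minimal sketch of an LPIPS check like the ones above, using the `lpips` package; the file paths are placeholders and the repo's test harness may compute it differently.

```python
import lpips
import numpy as np
import torch
from PIL import Image

def to_lpips_tensor(path: str) -> torch.Tensor:
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0) * 2.0 - 1.0

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, a common default
score = loss_fn(to_lpips_tensor("bf16.png"), to_lpips_tensor("fp8.png")).item()
print(f"LPIPS: {score:.4f} ({'PASS' if score < 0.20 else 'FAIL'} at FLUX threshold 0.20)")
```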
Re-running all pytests also works. For these tests:

- `tests/diffusion/models/flux/test_flux_prefix_propagation.py`
- `tests/diffusion/models/qwen_image/test_qwen_image_size_utils.py`
- `tests/diffusion/models/z_image/test_zimage_tp_constraints.py`

The result is
Test Result
Tested on an H100 80 GB.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`. — `docs/user_guide/diffusion/quantization/fp8.md` should be updated to drop the "all layers can be quantized" claim for Z-Image / FLUX and to note the new in-source defaults; left out of this PR to keep it focused.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md