
(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 #2913

Open
baonudesifeizhai wants to merge 25 commits into vllm-project:main from baonudesifeizhai:omni2709

Conversation

@baonudesifeizhai

@baonudesifeizhai baonudesifeizhai commented Apr 19, 2026


Purpose

#2709

This PR adds Phase 1 support for ModelOpt FP8 diffusion checkpoints.

  • Auto-detects quantization_config from diffusion checkpoint configs.
  • Resolves generic fp8 stage configs to checkpoint-specific ModelOpt FP8 when serialized ModelOpt metadata is present.
  • Adds a ModelOpt FP8 checkpoint adapter for diffusers-style weight loading.
  • Extends HunyuanImage-3 ModelOpt FP8 loading for attention and MoE scalar scales.
  • Adds FP8 stage configs for supported image backbones.
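The auto-detect flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual vllm-omni API: the helper names and the config lookup order are assumptions; the real adapter also parses config_groups and per-layer ignore lists.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quantization_config(checkpoint_dir: str) -> Optional[dict]:
    """Return the quantization_config block from a diffusers-style
    checkpoint config, if one is present (transformer/config.json is
    checked first, then the top-level config.json)."""
    for name in ("transformer/config.json", "config.json"):
        cfg_path = Path(checkpoint_dir) / name
        if cfg_path.is_file():
            cfg = json.loads(cfg_path.read_text())
            return cfg.get("quantization_config")
    return None


def resolve_quant_method(quant_cfg: Optional[dict]) -> str:
    """Resolve a generic 'fp8' stage config to checkpoint-specific
    ModelOpt FP8 when serialized ModelOpt metadata is present."""
    if quant_cfg and str(quant_cfg.get("quant_algo", "")).upper() == "FP8":
        return "modelopt_fp8"
    return "fp8"  # fall back to generic online FP8
```

With this shape, a stage config can keep requesting plain "fp8" and only checkpoints that actually serialize quant_algo: FP8 get routed to the ModelOpt loader.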

Validation

Validated ModelOpt FP8 image generation on:

  • Flux
  • Flux2-Klein
  • Qwen-Image
  • HunyuanImage-3

Test Plan

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux1-dev-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/flux_dit_2gpu_fp8.yaml \
  --prompt "a small red ceramic teapot on a wooden table, soft window light" \
  --height 512 \
  --width 512 \
  --num-inference-steps 2 \
  --seed 42 \
  --output outputs/flux_modelopt_fp8.png \
  --enforce-eager \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux_modelopt_fp8.log
[result image attached]

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux2-klein-4b-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/flux2_klein_dit_2gpu_fp8.yaml \
  --prompt "a cozy Tokyo cafe corner at night, warm tungsten lighting, rain on the window, ceramic coffee cup, highly detailed, cinematic photograph" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/flux2_klein_modelopt_fp8.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_klein_modelopt_fp8.log

[result image attached]

ModelOpt FP8 quantization script for Qwen-Image:
https://paste.ubuntu.com/p/gby859n2Qt/

CUDA_VISIBLE_DEVICES=0 \
/root/zdj/vllm/.venv/bin/python \
  /tmp/quantize_qwen_image_modelopt_fp8.py \
  --model /root/zdj/models/qwen-image \
  --output /root/zdj/models/qwen-image-modelopt-fp8 \
  --calib-size 8 \
  --calib-steps 8 \
  --height 512 \
  --width 512 \
  --overwrite \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_export.log
CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/qwen-image-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --prompt "a clean product photo of a blue enamel mug on a white desk, realistic lighting" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/qwen_image_modelopt_fp8.png \
  --enforce-eager \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8.log
[result images attached]

HunyuanImage-3 ModelOpt FP8 quantization script: https://paste.ubuntu.com/p/dTgpmNzw3K/

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_steps20.log
[result image attached]

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic close-up photo of a glass greenhouse in a snowy mountain village at sunrise, warm golden light glowing through the windows, frost on the glass, pine trees, soft mist, ultra detailed, realistic photography" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 123 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_greenhouse_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_greenhouse_steps20.log
[result image attached]
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.



@baonudesifeizhai
Author

baonudesifeizhai commented Apr 19, 2026

flux2dev modelopt fp8 script:
https://paste.ubuntu.com/p/Pkw5Wsjv4q/

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux2-dev-modelopt-fp8 \
  --stage-configs-path /tmp/flux2_dev_dit_2gpu_fp8.yaml \
  --prompt "a luxury art deco train dining car at golden hour, emerald velvet seats, brass lamps, rain streaks on the windows, cinematic wide angle photograph, highly detailed" \
  --guidance-scale 2.5 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 123 \
  --output outputs/flux2_dev_modelopt_fp8_artdeco_train_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_dev_modelopt_fp8_artdeco_train_steps20.log
[result image attached]

@hsliuustc0106
Collaborator

BLOCKING:

  • Test Coverage — No e2e online serving test. Please add a test that:
    1. Starts vllm serve <model> --omni
    2. Sends a generation request via the API
    3. Asserts the response contains a valid image

ModelOpt FP8 checkpoints should work in both Omni (offline) and vllm serve / AsyncOmni (online) modes before merging.
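The response-validation half of such a test can be sketched as a small helper. This is illustrative only: the helper name is an assumption, and the server-startup and request steps are shown as comments rather than real fixture code.

```python
import base64

# PNG files start with this fixed 8-byte signature.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def is_valid_png_b64(b64_data: str) -> bool:
    """Check that a b64_json payload decodes to something that at least
    starts with the PNG magic bytes (a cheap validity assertion for e2e
    tests, not a full image decode)."""
    try:
        raw = base64.b64decode(b64_data, validate=True)
    except Exception:
        return False
    return raw.startswith(PNG_MAGIC) and len(raw) > len(PNG_MAGIC)


# Outline of the e2e test, per the checklist above:
# 1. Start `vllm serve <model> --omni` in a subprocess and wait for readiness.
# 2. POST a generation request to /v1/images/generations with
#    response_format="b64_json".
# 3. assert is_valid_png_b64(response["data"][0]["b64_json"])
```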

lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…vllm-project#2920)

Threads quant_config / prefix through HunyuanVideo15Attention,
HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so
the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales.
Modulation, embeddings, proj_out stay raw nn.Linear (full precision).

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…eo-1.5

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps,
  skips precision-sensitive layers (modulation, embeddings, output proj,
  token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers
  by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920).

vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects
  ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter).

Signed-off-by: lishunyang <lishunyang12@163.com>
@baonudesifeizhai
Author

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/flux2-dev-modelopt-fp8 \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path /tmp/flux2_dev_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_dev_modelopt_fp8_online_server.log

prompt:https://paste.ubuntu.com/p/ypkqDtNxQN/

[result image attached]

lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
The default export_hf_checkpoint() doesn't actually serialize weights as FP8
for unknown model types like HunyuanVideo15Transformer3DModel — it saves
BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug.

Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight
  per-module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy source minus transformer/, then save the
  quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so
  vllm-omni's adapter (vllm-project#2913) auto-detects it.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…block

When --weight-block-size 'M,N' is given, override the weight quantizer with
block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor
instead of a scalar. Patched config_groups advertises strategy='block' +
block_structure='MxN' so consumers know what to expect.

Static FP8 is exempt from upstream vLLM's online block-wise gate, so this
just works at serving time via vllm-project#2913's adapter.

Default behavior unchanged (per-tensor) — pass --weight-block-size 128,128
to opt in.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
@baonudesifeizhai
Author

baonudesifeizhai commented Apr 20, 2026

Z-Image:
For Z-Image ModelOpt FP8, the main caution is that not all linear layers are equally stable under FP8, so use a conservative quantization profile.
Also preserve full transformer submodule prefixes during loading: Z-Image ignore-list matching depends on names like layers.*.attention.to_out, layers.*.feed_forward.w2, noise_refiner.*, and context_refiner.*; wrong prefixes can silently produce corrupted images.
https://paste.ubuntu.com/p/F8hdb5SMnY/
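The prefix pitfall can be illustrated with a tiny glob match. The patterns below are illustrative stand-ins for the checkpoint's actual ignore list, and is_ignored is a hypothetical helper, not vllm-omni code:

```python
from fnmatch import fnmatch

# Illustrative Z-Image-style ignore patterns; the real list is read from
# the checkpoint's quantization_config.
IGNORE_PATTERNS = [
    "layers.*.attention.to_out*",
    "layers.*.feed_forward.w2*",
    "noise_refiner.*",
    "context_refiner.*",
]


def is_ignored(param_name: str) -> bool:
    """Return True if a parameter name matches the FP8 ignore list."""
    return any(fnmatch(param_name, pattern) for pattern in IGNORE_PATTERNS)
```

With the full prefix, "layers.3.attention.to_out.weight" matches and stays full precision; if the loader strips the prefix to "attention.to_out.weight", nothing matches, the layer gets quantized anyway, and the output degrades with no error.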

CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  /tmp/quantize_z_image_base_modelopt_fp8.py \
  --model /root/zdj/models/z-image \
  --output /root/zdj/models/z-image-modelopt-fp8-conservative \
  --profile conservative \
  --calib-size 8 \
  --calib-steps 28 \
  --height 512 \
  --width 512 \
  --guidance-scale 4.0 \
  --overwrite \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_export.log

offline

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/z-image-modelopt-fp8-conservative \
  --stage-configs-path vllm_omni/model_executor/stage_configs/z_image_dit_2gpu_fp8.yaml \
  --prompt "an Elden Ring style lone tarnished knight standing before a shattered cathedral under a dying golden tree, ruined stone arches, drifting ash, dramatic god rays, dark fantasy, cinematic, ultra detailed" \
  --negative-prompt "blurry, low quality, distorted, deformed, watermark" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 28 \
  --seed 42 \
  --output outputs/z_image_modelopt_fp8_conservative_steps28.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_steps28.log
[result image attached]
CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/z-image-modelopt-fp8-conservative \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/z_image_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_online_server.log

curl -s http://127.0.0.1:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a Horus Heresy scene, a towering Space Marine in battered crusade-era power armor standing inside a ruined imperial cathedral during the age of civil war, shattered aquila banners, burning censers, broken stained glass, ash and embers drifting through the air, tragic gothic atmosphere, dramatic god rays, cinematic, ultra detailed",
    "negative_prompt": "blurry, low quality, distorted, deformed, watermark, extra limbs, bad anatomy",
    "size": "512x512",
    "num_inference_steps": 28,
    "guidance_scale": 4.0,
    "seed": 42,
    "response_format": "b64_json"
  }' \
  | jq -r '.data[0].b64_json' \
  | base64 -d \
  > outputs/z_image_modelopt_fp8_conservative_online_horus_heresy_steps28.png
[result image attached]

@baonudesifeizhai
Author

baonudesifeizhai commented Apr 20, 2026

for online qwen-image:

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/qwen-image-modelopt-fp8 \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_online_2gpu_server.log
curl -sS http://127.0.0.1:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "/root/zdj/models/qwen-image-modelopt-fp8",
    "prompt": "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed, realistic fur texture",
    "negative_prompt": "blurry, low quality, distorted, deformed, oversaturated",
    "size": "512x512",
    "response_format": "b64_json",
    "n": 1,
    "num_inference_steps": 20,
    "true_cfg_scale": 4.0,
    "seed": 42
  }' \
  | tee outputs/qwen_image_modelopt_fp8_online_2gpu_response.json
python - <<'PY'
import base64, json
from pathlib import Path

payload = json.loads(Path("outputs/qwen_image_modelopt_fp8_online_2gpu_response.json").read_text())
Path("outputs/qwen_image_modelopt_fp8_online_2gpu_steps20.png").write_bytes(
    base64.b64decode(payload["data"][0]["b64_json"])
)
print("saved outputs/qwen_image_modelopt_fp8_online_2gpu_steps20.png")
PY

[result image attached]

offline:

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/qwen-image-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --prompt "a grimdark Warhammer 40,000 style hive city stretching into a poisoned orange sky, endless gothic megastructures, towering manufactorum spires, cathedral-like hab blocks, polluted atmosphere, flying gunships, crowds of tiny pilgrims and workers below, dramatic volumetric light, ash and smoke, cinematic, ultra detailed" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated, watermark, text" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/qwen_image_modelopt_fp8_warhammer40k_hive_city_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_warhammer40k_hive_city_steps20.log
[result image attached]

roG0d and others added 13 commits April 20, 2026 18:06
@baonudesifeizhai
Author

E2E test:

CUDA_VISIBLE_DEVICES=0,1 \
VLLM_TARGET_DEVICE=cuda \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m pytest \
  tests/e2e/online_serving/test_modelopt_fp8_image_serving.py::test_modelopt_fp8_images_api_returns_valid_image \
  -s \
  --run-level advanced_model \
  --tb=short

Result: passed.

@david6666666
Collaborator

We should have a unified model weight conversion script, like those in vllm-omni/vllm_omni/quantization/tools, plus a compare_diffusion_trajectory_similarity script. WDYT @baonudesifeizhai @lishunyang12

Resolve conflicts in diffusion config and loader paths.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
@lishunyang12
Collaborator

Quality outputs look good but we have no perf numbers for any of the 5 models. Can you share:

  • Latency + peak memory table for bf16 vs modelopt-fp8 (at least Flux + HunyuanImage-3)
  • Profiler trace per the profiling guide for one model — top-N kernels to confirm fp8 GEMM path is actually active and not silently falling back to bf16
  • List of layers that fell back to bf16 (skipped/unsupported) and why

Want to validate the perf story before merging.

@baonudesifeizhai
Author

baonudesifeizhai commented Apr 24, 2026

After forcing force_kernel=PerTensorTorchFP8ScaledMMLinearKernel on the vLLM side, benchmark results:

Z-Image:

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/z-image-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  36.19
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              1.38
Latency Mean (s):                        9.8473
Latency Median (s):                      11.5595
Latency P99 (s):                         11.5955
Latency P95 (s):                         11.5804
--------------------------------------------------
Peak Memory Max (MB):                    13626.00
Peak Memory Mean (MB):                   13626.00
Peak Memory Median (MB):                 13626.00

============================================================

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/z-image
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  38.95
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              1.28
Latency Mean (s):                        10.5946
Latency Median (s):                      12.4189
Latency P99 (s):                         12.4794
Latency P95 (s):                         12.4735
--------------------------------------------------
Peak Memory Max (MB):                    16940.00
Peak Memory Mean (MB):                   16940.00
Peak Memory Median (MB):                 16940.00

============================================================



================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-dev-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  84.41
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.59
Latency Mean (s):                        22.9657
Latency Median (s):                      26.9953
Latency P99 (s):                         27.0379
Latency P95 (s):                         27.0342
--------------------------------------------------
Peak Memory Max (MB):                    65390.00
Peak Memory Mean (MB):                   65390.00
Peak Memory Median (MB):                 65390.00

vs 


================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-dev
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  92.37
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.54
Latency Mean (s):                        25.1176
Latency Median (s):                      29.5167
Latency P99 (s):                         29.5819
Latency P95 (s):                         29.5704
--------------------------------------------------
Peak Memory Max (MB):                    80366.00
Peak Memory Mean (MB):                   80366.00
Peak Memory Median (MB):                 80366.00

============================================================
Metrics saved to outputs/perf/flux2_dev_bf16_2gpu_c16_n50.json
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-klein-4b-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  18.59
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              2.69
Latency Mean (s):                        5.0586
Latency Median (s):                      5.9344
Latency P99 (s):                         5.9611
Latency P95 (s):                         5.9536
--------------------------------------------------
Peak Memory Max (MB):                    12758.00
Peak Memory Mean (MB):                   12758.00
Peak Memory Median (MB):                 12758.00
vs 

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-klein-4b
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  19.69
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              2.54
Latency Mean (s):                        5.3560
Latency Median (s):                      6.2916
Latency P99 (s):                         6.3101
Latency P95 (s):                         6.3021
--------------------------------------------------
Peak Memory Max (MB):                    14506.00
Peak Memory Mean (MB):                   14506.00
Peak Memory Median (MB):                 14506.00

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/hunyuan-image3-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  241.25
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.21
Latency Mean (s):                        65.6834
Latency Median (s):                      76.9524
Latency P99 (s):                         77.5409
Latency P95 (s):                         77.2703
--------------------------------------------------
Peak Memory Max (MB):                    96940.00
Peak Memory Mean (MB):                   96940.00
Peak Memory Median (MB):                 96940.00

============================================================

vs 

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/hunyuan-image3
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  282.36
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.18
Latency Mean (s):                        76.8621
Latency Median (s):                      90.0141
Latency P99 (s):                         90.8956
Latency P95 (s):                         90.5725
--------------------------------------------------
Peak Memory Max (MB):                    135402.00
Peak Memory Mean (MB):                   135402.00
Peak Memory Median (MB):                 135402.00

============================================================
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  110.26
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.45
Latency Mean (s):                        29.9892
Latency Median (s):                      35.1732
Latency P99 (s):                         35.3604
Latency P95 (s):                         35.3243
--------------------------------------------------
Peak Memory Max (MB):                    39730.00
Peak Memory Mean (MB):                   39729.60
Peak Memory Median (MB):                 39730.00

============================================================
vs

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  102.26
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.49
Latency Mean (s):                        27.8345
Latency Median (s):                      32.6885
Latency P99 (s):                         32.7558
Latency P95 (s):                         32.7227
--------------------------------------------------
Peak Memory Max (MB):                    46464.00
Peak Memory Mean (MB):                   46464.00
Peak Memory Median (MB):                 46464.00

============================================================
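For reference, the relative deltas in the comparison above can be computed directly from the reported numbers (values copied verbatim from the two benchmark tables; this is just arithmetic on the printed results, not a re-run):

```python
# Numbers copied from the two serving-benchmark tables above
# (qwen-image ModelOpt FP8 vs. the unquantized baseline).
fp8 = {"latency_mean_s": 29.9892, "peak_mem_mb": 39730.0, "throughput_rps": 0.45}
base = {"latency_mean_s": 27.8345, "peak_mem_mb": 46464.0, "throughput_rps": 0.49}

# Fraction of peak memory saved by the FP8 checkpoint.
mem_saving = 1 - fp8["peak_mem_mb"] / base["peak_mem_mb"]
# Relative change in mean latency (positive means FP8 was slower in this run).
latency_delta = fp8["latency_mean_s"] / base["latency_mean_s"] - 1

print(f"peak memory saving: {mem_saving:.1%}")      # ~14.5% less memory
print(f"mean latency delta: {latency_delta:+.1%}")  # ~+7.7% in this run
```

So at this concurrency the FP8 checkpoint trades a single-digit latency regression for a roughly 6.7 GB reduction in peak memory.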

@baonudesifeizhai
Author

https://paste.ubuntu.com/p/92yBc9x7bB/

curl -sS http://127.0.0.1:8160/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{
    "model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "messages": [
      {
        "role": "user",
        "content": "A beautiful cinematic photo of a small red fox sitting in a snowy forest, ultra detailed, soft natural light"
      }
    ],
    "max_tokens": 1024
  }' \
| /root/zdj/vllm/.venv/bin/python -c 'import sys,json,re,base64; r=json.load(sys.stdin); s=json.dumps(r); m=re.search(r"data:image/png;base64,([A-Za-z0-9+/=]+)", s); assert m, s[:1000]; open("output.png","wb").write(base64.b64decode(m.group(1))); print("saved output.png")'

(generated image: red fox in a snowy forest)
 curl -sS http://127.0.0.1:8160/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{
    "model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "messages": [
      {
        "role": "user",
        "content": "A colossal warp monster emerging from a torn reality rift inside a gothic sci-fi battlefield, grimdark far future war aesthetic, twisted horns, glowing eyes, corrupted flesh, black armor fragments, chaotic purple and red energy, cathedral ruins, smoke, fire, cinematic lighting, ultra detailed, dramatic composition"
      }
    ],
    "max_tokens": 1024
  }' \
| /root/zdj/vllm/.venv/bin/python -c 'import sys,json,re,base64; r=json.load(sys.stdin); s=json.dumps(r); m=re.search(r"data:image/png;base64,([A-Za-z0-9+/=]+)", s); assert m, s[:1000]; open("warpspawn_40k_grimdark.png","wb").write(base64.b64decode(m.group(1))); print("saved warpspawn_40k_grimdark.png")'
(generated image: warp monster battlefield scene)
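The inline `python -c` one-liner used in both curl commands above can be expanded into a readable helper. This is the same regex-on-serialized-JSON approach; the function name and response shape in the test are ours, since the one-liner only assumes a `data:image/png;base64,...` URI appears somewhere in the response:

```python
import base64
import json
import re

# Matches the first base64 PNG data URI anywhere in the serialized response,
# exactly like the inline one-liner in the curl pipelines above.
DATA_URI_RE = re.compile(r"data:image/png;base64,([A-Za-z0-9+/=]+)")


def save_first_png(response_json: str, path: str) -> bool:
    """Find the first base64 PNG data URI in a JSON response and write it out.

    Returns True if an image was found and saved, False otherwise.
    """
    # Round-trip through json to normalize escaping before searching.
    payload = json.dumps(json.loads(response_json))
    match = DATA_URI_RE.search(payload)
    if match is None:
        return False
    with open(path, "wb") as f:
        f.write(base64.b64decode(match.group(1)))
    return True
```

Searching the serialized JSON rather than walking the response structure keeps the helper agnostic to where the server nests the image in `choices`.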
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  254.98
Request rate:                            inf
Max request concurrency:                 32
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.20
Latency Mean (s):                        113.6107
Latency Median (s):                      129.1408
Latency P99 (s):                         180.3258
Latency P95 (s):                         180.2265
--------------------------------------------------
Peak Memory Max (MB):                    84630.00
Peak Memory Mean (MB):                   81597.36
Peak Memory Median (MB):                 84630.00
vs. the unquantized qwen-image-2512 baseline:


================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-2512
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  285.53
Request rate:                            inf
Max request concurrency:                 32
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.18
Latency Mean (s):                        127.2317
Latency Median (s):                      144.4706
Latency P99 (s):                         202.5307
Latency P95 (s):                         202.4224
--------------------------------------------------
Peak Memory Max (MB):                    97340.00
Peak Memory Mean (MB):                   94306.96
Peak Memory Median (MB):                 97340.00

============================================================

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
baonudesifeizhai and others added 6 commits April 25, 2026 16:07
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
@baonudesifeizhai
Author

cat >/tmp/modelopt_quality_cases.json <<'JSON'
[
  {
    "id": "qwen_image_2512_modelopt_fp8_dynamic_all",
    "baseline_model": "/root/zdj/models/qwen-image-2512",
    "quantized_model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "task": "t2i",
    "prompt": "a fox sitting in the snow in a forest, realistic photo",
    "max_lpips": 0.35,
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 20,
    "seed": 42,
    "negative_prompt": "blurry, low quality"
  }
]
JSON
export VLLM_OMNI_QUALITY_CONFIGS=/tmp/modelopt_quality_cases.json
export VLLM_OMNI_QUALITY_OUTPUT_DIR=/tmp/modelopt_quality_outputs

PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:${PYTHONPATH:-} \
/root/zdj/vllm/.venv/bin/python -m pytest \
  tests/diffusion/quantization/test_quantization_quality.py \
  -v -m "" -k qwen_image_2512_modelopt_fp8_dynamic_all

tests/diffusion/quantization/test_quantization_quality.py::test_quantization_quality[qwen_image_2512_modelopt_fp8_dynamic_all] PASSED [100%]
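Before launching the (slow) quality run, the case file can be sanity-checked with a small pre-flight script. The required-key set below is an assumption taken from the fields used in this example, not the test suite's actual schema:

```python
import json

# Keys this example case actually relies on; assumed minimum, not the
# authoritative schema of test_quantization_quality.py.
REQUIRED = {"id", "baseline_model", "quantized_model", "task", "prompt", "max_lpips"}


def check_cases(path: str) -> list[str]:
    """Validate a quality-case JSON file and return the case ids."""
    with open(path) as f:
        cases = json.load(f)
    ids = []
    for case in cases:
        missing = REQUIRED - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} missing: {sorted(missing)}")
        # LPIPS is a perceptual distance; thresholds outside (0, 1] are suspect.
        assert 0.0 < case["max_lpips"] <= 1.0, "max_lpips should be in (0, 1]"
        ids.append(case["id"])
    return ids
```

Running `check_cases("/tmp/modelopt_quality_cases.json")` on the config above should return `["qwen_image_2512_modelopt_fp8_dynamic_all"]`, matching the `-k` filter passed to pytest.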
