
(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints #2709 #2913

Open
baonudesifeizhai wants to merge 25 commits into vllm-project:main from baonudesifeizhai:omni2709

Conversation

@baonudesifeizhai

@baonudesifeizhai baonudesifeizhai commented Apr 19, 2026


Purpose

#2709

This PR adds Phase 1 support for ModelOpt FP8 diffusion checkpoints.

  • Auto-detects quantization_config from diffusion checkpoint configs.
  • Resolves generic fp8 stage configs to checkpoint-specific ModelOpt FP8 when serialized ModelOpt metadata is present.
  • Adds a ModelOpt FP8 checkpoint adapter for diffusers-style weight loading.
  • Extends HunyuanImage-3 ModelOpt FP8 loading for attention and MoE scalar scales.
  • Adds FP8 stage configs for supported image backbones.
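The auto-detect flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual vllm-omni API: the helper names and the config lookup order are assumptions; the real adapter also parses config_groups and per-layer ignore lists.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quantization_config(checkpoint_dir: str) -> Optional[dict]:
    """Return the quantization_config block from a diffusers-style
    checkpoint config, if one is present (transformer/config.json is
    checked first, then the top-level config.json)."""
    for name in ("transformer/config.json", "config.json"):
        cfg_path = Path(checkpoint_dir) / name
        if cfg_path.is_file():
            cfg = json.loads(cfg_path.read_text())
            return cfg.get("quantization_config")
    return None


def resolve_quant_method(quant_cfg: Optional[dict]) -> str:
    """Resolve a generic 'fp8' stage config to checkpoint-specific
    ModelOpt FP8 when serialized ModelOpt metadata is present."""
    if quant_cfg and str(quant_cfg.get("quant_algo", "")).upper() == "FP8":
        return "modelopt_fp8"
    return "fp8"  # fall back to generic online FP8
```

With this shape, a stage config can keep requesting plain "fp8" and only checkpoints that actually serialize quant_algo: FP8 get routed to the ModelOpt loader.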

Validation

Validated ModelOpt FP8 image generation on:

  • Flux
  • Flux2-Klein
  • Qwen-Image
  • HunyuanImage-3

Test Plan

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux1-dev-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/flux_dit_2gpu_fp8.yaml \
  --prompt "a small red ceramic teapot on a wooden table, soft window light" \
  --height 512 \
  --width 512 \
  --num-inference-steps 2 \
  --seed 42 \
  --output outputs/flux_modelopt_fp8.png \
  --enforce-eager \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux_modelopt_fp8.log
[result image attached]

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux2-klein-4b-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/flux2_klein_dit_2gpu_fp8.yaml \
  --prompt "a cozy Tokyo cafe corner at night, warm tungsten lighting, rain on the window, ceramic coffee cup, highly detailed, cinematic photograph" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/flux2_klein_modelopt_fp8.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_klein_modelopt_fp8.log

[result image attached]

ModelOpt FP8 quantization script for Qwen-Image:
https://paste.ubuntu.com/p/gby859n2Qt/

CUDA_VISIBLE_DEVICES=0 \
/root/zdj/vllm/.venv/bin/python \
  /tmp/quantize_qwen_image_modelopt_fp8.py \
  --model /root/zdj/models/qwen-image \
  --output /root/zdj/models/qwen-image-modelopt-fp8 \
  --calib-size 8 \
  --calib-steps 8 \
  --height 512 \
  --width 512 \
  --overwrite \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_export.log
CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/qwen-image-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --prompt "a clean product photo of a blue enamel mug on a white desk, realistic lighting" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/qwen_image_modelopt_fp8.png \
  --enforce-eager \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8.log
[result images attached]

HunyuanImage-3 ModelOpt FP8 quantization script: https://paste.ubuntu.com/p/dTgpmNzw3K/

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_steps20.log
[result image attached]

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/hunyuan-image3-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/hunyuan_image3_moe_dit_2gpu_fp8.yaml \
  --prompt "a cinematic close-up photo of a glass greenhouse in a snowy mountain village at sunrise, warm golden light glowing through the windows, frost on the glass, pine trees, soft mist, ultra detailed, realistic photography" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 123 \
  --use-system-prompt en_vanilla \
  --output outputs/hunyuan_image3_modelopt_fp8_greenhouse_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/hunyuan_image3_modelopt_fp8_greenhouse_steps20.log
[result image attached]
Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.



@baonudesifeizhai
Author

baonudesifeizhai commented Apr 19, 2026

flux2dev modelopt fp8 script:
https://paste.ubuntu.com/p/Pkw5Wsjv4q/

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/flux2-dev-modelopt-fp8 \
  --stage-configs-path /tmp/flux2_dev_dit_2gpu_fp8.yaml \
  --prompt "a luxury art deco train dining car at golden hour, emerald velvet seats, brass lamps, rain streaks on the windows, cinematic wide angle photograph, highly detailed" \
  --guidance-scale 2.5 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 123 \
  --output outputs/flux2_dev_modelopt_fp8_artdeco_train_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_dev_modelopt_fp8_artdeco_train_steps20.log
[result image attached]

@hsliuustc0106
Collaborator

BLOCKING:

  • Test Coverage — No e2e online serving test. Please add a test that:
    1. Starts vllm serve <model> --omni
    2. Sends a generation request via the API
    3. Asserts the response contains a valid image

ModelOpt FP8 checkpoints should work in both Omni (offline) and vllm serve / AsyncOmni (online) modes before merging.
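The response-validation half of such a test can be sketched as a small helper. This is illustrative only: the helper name is an assumption, and the server-startup and request steps are shown as comments rather than real fixture code.

```python
import base64

# PNG files start with this fixed 8-byte signature.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def is_valid_png_b64(b64_data: str) -> bool:
    """Check that a b64_json payload decodes to something that at least
    starts with the PNG magic bytes (a cheap validity assertion for e2e
    tests, not a full image decode)."""
    try:
        raw = base64.b64decode(b64_data, validate=True)
    except Exception:
        return False
    return raw.startswith(PNG_MAGIC) and len(raw) > len(PNG_MAGIC)


# Outline of the e2e test, per the checklist above:
# 1. Start `vllm serve <model> --omni` in a subprocess and wait for readiness.
# 2. POST a generation request to /v1/images/generations with
#    response_format="b64_json".
# 3. assert is_valid_png_b64(response["data"][0]["b64_json"])
```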

lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…vllm-project#2920)

Threads quant_config / prefix through HunyuanVideo15Attention,
HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so
the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales.
Modulation, embeddings, proj_out stay raw nn.Linear (full precision).

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…eo-1.5

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps,
  skips precision-sensitive layers (modulation, embeddings, output proj,
  token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers
  by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920).

vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects
  ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter).

Signed-off-by: lishunyang <lishunyang12@163.com>
@baonudesifeizhai
Author

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/flux2-dev-modelopt-fp8 \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path /tmp/flux2_dev_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/flux2_dev_modelopt_fp8_online_server.log

prompt:https://paste.ubuntu.com/p/ypkqDtNxQN/

[result image attached]

lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
The default export_hf_checkpoint() doesn't actually serialize weights as FP8
for unknown model types like HunyuanVideo15Transformer3DModel — it saves
BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug.

Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight
  per-module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy source minus transformer/, then save the
  quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so
  vllm-omni's adapter (vllm-project#2913) auto-detects it.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…block

When --weight-block-size 'M,N' is given, override the weight quantizer with
block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor
instead of a scalar. Patched config_groups advertises strategy='block' +
block_structure='MxN' so consumers know what to expect.

Static FP8 is exempt from upstream vLLM's online block-wise gate, so this
just works at serving time via vllm-project#2913's adapter.

Default behavior unchanged (per-tensor) — pass --weight-block-size 128,128
to opt in.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
@baonudesifeizhai
Author

baonudesifeizhai commented Apr 20, 2026

Z-Image:
For Z-Image ModelOpt FP8, the main caution is that not all linear layers are equally stable under FP8, so use a conservative quantization profile.
Also preserve full transformer submodule prefixes during loading: Z-Image ignore-list matching depends on names like layers.*.attention.to_out, layers.*.feed_forward.w2, noise_refiner.*, and context_refiner.*; wrong prefixes can silently produce corrupted images.
https://paste.ubuntu.com/p/F8hdb5SMnY/
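The prefix pitfall can be illustrated with a tiny glob match. The patterns below are illustrative stand-ins for the checkpoint's actual ignore list, and is_ignored is a hypothetical helper, not vllm-omni code:

```python
from fnmatch import fnmatch

# Illustrative Z-Image-style ignore patterns; the real list is read from
# the checkpoint's quantization_config.
IGNORE_PATTERNS = [
    "layers.*.attention.to_out*",
    "layers.*.feed_forward.w2*",
    "noise_refiner.*",
    "context_refiner.*",
]


def is_ignored(param_name: str) -> bool:
    """Return True if a parameter name matches the FP8 ignore list."""
    return any(fnmatch(param_name, pattern) for pattern in IGNORE_PATTERNS)
```

With the full prefix, "layers.3.attention.to_out.weight" matches and stays full precision; if the loader strips the prefix to "attention.to_out.weight", nothing matches, the layer gets quantized anyway, and the output degrades with no error.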

CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  /tmp/quantize_z_image_base_modelopt_fp8.py \
  --model /root/zdj/models/z-image \
  --output /root/zdj/models/z-image-modelopt-fp8-conservative \
  --profile conservative \
  --calib-size 8 \
  --calib-steps 28 \
  --height 512 \
  --width 512 \
  --guidance-scale 4.0 \
  --overwrite \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_export.log

offline

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/z-image-modelopt-fp8-conservative \
  --stage-configs-path vllm_omni/model_executor/stage_configs/z_image_dit_2gpu_fp8.yaml \
  --prompt "an Elden Ring style lone tarnished knight standing before a shattered cathedral under a dying golden tree, ruined stone arches, drifting ash, dramatic god rays, dark fantasy, cinematic, ultra detailed" \
  --negative-prompt "blurry, low quality, distorted, deformed, watermark" \
  --guidance-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 28 \
  --seed 42 \
  --output outputs/z_image_modelopt_fp8_conservative_steps28.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_steps28.log
[result image attached]
CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/z-image-modelopt-fp8-conservative \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/z_image_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/z_image_modelopt_fp8_conservative_online_server.log

curl -s http://127.0.0.1:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a Horus Heresy scene, a towering Space Marine in battered crusade-era power armor standing inside a ruined imperial cathedral during the age of civil war, shattered aquila banners, burning censers, broken stained glass, ash and embers drifting through the air, tragic gothic atmosphere, dramatic god rays, cinematic, ultra detailed",
    "negative_prompt": "blurry, low quality, distorted, deformed, watermark, extra limbs, bad anatomy",
    "size": "512x512",
    "num_inference_steps": 28,
    "guidance_scale": 4.0,
    "seed": 42,
    "response_format": "b64_json"
  }' \
  | jq -r '.data[0].b64_json' \
  | base64 -d \
  > outputs/z_image_modelopt_fp8_conservative_online_horus_heresy_steps28.png
[result image attached]

@baonudesifeizhai
Author

baonudesifeizhai commented Apr 20, 2026

for online qwen-image:

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m vllm_omni.entrypoints.cli.main \
  serve /root/zdj/models/qwen-image-modelopt-fp8 \
  --omni \
  --host 127.0.0.1 \
  --port 8000 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_online_2gpu_server.log
curl -sS http://127.0.0.1:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "/root/zdj/models/qwen-image-modelopt-fp8",
    "prompt": "a cinematic photo of a red fox standing in a snowy pine forest, soft morning light, highly detailed, realistic fur texture",
    "negative_prompt": "blurry, low quality, distorted, deformed, oversaturated",
    "size": "512x512",
    "response_format": "b64_json",
    "n": 1,
    "num_inference_steps": 20,
    "true_cfg_scale": 4.0,
    "seed": 42
  }' \
  | tee outputs/qwen_image_modelopt_fp8_online_2gpu_response.json
python - <<'PY'
import base64, json
from pathlib import Path

payload = json.loads(Path("outputs/qwen_image_modelopt_fp8_online_2gpu_response.json").read_text())
Path("outputs/qwen_image_modelopt_fp8_online_2gpu_steps20.png").write_bytes(
    base64.b64decode(payload["data"][0]["b64_json"])
)
print("saved outputs/qwen_image_modelopt_fp8_online_2gpu_steps20.png")
PY

[result image attached]

offline:

CUDA_VISIBLE_DEVICES=0,1 \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python \
  examples/offline_inference/text_to_image/text_to_image.py \
  --model /root/zdj/models/qwen-image-modelopt-fp8 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen_image_dit_2gpu_fp8.yaml \
  --prompt "a grimdark Warhammer 40,000 style hive city stretching into a poisoned orange sky, endless gothic megastructures, towering manufactorum spires, cathedral-like hab blocks, polluted atmosphere, flying gunships, crowds of tiny pilgrims and workers below, dramatic volumetric light, ash and smoke, cinematic, ultra detailed" \
  --negative-prompt "blurry, low quality, distorted, deformed, oversaturated, watermark, text" \
  --cfg-scale 4.0 \
  --height 512 \
  --width 512 \
  --num-inference-steps 20 \
  --seed 42 \
  --output outputs/qwen_image_modelopt_fp8_warhammer40k_hive_city_steps20.png \
  --stage-init-timeout 900 \
  --init-timeout 900 \
  2>&1 | tee outputs/qwen_image_modelopt_fp8_warhammer40k_hive_city_steps20.log
[result image attached]

roG0d and others added 13 commits April 20, 2026 18:06
@baonudesifeizhai
Author

E2E test:

CUDA_VISIBLE_DEVICES=0,1 \
VLLM_TARGET_DEVICE=cuda \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:$PYTHONPATH \
/root/zdj/vllm/.venv/bin/python -m pytest \
  tests/e2e/online_serving/test_modelopt_fp8_image_serving.py::test_modelopt_fp8_images_api_returns_valid_image \
  -s \
  --run-level advanced_model \
  --tb=short

Result: passed.

@david6666666
Collaborator

We should have a unified model weight conversion script, like those in vllm-omni/vllm_omni/quantization/tools, plus a compare_diffusion_trajectory_similarity script. WDYT @baonudesifeizhai @lishunyang12

Resolve conflicts in diffusion config and loader paths.

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
@lishunyang12
Collaborator

Quality outputs look good but we have no perf numbers for any of the 5 models. Can you share:

  • Latency + peak memory table for bf16 vs modelopt-fp8 (at least Flux + HunyuanImage-3)
  • Profiler trace per the profiling guide for one model — top-N kernels to confirm fp8 GEMM path is actually active and not silently falling back to bf16
  • List of layers that fell back to bf16 (skipped/unsupported) and why

Want to validate the perf story before merging.

@baonudesifeizhai
Author

baonudesifeizhai commented Apr 24, 2026

After forcing force_kernel=PerTensorTorchFP8ScaledMMLinearKernel on the vLLM side, benchmark results:

Z-Image:

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/z-image-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  36.19
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              1.38
Latency Mean (s):                        9.8473
Latency Median (s):                      11.5595
Latency P99 (s):                         11.5955
Latency P95 (s):                         11.5804
--------------------------------------------------
Peak Memory Max (MB):                    13626.00
Peak Memory Mean (MB):                   13626.00
Peak Memory Median (MB):                 13626.00

============================================================

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/z-image
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  38.95
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              1.28
Latency Mean (s):                        10.5946
Latency Median (s):                      12.4189
Latency P99 (s):                         12.4794
Latency P95 (s):                         12.4735
--------------------------------------------------
Peak Memory Max (MB):                    16940.00
Peak Memory Mean (MB):                   16940.00
Peak Memory Median (MB):                 16940.00

============================================================



================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-dev-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  84.41
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.59
Latency Mean (s):                        22.9657
Latency Median (s):                      26.9953
Latency P99 (s):                         27.0379
Latency P95 (s):                         27.0342
--------------------------------------------------
Peak Memory Max (MB):                    65390.00
Peak Memory Mean (MB):                   65390.00
Peak Memory Median (MB):                 65390.00

vs 


================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-dev
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  92.37
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.54
Latency Mean (s):                        25.1176
Latency Median (s):                      29.5167
Latency P99 (s):                         29.5819
Latency P95 (s):                         29.5704
--------------------------------------------------
Peak Memory Max (MB):                    80366.00
Peak Memory Mean (MB):                   80366.00
Peak Memory Median (MB):                 80366.00

============================================================
Metrics saved to outputs/perf/flux2_dev_bf16_2gpu_c16_n50.json
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-klein-4b-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  18.59
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              2.69
Latency Mean (s):                        5.0586
Latency Median (s):                      5.9344
Latency P99 (s):                         5.9611
Latency P95 (s):                         5.9536
--------------------------------------------------
Peak Memory Max (MB):                    12758.00
Peak Memory Mean (MB):                   12758.00
Peak Memory Median (MB):                 12758.00
vs 

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/flux2-klein-4b
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  19.69
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              2.54
Latency Mean (s):                        5.3560
Latency Median (s):                      6.2916
Latency P99 (s):                         6.3101
Latency P95 (s):                         6.3021
--------------------------------------------------
Peak Memory Max (MB):                    14506.00
Peak Memory Mean (MB):                   14506.00
Peak Memory Median (MB):                 14506.00

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/hunyuan-image3-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  241.25
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.21
Latency Mean (s):                        65.6834
Latency Median (s):                      76.9524
Latency P99 (s):                         77.5409
Latency P95 (s):                         77.2703
--------------------------------------------------
Peak Memory Max (MB):                    96940.00
Peak Memory Mean (MB):                   96940.00
Peak Memory Median (MB):                 96940.00

============================================================

vs 

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/hunyuan-image3
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  282.36
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.18
Latency Mean (s):                        76.8621
Latency Median (s):                      90.0141
Latency P99 (s):                         90.8956
Latency P95 (s):                         90.5725
--------------------------------------------------
Peak Memory Max (MB):                    135402.00
Peak Memory Mean (MB):                   135402.00
Peak Memory Median (MB):                 135402.00

============================================================
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-modelopt-fp8
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  110.26
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.45
Latency Mean (s):                        29.9892
Latency Median (s):                      35.1732
Latency P99 (s):                         35.3604
Latency P95 (s):                         35.3243
--------------------------------------------------
Peak Memory Max (MB):                    39730.00
Peak Memory Mean (MB):                   39729.60
Peak Memory Median (MB):                 39730.00

============================================================
vs

================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  102.26
Request rate:                            inf
Max request concurrency:                 16
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.49
Latency Mean (s):                        27.8345
Latency Median (s):                      32.6885
Latency P99 (s):                         32.7558
Latency P95 (s):                         32.7227
--------------------------------------------------
Peak Memory Max (MB):                    46464.00
Peak Memory Mean (MB):                   46464.00
Peak Memory Median (MB):                 46464.00

============================================================
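For reference, the relative deltas in the comparison above can be computed directly from the reported numbers (values copied verbatim from the two benchmark tables; this is just arithmetic on the printed results, not a re-run):

```python
# Numbers copied from the two serving-benchmark tables above
# (qwen-image ModelOpt FP8 vs. the unquantized baseline).
fp8 = {"latency_mean_s": 29.9892, "peak_mem_mb": 39730.0, "throughput_rps": 0.45}
base = {"latency_mean_s": 27.8345, "peak_mem_mb": 46464.0, "throughput_rps": 0.49}

# Fraction of peak memory saved by the FP8 checkpoint.
mem_saving = 1 - fp8["peak_mem_mb"] / base["peak_mem_mb"]
# Relative change in mean latency (positive means FP8 was slower in this run).
latency_delta = fp8["latency_mean_s"] / base["latency_mean_s"] - 1

print(f"peak memory saving: {mem_saving:.1%}")      # ~14.5% less memory
print(f"mean latency delta: {latency_delta:+.1%}")  # ~+7.7% in this run
```

So at this concurrency the FP8 checkpoint trades a single-digit latency regression for a roughly 6.7 GB reduction in peak memory.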

@baonudesifeizhai
Author

https://paste.ubuntu.com/p/92yBc9x7bB/

curl -sS http://127.0.0.1:8160/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{
    "model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "messages": [
      {
        "role": "user",
        "content": "A beautiful cinematic photo of a small red fox sitting in a snowy forest, ultra detailed, soft natural light"
      }
    ],
    "max_tokens": 1024
  }' \
| /root/zdj/vllm/.venv/bin/python -c 'import sys,json,re,base64; r=json.load(sys.stdin); s=json.dumps(r); m=re.search(r"data:image/png;base64,([A-Za-z0-9+/=]+)", s); assert m, s[:1000]; open("output.png","wb").write(base64.b64decode(m.group(1))); print("saved output.png")'

(generated image: red fox in a snowy forest)
 curl -sS http://127.0.0.1:8160/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{
    "model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "messages": [
      {
        "role": "user",
        "content": "A colossal warp monster emerging from a torn reality rift inside a gothic sci-fi battlefield, grimdark far future war aesthetic, twisted horns, glowing eyes, corrupted flesh, black armor fragments, chaotic purple and red energy, cathedral ruins, smoke, fire, cinematic lighting, ultra detailed, dramatic composition"
      }
    ],
    "max_tokens": 1024
  }' \
| /root/zdj/vllm/.venv/bin/python -c 'import sys,json,re,base64; r=json.load(sys.stdin); s=json.dumps(r); m=re.search(r"data:image/png;base64,([A-Za-z0-9+/=]+)", s); assert m, s[:1000]; open("warpspawn_40k_grimdark.png","wb").write(base64.b64decode(m.group(1))); print("saved warpspawn_40k_grimdark.png")'
(generated image: warp monster battlefield scene)
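The inline `python -c` one-liner used in both curl commands above can be expanded into a readable helper. This is the same regex-on-serialized-JSON approach; the function name and response shape in the test are ours, since the one-liner only assumes a `data:image/png;base64,...` URI appears somewhere in the response:

```python
import base64
import json
import re

# Matches the first base64 PNG data URI anywhere in the serialized response,
# exactly like the inline one-liner in the curl pipelines above.
DATA_URI_RE = re.compile(r"data:image/png;base64,([A-Za-z0-9+/=]+)")


def save_first_png(response_json: str, path: str) -> bool:
    """Find the first base64 PNG data URI in a JSON response and write it out.

    Returns True if an image was found and saved, False otherwise.
    """
    # Round-trip through json to normalize escaping before searching.
    payload = json.dumps(json.loads(response_json))
    match = DATA_URI_RE.search(payload)
    if match is None:
        return False
    with open(path, "wb") as f:
        f.write(base64.b64decode(match.group(1)))
    return True
```

Searching the serialized JSON rather than walking the response structure keeps the helper agnostic to where the server nests the image in `choices`.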
================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  254.98
Request rate:                            inf
Max request concurrency:                 32
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.20
Latency Mean (s):                        113.6107
Latency Median (s):                      129.1408
Latency P99 (s):                         180.3258
Latency P95 (s):                         180.2265
--------------------------------------------------
Peak Memory Max (MB):                    84630.00
Peak Memory Mean (MB):                   81597.36
Peak Memory Median (MB):                 84630.00
vs. the unquantized qwen-image-2512 baseline:


================= Serving Benchmark Result =================
Backend:                                 vllm-omni
Model:                                   /root/zdj/models/qwen-image-2512
Dataset:                                 random
Task:                                    t2i
--------------------------------------------------
Benchmark duration (s):                  285.53
Request rate:                            inf
Max request concurrency:                 32
Successful requests:                     50/50
--------------------------------------------------
Request throughput (req/s):              0.18
Latency Mean (s):                        127.2317
Latency Median (s):                      144.4706
Latency P99 (s):                         202.5307
Latency P95 (s):                         202.4224
--------------------------------------------------
Peak Memory Max (MB):                    97340.00
Peak Memory Mean (MB):                   94306.96
Peak Memory Median (MB):                 97340.00

============================================================

Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
baonudesifeizhai and others added 6 commits April 25, 2026 16:07
Signed-off-by: baonudesifeizhai <85092850+baonudesifeizhai@users.noreply.github.com>
@baonudesifeizhai
Author

cat >/tmp/modelopt_quality_cases.json <<'JSON'
[
  {
    "id": "qwen_image_2512_modelopt_fp8_dynamic_all",
    "baseline_model": "/root/zdj/models/qwen-image-2512",
    "quantized_model": "/root/zdj/models/qwen-image-2512-modelopt-fp8-dynamic-all",
    "task": "t2i",
    "prompt": "a fox sitting in the snow in a forest, realistic photo",
    "max_lpips": 0.35,
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 20,
    "seed": 42,
    "negative_prompt": "blurry, low quality"
  }
]
JSON
export VLLM_OMNI_QUALITY_CONFIGS=/tmp/modelopt_quality_cases.json
export VLLM_OMNI_QUALITY_OUTPUT_DIR=/tmp/modelopt_quality_outputs

PYTHONPATH=/root/zdj/vllm-omni:/root/zdj/vllm:${PYTHONPATH:-} \
/root/zdj/vllm/.venv/bin/python -m pytest \
  tests/diffusion/quantization/test_quantization_quality.py \
  -v -m "" -k qwen_image_2512_modelopt_fp8_dynamic_all

tests/diffusion/quantization/test_quantization_quality.py::test_quantization_quality[qwen_image_2512_modelopt_fp8_dynamic_all] PASSED [100%]
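Before launching the (slow) quality run, the case file can be sanity-checked with a small pre-flight script. The required-key set below is an assumption taken from the fields used in this example, not the test suite's actual schema:

```python
import json

# Keys this example case actually relies on; assumed minimum, not the
# authoritative schema of test_quantization_quality.py.
REQUIRED = {"id", "baseline_model", "quantized_model", "task", "prompt", "max_lpips"}


def check_cases(path: str) -> list[str]:
    """Validate a quality-case JSON file and return the case ids."""
    with open(path) as f:
        cases = json.load(f)
    ids = []
    for case in cases:
        missing = REQUIRED - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} missing: {sorted(missing)}")
        # LPIPS is a perceptual distance; thresholds outside (0, 1] are suspect.
        assert 0.0 < case["max_lpips"] <= 1.0, "max_lpips should be in (0, 1]"
        ids.append(case["id"])
    return ids
```

Running `check_cases("/tmp/modelopt_quality_cases.json")` on the config above should return `["qwen_image_2512_modelopt_fp8_dynamic_all"]`, matching the `-k` filter passed to pytest.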
