
Phase1 (video-gen) ModelOpt FP8 Follow-ups #57

Open
ArtificialRay wants to merge 26 commits into lishunyang12:modelopt-fp8-hv15 from ArtificialRay:wan2-2-fp8-offline

Conversation

@ArtificialRay commented May 7, 2026

This PR completes the follow-ups listed in vllm-project#2924. All benchmarking in the Validation sections below is done with benchmarks/diffusion/quantization_quality.py.

Purpose

Phase 1 of vllm-project#2709 — extends ModelOpt FP8 support to video-gen models. This PR adds ModelOpt FP8 static quantization support for the Wan2.2 T2V-A14B / I2V-A14B MoE variants, the Wan2.2 VACE variant, and the HunyuanVideo-1.5 720p T2V/I2V variants, plus block-wise static FP8 quantization for all of the above.

Changes

ModelOpt FP8 helpers

  • examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py -- HV-1.5 calibrator; adds I2V variant support and patches quant_algo: FP8 for per-tensor quantization or quant_algo: FP8_PB_WO for per-block quantization. Per-block quantization is supported only for M=N=128.
  • examples/quantization/quantize_wan2_2_modelopt_fp8.py -- Wan2.2 TI2V-5B / T2V-A14B / I2V-A14B calibrator; for the A14B models it loads the two transformer pipelines onto two separate GPUs for calibration.
  • examples/quantization/check_modelopt_fp8_export.py -- Verifier. Adds on-disk transformer weight reduction metrics and whole-repo (whole-model) weight size reduction.
  • examples/offline_inference/vace/vace_video_generation.py -- Wan2.2 VACE variant script; supports T2V / I2V / R2V per-tensor or per-block static quantization.

All calibrators share the --weight-block-size 'M,N' flag for block-wise FP8 and use the same fallback pattern as vllm-project#2924: _force_export_quantized_weights + _patch_quant_config + hide_quantizers_from_state_dict. A hedged sketch of the config-patch step follows.
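For reference, a minimal sketch of the config-patch step (assumptions: the helper name, argument list, and exact keys are illustrative; only the quant_algo values FP8 / FP8_PB_WO come from this PR):

```python
# Illustrative sketch of the config-patch step (hypothetical helper; the exact
# keys written by _patch_quant_config may differ). It stamps quant_algo into
# transformer/config.json so the engine can auto-detect ModelOpt FP8 later.
import json
from pathlib import Path

def patch_quant_config(export_dir: str, weight_block_size: tuple[int, int] | None = None) -> None:
    config_path = Path(export_dir) / "transformer" / "config.json"
    config = json.loads(config_path.read_text())
    # Per-block weight-only FP8 uses FP8_PB_WO; plain per-tensor uses FP8.
    quant_config = {"quant_algo": "FP8_PB_WO" if weight_block_size else "FP8"}
    if weight_block_size:
        quant_config["weight_block_size"] = list(weight_block_size)
    config["quantization_config"] = quant_config
    config_path.write_text(json.dumps(config, indent=2))
```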

Script wiring

  • examples/offline_inference/image_to_video/image_to_video.py -- adds --quantization and --ignore-layers to support I2V FP8 quantization testing
  • benchmarks/diffusion/quantization_quality.py -- supports quantization-quality benchmarking for both T2V and I2V video-gen models; adds throughput, peak VRAM, and peak VRAM reduction metrics. Also fixes two bugs: the Memory metric previously always reported 0.0, and T2I benchmarking raised an AttributeError at generate_image(). See the VRAM-measurement sketch after this list.
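A minimal sketch of how the peak-VRAM metric can be measured (assumed approach; measure_peak_vram_gib is a hypothetical helper, not the benchmark script's actual code). The classic cause of a peak-memory metric stuck at 0.0 is reading the counter without resetting it around the run:

```python
# Assumed measurement approach, for illustration only.
from typing import Callable

import torch

def measure_peak_vram_gib(run: Callable[[], None]) -> float:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()   # reset BEFORE the measured region
    run()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```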

Adapter

  • modelopt_fp8.py: _is_transformer_source adds support for the Wan2.2 A14B MoE layout by recognizing both transformer architectures (sketch below)
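A hedged sketch of what the MoE-aware check amounts to (illustrative body, not the PR's exact code): Wan2.2 A14B ships two denoisers, transformer/ and transformer_2/, and both must be treated as quantizable transformer sources.

```python
def _is_transformer_source(source_name: str) -> bool:
    # Wan2.2 A14B MoE exports two denoiser directories; accept both.
    # (Illustrative sketch; the real helper likely inspects more than the name.)
    return source_name in ("transformer", "transformer_2")
```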

Validation --- Wan2.2-I2V-A14B (1x H100 80GB, I2V 720x1280, 81 frames, 50 steps, seed=42)

torch.compile enabled (default). --vae-use-tiling is set during benchmarking because the BF16 baseline hits CUDA OOM without it.

BF16 baseline vs. per-tensor quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 984.98 s | 901.70 s | −8% (1.09× speedup) |
| Denoise throughput | 19.27 s/it | 17.61 s/it | −9% |
| Peak GPU memory | 71.64 GiB | 46.36 GiB | −35% |
| On-disk transformer weights | 106.46 GiB | 27.17 GiB | −74.5% |
| Model load (resident) | 117.54 GiB | 38.25 GiB | −67.5% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1826 | — |

BF16 baseline vs. per-block quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 972.50 s | 930.24 s | −4% (1.05× speedup) |
| Denoise throughput | 19.02 s/it | 18.18 s/it | −4% |
| Peak GPU memory | 71.64 GiB | 47.67 GiB | −33% |
| On-disk transformer weights | 106.46 GiB | 27.17 GiB | −74.5% |
| Model load (resident) | 117.54 GiB | 38.25 GiB | −67.5% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1686 | — |

Engine signals confirming the path is wired correctly:

  • factory.py: Building quantization config: fp8 then Building quantization config: modelopt — auto-detect upgraded the user's --quantization fp8 flag to ModelOpt based on quant_algo: FP8 or quant_algo: FP8_PB_WO in transformer/config.json (a hedged sketch of this logic follows the list)
  • data.py: Auto-detected quantization 'modelopt' from model config
  • modelopt.py:381: Detected ModelOpt fp8 checkpoint (quant_algo=FP8) for per-tensor; Detected ModelOpt fp8 checkpoint (quant_algo=FP8_PB_WO) for per-block
  • __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod (per-tensor only) — the ModelOpt FP8 kernel is selected
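A sketch of the auto-upgrade logic implied by these logs (function and key names are assumptions; only the quant_algo values and the 'modelopt' result come from the logs above):

```python
import json
from pathlib import Path

def resolve_quantization(model_dir: str, requested: str) -> str:
    """Upgrade a generic --quantization fp8 request to ModelOpt when the
    checkpoint's transformer/config.json carries ModelOpt quant metadata.
    Hypothetical helper, for illustration only."""
    config_path = Path(model_dir) / "transformer" / "config.json"
    if requested == "fp8" and config_path.exists():
        config = json.loads(config_path.read_text())
        quant_algo = config.get("quantization_config", {}).get("quant_algo")
        if quant_algo in ("FP8", "FP8_PB_WO"):
            return "modelopt"  # surfaces as "Auto-detected quantization 'modelopt'"
    return requested
```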

Visual comparison -- Wan2.2-I2V-A14B

BF16 baseline:

wan22_A14B_bf16.mp4

ModelOpt FP8 per-tensor (this PR):

wan22_A14B_fp8_per_tensor.mp4

ModelOpt FP8 per-block (this PR):

wan22_A14B_fp8_per_block.mp4

Same prompt ("A skateboarder in a purple bomber jacket doing a kickflip in a foggy urban plaza, overcast morning light, slow motion, european architecture in the background."), same seed, same sampling params. Output is BF16-equivalent
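One plausible way a Mean LPIPS number like the ones above can be computed (assumed sketch; the metric code in quantization_quality.py may differ): frame-wise LPIPS between the BF16 reference video and the FP8 video, averaged over frames.

```python
import lpips  # pip install lpips
import torch

def mean_lpips(ref_frames: torch.Tensor, fp8_frames: torch.Tensor) -> float:
    """ref_frames / fp8_frames: (T, 3, H, W) tensors scaled to [-1, 1]."""
    loss_fn = lpips.LPIPS(net="alex").eval()
    with torch.no_grad():
        scores = [loss_fn(r[None], q[None]).item()
                  for r, q in zip(ref_frames, fp8_frames)]
    return sum(scores) / len(scores)
```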

Validation --- HunyuanVideo-1.5 720p (1x H100 80GB, T2V 720x1280, 49 frames, 30 steps, seed=42)

torch.compile enabled (default). --vae-use-tiling is set during benchmarking because both the BF16 baseline and ModelOpt FP8 hit CUDA OOM without it.

BF16 baseline vs. per-tensor quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 137.64 s | 131.52 s | −4% (1.05× speedup) |
| Denoise throughput | 4.59 s/it | 4.38 s/it | −4% |
| Peak GPU memory | 51.94 GiB | 46.96 GiB | −10% |
| On-disk transformer weights | 106.46 GiB | 27.17 GiB | −74.5% |
| Model load (resident) | 117.54 GiB | 38.25 GiB | −67.5% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.2211 | — |

BF16 baseline vs. per-block quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 136.12 s | 135.38 s | −0.5% (1.01× speedup) |
| Denoise throughput | 4.54 s/it | 4.51 s/it | −0.5% |
| Peak GPU memory | 51.94 GiB | 46.96 GiB | −10% |
| On-disk transformer weights | 31.02 GiB | 10.45 GiB | −66.3% |
| Model load (resident) | 49.72 GiB | 29.15 GiB | −41.4% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1911 | — |

Engine signals confirming the path is wired correctly: same four signals as in the Wan2.2-I2V-A14B section above (auto-detect upgrade to ModelOpt, quant_algo detection, and CutlassFP8ScaledMMLinearKernel selection for per-tensor).

Visual comparison -- HunyuanVideo-1.5 720p

BF16 baseline:

hunyuan_720p_bf16.mp4

ModelOpt FP8 per-tensor (this PR):

hunyuan_720p_fp8_per_tensor.mp4

ModelOpt FP8 per-block (this PR):

hunyuan_720p_fp8_per_block.mp4

Same prompt ("An astronaut in a white spacesuit riding a horse across the lunar surface, gray dust kicked up by the horse's hooves, Earth visible in the black sky, lunar lander in the distance, cinematic wide shot. Make sure the astronaut is really moving!") and negative prompt (""vibrant colors, overexposed, static, blurred details, subtitles, style, artwork, painting, picture, still, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, still frame, cluttered background, three legs, many people in the background, walking backwards"") same seed, same sampling params. Output is BF16-equivalent

Validation --- Wan2.1-VACE-14B (1x H100 80GB, R2V 480x832, 49 frames, 30 steps, seed=42)

torch.compile enabled (default).

BF16 baseline vs. per-tensor quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 112.91 s | 95.62 s | −15% (1.18× speedup) |
| Denoise throughput | 3.76 s/it | 3.19 s/it | −15% |
| Peak GPU memory | 53.98 GiB | 38.51 GiB | −29% |
| On-disk transformer weights | 58.91 GiB | 16.62 GiB | −71.8% |
| Model load (resident) | 69.99 GiB | 27.71 GiB | −60.4% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.2619 | — |

BF16 baseline vs. per-block quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 111.96 s | 103.24 s | −8% (1.08× speedup) |
| Denoise throughput | 3.73 s/it | 3.44 s/it | −8% |
| Peak GPU memory | 53.98 GiB | 38.52 GiB | −29% |
| On-disk transformer weights | 58.91 GiB | 16.62 GiB | −71.8% |
| Model load (resident) | 69.99 GiB | 27.71 GiB | −60.4% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1640 | — |

Engine signals confirming the path is wired correctly: same four signals as in the Wan2.2-I2V-A14B section above (auto-detect upgrade to ModelOpt, quant_algo detection, and CutlassFP8ScaledMMLinearKernel selection for per-tensor).

Visual comparison -- Wan2.1-VACE-14B

BF16 baseline:

r2v_output_bf16.mp4

ModelOpt FP8 per-tensor (this PR):

r2v_output_fp8_per_tensor.mp4

ModelOpt FP8 per-block (this PR):

r2v_output_fp8_per_block.mp4

Same prompt ("An astronaut in a white spacesuit riding a horse across the lunar surface, gray dust kicked up by the horse's hooves, Earth visible in the black sky, lunar lander in the distance, cinematic wide shot. Make sure the astronaut is really moving!"), same seed, same sampling params. Output is visually BF16-equivalent.

Test Plan

**Wan2.2-I2V-A14B**

  • Calibration script completes on 2x H100 -- 400 weights converted to F8_E4M3
  • Checker reports quant_algo: FP8 for per-tensor and quant_algo: FP8_PB_WO for per-block
  • On-disk transformer size (transformer + transformer_2): 27.17 GiB (74.5% smaller than the 106.46 GiB BF16 baseline)
  • Loads via the adapter from vllm-project/vllm-omni#2913 ("(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints" #2709); log shows Auto-detected quantization 'modelopt'
  • End-to-end inference produces valid video; visual parity with BF16
  • Memory reduction, total wall time, and throughput (s/it) recorded

**HunyuanVideo-1.5 720p**

**Wan2.1-VACE-14B**

  • Calibration script completes on 2x H100 -- 481 weights converted to F8_E4M3
  • Checker reports quant_algo: FP8 for per-tensor and quant_algo: FP8_PB_WO for per-block
  • On-disk transformer size (transformer + transformer_2): 16.62 GiB (71.8% smaller than the 58.91 GiB BF16 baseline)
  • Loads via the adapter from vllm-project/vllm-omni#2913 ("(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints" #2709); log shows Auto-detected quantization 'modelopt'
  • End-to-end inference produces valid video; visual parity with BF16
  • Memory reduction, total wall time, and throughput (s/it) recorded

How to use

Pre-calibrated checkpoints are published on Hugging Face:

Option A: use the published checkpoints (no calibration needed)

# Wan2.2-I2V-A14B → image_to_video.py
python examples/offline_inference/image_to_video/image_to_video.py \
    --model ArtificialRay7579/wan2.2-i2v-a14b-modelopt-fp8-per-block \
    --quantization fp8 \
    --image /path/to/reference.jpg \
    --prompt 'A cat playing with yarn' \
    --num-frames 81 --num-inference-steps 50 \
    --guidance-scale 5.0 --seed 42 \
    --output outputs/wan22_i2v_modelopt_fp8.mp4

# HunyuanVideo-1.5 720p T2V → text_to_video.py
python examples/offline_inference/text_to_video/text_to_video.py \
    --model ArtificialRay7579/hv15-720p-t2v-fp8-per-tensor \
    --quantization fp8 \
    --prompt 'A dog running across a field of golden wheat.' \
    --height 720 --width 1280 --num-frames 49 \
    --num-inference-steps 30 --guidance-scale 6.0 --seed 42 \
    --output outputs/hv15_720p_modelopt_fp8.mp4

# Wan2.1-VACE-14B (R2V mode) → vace_video_generation.py
python examples/offline_inference/vace/vace_video_generation.py \
    --model ArtificialRay7579/wan21-vace-14b-r2v-fp8-per-block \
    --quantization fp8 \
    --mode r2v --image /path/to/reference.jpg \
    --prompt 'A robot inspecting the workbench' \
    --height 480 --width 832 --num-frames 81 \
    --num-inference-steps 30 --guidance-scale 5.0 --seed 42 \
    --output outputs/wan21_vace14b_r2v_modelopt_fp8.mp4

Option B: calibrate from BF16 yourself (reproducibility / custom prompts)

# 1. Install
pip install 'nvidia-modelopt[all]'

# 2a. Calibrate Wan2.2-I2V-A14B (~25 min on 2×H100, MoE + image conditioning)
#     I2V requires ref images — WanImageToVideoPipeline takes `image` as a
#     required kwarg. --calib-boundary-ratio 0.5 boosts transformer's amax
#     sample on pass 1; pass 2 auto-restores production 0.875 for transformer_2.
python examples/quantization/quantize_wan2_2_modelopt_fp8.py \
    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
    --output ./wan22-i2v-modelopt-fp8 \
    --is-i2v --reference-images /path/to/ref_images/ \
    --calib-boundary-ratio 0.5 --overwrite

# 2b. Calibrate HunyuanVideo-1.5 720p T2V (~20 min on 1×H100)
#     Native 720p resolution: --height 720 --width 1280. For the 720p I2V
#     variant, swap to ...-720p_i2v and add --variant i2v --reference-images.
python examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py \
    --model hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v \
    --output ./hv15-720p-modelopt-fp8 \
    --height 720 --width 1280 --overwrite

# 2c. Calibrate Wan2.1-VACE-14B (~25 min on 1×H100, single transformer with vace_blocks)
#     R2V mode: pass --reference-images so half the calibration samples become
#     R2V (prompt + ref image) — vace_blocks' amax then covers real ref-image
#     latents instead of zero-padded T2V-only inputs.
python examples/quantization/quantize_wan2_2_vace_modelopt_fp8.py \
    --model Wan-AI/Wan2.1-VACE-14B-diffusers \
    --output ./wan21-vace14b-modelopt-fp8 \
    --reference-images /path/to/ref_images/ \
    --overwrite

# 2d. (optional) Per-block weight quantization — better numerical fidelity, ~5–10%
#     extra latency. Block size hardcoded to 128x128 by upstream vLLM.
#     Append to any of the calls above:
#     --weight-block-size 128,128

# 3. (optional) Verify each export — checks quantization_config + on-disk FP8
#    dtype + per-tensor vs per-block weight_scale shapes.
python examples/quantization/check_modelopt_fp8_export.py --output ./wan22-i2v-modelopt-fp8
python examples/quantization/check_modelopt_fp8_export.py --output ./hv15-720p-modelopt-fp8
python examples/quantization/check_modelopt_fp8_export.py --output ./wan21-vace14b-modelopt-fp8

# 4. Serve — `--quantization fp8` is auto-upgraded to ModelOpt FP8 because
#    the checkpoint's transformer/config.json carries modelopt metadata.
#   (same invocation as Option A, just pass the local output path as --model)
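For intuition, a hedged sketch of the per-tensor vs per-block weight_scale shape check the verifier performs (assumptions: the *_scale tensor naming and exact logic are illustrative; the 128x128 block size is from this PR):

```python
import math

from safetensors import safe_open

def check_weight_scale(shard_path: str, weight_name: str, per_block: bool) -> bool:
    with safe_open(shard_path, framework="pt") as f:
        weight = f.get_tensor(weight_name)            # FP8 (float8_e4m3fn) weight, shape (M, N)
        scale = f.get_tensor(weight_name + "_scale")  # assumed scale-tensor naming
    if not per_block:
        return scale.numel() == 1                     # per-tensor: one scalar scale
    m, n = weight.shape
    # Per-block (FP8_PB_WO): one scale per 128x128 weight block.
    return tuple(scale.shape) == (math.ceil(m / 128), math.ceil(n / 128))
```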

Known limitations

  • HV-1.5 720p T2V's speedup is likely limited by enabling --vae-use-tiling; the wall-time and throughput improvements are nearly negligible.
  • HV-1.5 and Wan2.2 aren't in ModelOpt's recognized-model registry — QKV fusion is skipped and we hand-roll the weight-export path (a per-tensor export sketch follows this list).
  • MHA quantizers (K/V/softmax) are off by default because quantizing the attention computation amplifies FP8 drift.
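A minimal sketch of what a hand-rolled per-tensor FP8 export amounts to (illustrative only, not _force_export_quantized_weights itself): scale each weight by its amax so its range maps into float8_e4m3fn's ±448, then store the FP8 weight plus the scale.

```python
import torch

def export_fp8_per_tensor(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    finfo = torch.finfo(torch.float8_e4m3fn)           # max ≈ 448
    scale = weight.abs().amax().float() / finfo.max    # per-tensor scale from weight amax
    qweight = (weight.float() / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale                              # dequantize: qweight.float() * scale
```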

Follow-ups

  • YAML configuration for these model checkpoints
  • Publish calibrated checkpoints to HF Hub under vllm-project-org/

ArtificialRay marked this pull request as draft May 7, 2026 19:03
ArtificialRay marked this pull request as ready for review May 9, 2026 05:16
@lishunyang12 (Owner):

Thanks for your contribution. May I know on which device you tested ModelOpt FP8?

@ArtificialRay (Author):

> Thanks for your contribution. May I know on which device you tested ModelOpt FP8?

Thanks for the reply. I used an H100 80GB for the ModelOpt FP8 tests.

@lishunyang12 (Owner):

> Thanks for your contribution. May I know on which device you tested ModelOpt FP8?
>
> Thanks for the reply. I used an H100 80GB for the ModelOpt FP8 tests.

May I have your contact?

@ArtificialRay (Author):

You can contact me via WeChat: ArthurRay2333
