
Phase1 (video-gen) ModelOpt FP8 Follow-ups #57

Open
ArtificialRay wants to merge 26 commits into lishunyang12:modelopt-fp8-hv15 from ArtificialRay:wan2-2-fp8-offline

Conversation

@ArtificialRay commented May 7, 2026

This PR completes the follow-ups listed in vllm-project#2924. All benchmarking in the Validation sections below is done with benchmarks/diffusion/quantization_quality.py.

Purpose

Phase 1 of vllm-project#2709 — extends ModelOpt FP8 support to video-gen models. This PR adds ModelOpt FP8 static quantization support for the Wan2.2 T2V-A14B / I2V-A14B MoE variants, the Wan2.2 VACE variant, and the HunyuanVideo-1.5 720p T2V/I2V variants, plus block-wise static FP8 quantization for all of the above.

Changes

ModelOpt FP8 helpers

  • examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py -- HV-1.5 calibrator; adds I2V variant support and patches quant_algo: FP8 for per-tensor quantization or quant_algo: FP8_PB_WO for per-block quantization. Per-block quantization is supported only for M=N=128.
  • examples/quantization/quantize_wan2_2_modelopt_fp8.py -- Wan2.2 TI2V-5B / T2V-A14B / I2V-A14B calibrator; for the A14B models it loads the two transformer pipelines onto two separate GPUs for calibration.
  • examples/quantization/check_modelopt_fp8_export.py -- Verifier. Adds on-disk transformer weight reduction metrics and whole-repo (whole-model) weight size reduction.
  • examples/offline_inference/vace/vace_video_generation.py -- Wan2.2 VACE variant script; supports T2V / I2V / R2V per-tensor or per-block static quantization.

All calibrators share the --weight-block-size 'M,N' flag for block-wise FP8 and use the same fallback pattern as vllm-project#2924: _force_export_quantized_weights + _patch_quant_config + hide_quantizers_from_state_dict. A hedged sketch of the config-patch step follows.
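For reference, a minimal sketch of the config-patch step (assumptions: the helper name, argument list, and exact keys are illustrative; only the quant_algo values FP8 / FP8_PB_WO come from this PR):

```python
# Illustrative sketch of the config-patch step (hypothetical helper; the exact
# keys written by _patch_quant_config may differ). It stamps quant_algo into
# transformer/config.json so the engine can auto-detect ModelOpt FP8 later.
import json
from pathlib import Path

def patch_quant_config(export_dir: str, weight_block_size: tuple[int, int] | None = None) -> None:
    config_path = Path(export_dir) / "transformer" / "config.json"
    config = json.loads(config_path.read_text())
    # Per-block weight-only FP8 uses FP8_PB_WO; plain per-tensor uses FP8.
    quant_config = {"quant_algo": "FP8_PB_WO" if weight_block_size else "FP8"}
    if weight_block_size:
        quant_config["weight_block_size"] = list(weight_block_size)
    config["quantization_config"] = quant_config
    config_path.write_text(json.dumps(config, indent=2))
```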

Script wiring

  • examples/offline_inference/image_to_video/image_to_video.py -- adds --quantization and --ignore-layers to support I2V FP8 quantization testing
  • benchmarks/diffusion/quantization_quality.py -- supports quantization-quality benchmarking for both T2V and I2V video-gen models; adds throughput, peak VRAM, and peak VRAM reduction metrics. Also fixes two bugs: the Memory metric previously always reported 0.0, and T2I benchmarking raised an AttributeError at generate_image(). See the VRAM-measurement sketch after this list.
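A minimal sketch of how the peak-VRAM metric can be measured (assumed approach; measure_peak_vram_gib is a hypothetical helper, not the benchmark script's actual code). The classic cause of a peak-memory metric stuck at 0.0 is reading the counter without resetting it around the run:

```python
# Assumed measurement approach, for illustration only.
from typing import Callable

import torch

def measure_peak_vram_gib(run: Callable[[], None]) -> float:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()   # reset BEFORE the measured region
    run()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```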

Adapter

  • modelopt_fp8.py: _is_transformer_source adds support for the Wan2.2 A14B MoE layout by recognizing both transformer architectures (sketch below)
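A hedged sketch of what the MoE-aware check amounts to (illustrative body, not the PR's exact code): Wan2.2 A14B ships two denoisers, transformer/ and transformer_2/, and both must be treated as quantizable transformer sources.

```python
def _is_transformer_source(source_name: str) -> bool:
    # Wan2.2 A14B MoE exports two denoiser directories; accept both.
    # (Illustrative sketch; the real helper likely inspects more than the name.)
    return source_name in ("transformer", "transformer_2")
```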

Validation --- Wan2.2-I2V-A14B (1x H100 80GB, I2V 720x1280, 81 frames, 50 steps, seed=42)

torch.compile enabled (default). --vae-use-tiling is set during benchmarking because the BF16 baseline hits CUDA OOM without it.

BF16 baseline vs. per-tensor quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 984.98 s | 901.70 s | −8% (1.09× speedup) |
| Denoise throughput | 19.27 s/it | 17.61 s/it | −9% |
| Peak GPU memory | 71.64 GiB | 46.36 GiB | −35% |
| On-disk transformer weights | 106.46 GiB | 27.17 GiB | −74.5% |
| Model load (resident) | 117.54 GiB | 38.25 GiB | −67.5% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1826 | — |

BF16 baseline vs. per-block quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 972.50 s | 930.24 s | −4% (1.05× speedup) |
| Denoise throughput | 19.02 s/it | 18.18 s/it | −4% |
| Peak GPU memory | 71.64 GiB | 47.67 GiB | −33% |
| On-disk transformer weights | 106.46 GiB | 27.17 GiB | −74.5% |
| Model load (resident) | 117.54 GiB | 38.25 GiB | −67.5% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1686 | — |

Engine signals confirming the path is wired correctly:

  • factory.py: Building quantization config: fp8 then Building quantization config: modelopt — auto-detect upgraded the user's --quantization fp8 flag to ModelOpt based on quant_algo: FP8 or quant_algo: FP8_PB_WO in transformer/config.json (a hedged sketch of this logic follows the list)
  • data.py: Auto-detected quantization 'modelopt' from model config
  • modelopt.py:381: Detected ModelOpt fp8 checkpoint (quant_algo=FP8) for per-tensor; Detected ModelOpt fp8 checkpoint (quant_algo=FP8_PB_WO) for per-block
  • __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod (per-tensor only) — the ModelOpt FP8 kernel is selected
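A sketch of the auto-upgrade logic implied by these logs (function and key names are assumptions; only the quant_algo values and the 'modelopt' result come from the logs above):

```python
import json
from pathlib import Path

def resolve_quantization(model_dir: str, requested: str) -> str:
    """Upgrade a generic --quantization fp8 request to ModelOpt when the
    checkpoint's transformer/config.json carries ModelOpt quant metadata.
    Hypothetical helper, for illustration only."""
    config_path = Path(model_dir) / "transformer" / "config.json"
    if requested == "fp8" and config_path.exists():
        config = json.loads(config_path.read_text())
        quant_algo = config.get("quantization_config", {}).get("quant_algo")
        if quant_algo in ("FP8", "FP8_PB_WO"):
            return "modelopt"  # surfaces as "Auto-detected quantization 'modelopt'"
    return requested
```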

Visual comparison -- Wan2.2-I2V-A14B

BF16 baseline:

wan22_A14B_bf16.mp4

ModelOpt FP8 per-tensor (this PR):

wan22_A14B_fp8_per_tensor.mp4

ModelOpt FP8 per-block (this PR):

wan22_A14B_fp8_per_block.mp4

Same prompt ("A skateboarder in a purple bomber jacket doing a kickflip in a foggy urban plaza, overcast morning light, slow motion, european architecture in the background."), same seed, same sampling params. Output is BF16-equivalent
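One plausible way a Mean LPIPS number like the ones above can be computed (assumed sketch; the metric code in quantization_quality.py may differ): frame-wise LPIPS between the BF16 reference video and the FP8 video, averaged over frames.

```python
import lpips  # pip install lpips
import torch

def mean_lpips(ref_frames: torch.Tensor, fp8_frames: torch.Tensor) -> float:
    """ref_frames / fp8_frames: (T, 3, H, W) tensors scaled to [-1, 1]."""
    loss_fn = lpips.LPIPS(net="alex").eval()
    with torch.no_grad():
        scores = [loss_fn(r[None], q[None]).item()
                  for r, q in zip(ref_frames, fp8_frames)]
    return sum(scores) / len(scores)
```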

Validation --- HunyuanVideo-1.5 720p (1x H100 80GB, T2V 720x1280, 49 frames, 30 steps, seed=42)

torch.compile enabled (default). --vae-use-tiling is set during benchmarking because both the BF16 baseline and ModelOpt FP8 hit CUDA OOM without it.

BF16 baseline vs. per-tensor quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 137.64 s | 131.52 s | −4% (1.05× speedup) |
| Denoise throughput | 4.59 s/it | 4.38 s/it | −4% |
| Peak GPU memory | 51.94 GiB | 46.96 GiB | −10% |
| On-disk transformer weights | 106.46 GiB | 27.17 GiB | −74.5% |
| Model load (resident) | 117.54 GiB | 38.25 GiB | −67.5% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.2211 | — |

BF16 baseline vs. per-block quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 136.12 s | 135.38 s | −0.5% (1.01× speedup) |
| Denoise throughput | 4.54 s/it | 4.51 s/it | −0.5% |
| Peak GPU memory | 51.94 GiB | 46.96 GiB | −10% |
| On-disk transformer weights | 31.02 GiB | 10.45 GiB | −66.3% |
| Model load (resident) | 49.72 GiB | 29.15 GiB | −41.4% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1911 | — |

Engine signals confirming the path is wired correctly: same four signals as in the Wan2.2-I2V-A14B section above (auto-detect upgrade to ModelOpt, quant_algo detection, and CutlassFP8ScaledMMLinearKernel selection for per-tensor).

Visual comparison -- HunyuanVideo-1.5 720p

BF16 baseline:

hunyuan_720p_bf16.mp4

ModelOpt FP8 per-tensor (this PR):

hunyuan_720p_fp8_per_tensor.mp4

ModelOpt FP8 per-block (this PR):

hunyuan_720p_fp8_per_block.mp4

Same prompt ("An astronaut in a white spacesuit riding a horse across the lunar surface, gray dust kicked up by the horse's hooves, Earth visible in the black sky, lunar lander in the distance, cinematic wide shot. Make sure the astronaut is really moving!") and negative prompt (""vibrant colors, overexposed, static, blurred details, subtitles, style, artwork, painting, picture, still, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, still frame, cluttered background, three legs, many people in the background, walking backwards"") same seed, same sampling params. Output is BF16-equivalent

Validation --- Wan2.1-VACE-14B (1x H100 80GB, R2V 480x832, 49 frames, 30 steps, seed=42)

torch.compile enabled (default).

BF16 baseline vs. per-tensor quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 112.91 s | 95.62 s | −15% (1.18× speedup) |
| Denoise throughput | 3.76 s/it | 3.19 s/it | −15% |
| Peak GPU memory | 53.98 GiB | 38.51 GiB | −29% |
| On-disk transformer weights | 58.91 GiB | 16.62 GiB | −71.8% |
| Model load (resident) | 69.99 GiB | 27.71 GiB | −60.4% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.2619 | — |

BF16 baseline vs. per-block quantization

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Total wall time | 111.96 s | 103.24 s | −8% (1.08× speedup) |
| Denoise throughput | 3.73 s/it | 3.44 s/it | −8% |
| Peak GPU memory | 53.98 GiB | 38.52 GiB | −29% |
| On-disk transformer weights | 58.91 GiB | 16.62 GiB | −71.8% |
| Model load (resident) | 69.99 GiB | 27.71 GiB | −60.4% |
| Visual fidelity (Mean LPIPS) | — (ref) | 0.1640 | — |

Engine signals confirming the path is wired correctly: same four signals as in the Wan2.2-I2V-A14B section above (auto-detect upgrade to ModelOpt, quant_algo detection, and CutlassFP8ScaledMMLinearKernel selection for per-tensor).

Visual comparison -- Wan2.1-VACE-14B

BF16 baseline:

r2v_output_bf16.mp4

ModelOpt FP8 per-tensor (this PR):

r2v_output_fp8_per_tensor.mp4

ModelOpt FP8 per-block (this PR):

r2v_output_fp8_per_block.mp4

Same prompt ("An astronaut in a white spacesuit riding a horse across the lunar surface, gray dust kicked up by the horse's hooves, Earth visible in the black sky, lunar lander in the distance, cinematic wide shot. Make sure the astronaut is really moving!"), same seed, same sampling params. Output is visually BF16-equivalent.

Test Plan

**Wan2.2-I2V-A14B**

  • Calibration script completes on 2x H100 -- 400 weights converted to F8_E4M3
  • Checker reports quant_algo: FP8 for per-tensor and quant_algo: FP8_PB_WO for per-block
  • On-disk transformer size (transformer + transformer_2): 27.17 GiB (74.5% smaller than the 106.46 GiB BF16 baseline)
  • Loads via the adapter from vllm-project/vllm-omni#2913 ("(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints" #2709); log shows Auto-detected quantization 'modelopt'
  • End-to-end inference produces valid video; visual parity with BF16
  • Memory reduction, total wall time, and throughput (s/it) recorded

**HunyuanVideo-1.5 720p**

**Wan2.1-VACE-14B**

  • Calibration script completes on 2x H100 -- 481 weights converted to F8_E4M3
  • Checker reports quant_algo: FP8 for per-tensor and quant_algo: FP8_PB_WO for per-block
  • On-disk transformer size (transformer + transformer_2): 16.62 GiB (71.8% smaller than the 58.91 GiB BF16 baseline)
  • Loads via the adapter from vllm-project/vllm-omni#2913 ("(Phase 1) Add ModelOpt FP8 auto-detect support for diffusion checkpoints" #2709); log shows Auto-detected quantization 'modelopt'
  • End-to-end inference produces valid video; visual parity with BF16
  • Memory reduction, total wall time, and throughput (s/it) recorded

How to use

Pre-calibrated checkpoints are published on Hugging Face:

Option A: use the published checkpoints (no calibration needed)

# Wan2.2-I2V-A14B → image_to_video.py
python examples/offline_inference/image_to_video/image_to_video.py \
    --model ArtificialRay7579/wan2.2-i2v-a14b-modelopt-fp8-per-block \
    --quantization fp8 \
    --image /path/to/reference.jpg \
    --prompt 'A cat playing with yarn' \
    --num-frames 81 --num-inference-steps 50 \
    --guidance-scale 5.0 --seed 42 \
    --output outputs/wan22_i2v_modelopt_fp8.mp4

# HunyuanVideo-1.5 720p T2V → text_to_video.py
python examples/offline_inference/text_to_video/text_to_video.py \
    --model ArtificialRay7579/hv15-720p-t2v-fp8-per-tensor \
    --quantization fp8 \
    --prompt 'A dog running across a field of golden wheat.' \
    --height 720 --width 1280 --num-frames 49 \
    --num-inference-steps 30 --guidance-scale 6.0 --seed 42 \
    --output outputs/hv15_720p_modelopt_fp8.mp4

# Wan2.1-VACE-14B (R2V mode) → vace_video_generation.py
python examples/offline_inference/vace/vace_video_generation.py \
    --model ArtificialRay7579/wan21-vace-14b-r2v-fp8-per-block \
    --quantization fp8 \
    --mode r2v --image /path/to/reference.jpg \
    --prompt 'A robot inspecting the workbench' \
    --height 480 --width 832 --num-frames 81 \
    --num-inference-steps 30 --guidance-scale 5.0 --seed 42 \
    --output outputs/wan21_vace14b_r2v_modelopt_fp8.mp4

Option B: calibrate from BF16 yourself (reproducibility / custom prompts)

# 1. Install
pip install 'nvidia-modelopt[all]'

# 2a. Calibrate Wan2.2-I2V-A14B (~25 min on 2×H100, MoE + image conditioning)
#     I2V requires ref images — WanImageToVideoPipeline takes `image` as a
#     required kwarg. --calib-boundary-ratio 0.5 boosts transformer's amax
#     sample on pass 1; pass 2 auto-restores production 0.875 for transformer_2.
python examples/quantization/quantize_wan2_2_modelopt_fp8.py \
    --model Wan-AI/Wan2.2-I2V-A14B-Diffusers \
    --output ./wan22-i2v-modelopt-fp8 \
    --is-i2v --reference-images /path/to/ref_images/ \
    --calib-boundary-ratio 0.5 --overwrite

# 2b. Calibrate HunyuanVideo-1.5 720p T2V (~20 min on 1×H100)
#     Native 720p resolution: --height 720 --width 1280. For the 720p I2V
#     variant, swap to ...-720p_i2v and add --variant i2v --reference-images.
python examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py \
    --model hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-720p_t2v \
    --output ./hv15-720p-modelopt-fp8 \
    --height 720 --width 1280 --overwrite

# 2c. Calibrate Wan2.1-VACE-14B (~25 min on 1×H100, single transformer with vace_blocks)
#     R2V mode: pass --reference-images so half the calibration samples become
#     R2V (prompt + ref image) — vace_blocks' amax then covers real ref-image
#     latents instead of zero-padded T2V-only inputs.
python examples/quantization/quantize_wan2_2_vace_modelopt_fp8.py \
    --model Wan-AI/Wan2.1-VACE-14B-diffusers \
    --output ./wan21-vace14b-modelopt-fp8 \
    --reference-images /path/to/ref_images/ \
    --overwrite

# 2d. (optional) Per-block weight quantization — better numerical fidelity, ~5–10%
#     extra latency. Block size hardcoded to 128x128 by upstream vLLM.
#     Append to any of the calls above:
#     --weight-block-size 128,128

# 3. (optional) Verify each export — checks quantization_config + on-disk FP8
#    dtype + per-tensor vs per-block weight_scale shapes.
python examples/quantization/check_modelopt_fp8_export.py --output ./wan22-i2v-modelopt-fp8
python examples/quantization/check_modelopt_fp8_export.py --output ./hv15-720p-modelopt-fp8
python examples/quantization/check_modelopt_fp8_export.py --output ./wan21-vace14b-modelopt-fp8

# 4. Serve — `--quantization fp8` is auto-upgraded to ModelOpt FP8 because
#    the checkpoint's transformer/config.json carries modelopt metadata.
#   (same invocation as Option A, just pass the local output path as --model)
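For intuition, a hedged sketch of the per-tensor vs per-block weight_scale shape check the verifier performs (assumptions: the *_scale tensor naming and exact logic are illustrative; the 128x128 block size is from this PR):

```python
import math

from safetensors import safe_open

def check_weight_scale(shard_path: str, weight_name: str, per_block: bool) -> bool:
    with safe_open(shard_path, framework="pt") as f:
        weight = f.get_tensor(weight_name)            # FP8 (float8_e4m3fn) weight, shape (M, N)
        scale = f.get_tensor(weight_name + "_scale")  # assumed scale-tensor naming
    if not per_block:
        return scale.numel() == 1                     # per-tensor: one scalar scale
    m, n = weight.shape
    # Per-block (FP8_PB_WO): one scale per 128x128 weight block.
    return tuple(scale.shape) == (math.ceil(m / 128), math.ceil(n / 128))
```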

Known limitations

  • HV-1.5 720p T2V's speedup is likely limited by enabling --vae-use-tiling; the wall-time and throughput improvements are nearly negligible.
  • HV-1.5 and Wan2.2 aren't in ModelOpt's recognized-model registry — QKV fusion is skipped and we hand-roll the weight-export path (a per-tensor export sketch follows this list).
  • MHA quantizers (K/V/softmax) are off by default because quantizing the attention computation amplifies FP8 drift.
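A minimal sketch of what a hand-rolled per-tensor FP8 export amounts to (illustrative only, not _force_export_quantized_weights itself): scale each weight by its amax so its range maps into float8_e4m3fn's ±448, then store the FP8 weight plus the scale.

```python
import torch

def export_fp8_per_tensor(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    finfo = torch.finfo(torch.float8_e4m3fn)           # max ≈ 448
    scale = weight.abs().amax().float() / finfo.max    # per-tensor scale from weight amax
    qweight = (weight.float() / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale                              # dequantize: qweight.float() * scale
```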

Follow-ups

  • YAML configuration for these model checkpoints
  • Publish calibrated checkpoints to HF Hub under vllm-project-org/

ArtificialRay marked this pull request as draft May 7, 2026 19:03
ArtificialRay marked this pull request as ready for review May 9, 2026 05:16
@lishunyang12 (Owner):

Thanks for your contribution. May I know on which device you tested ModelOpt FP8?

@ArtificialRay (Author):

> Thanks for your contribution. May I know on which device you tested ModelOpt FP8?

Thanks for the reply. I used an H100 80GB for the ModelOpt FP8 tests.

@lishunyang12 (Owner):

> Thanks for your contribution. May I know on which device you tested ModelOpt FP8?
>
> Thanks for the reply. I used an H100 80GB for the ModelOpt FP8 tests.

May I have your contact?

@ArtificialRay (Author):

You can contact me via WeChat: ArthurRay2333
