[Quant] Phase 1 (video-gen): ModelOpt FP8 for Wan2.2 TI2V-5B #2927

Closed

lishunyang12 wants to merge 9 commits into vllm-project:main from lishunyang12:modelopt-fp8-wan22

Conversation

@lishunyang12 (Collaborator)

Purpose

Phase 1 of #2709, same structure as #2924 — extends ModelOpt FP8 support to Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB H100). Uses the same loader infrastructure as #2913 (image-gen) and #2924 (HV-1.5).

Builds on: #2913 (ModelOpt FP8 adapter), #2920 (online-FP8 ablation), #2924 (HV-1.5 sibling).

Changes

DiT wiring (extracted from #2920, narrowed to Wan2.2)

  • `vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py` — `WanSelfAttention`, `WanCrossAttention`, `WanFeedForward` (+ `ColumnParallelGELU`), `WanTransformerBlock`, and `WanTransformer3DModel` thread `quant_config` / `prefix` to all attention + FFN linears (pattern sketched after this list). Modulation (`scale_shift_table`), patch embedding (Conv3d), time/text/image embedders, and `proj_out` stay full precision.
  • `wan2_2_vace_transformer.py` — threading inherited via `super().__init__()`.
  • `create_transformer_from_config` (`pipeline_wan2_2.py`) and `create_vace_transformer_from_config` (`pipeline_wan2_2_vace.py`) accept an optional `quant_config`.
  • All four pipelines (T2V / I2V / TI2V / VACE) pass `od_config.quantization_config`.
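
For illustration, a minimal sketch of the threading pattern, assuming a stand-in `linear_cls` factory (the real vllm-omni quantization-aware linear layers have different names and signatures): every quantizable linear receives `quant_config` plus a dotted `prefix` so the loader can bind per-layer FP8 scales, while modulation parameters stay full precision.

```python
import torch
import torch.nn as nn

def linear_cls(in_features, out_features, quant_config=None, prefix=""):
    # Stand-in for a quantization-aware linear: a real implementation would
    # consult quant_config to select an FP8 LinearMethod keyed by `prefix`.
    return nn.Linear(in_features, out_features)

class WanFeedForwardSketch(nn.Module):
    def __init__(self, dim, inner_dim, quant_config=None, prefix=""):
        super().__init__()
        # Both FFN projections are quantizable and get unique prefixes.
        self.fc_in = linear_cls(dim, inner_dim, quant_config, f"{prefix}.fc_in")
        self.fc_out = linear_cls(inner_dim, dim, quant_config, f"{prefix}.fc_out")
        self.act = nn.GELU()

    def forward(self, x):
        return self.fc_out(self.act(self.fc_in(x)))

class WanTransformerBlockSketch(nn.Module):
    def __init__(self, dim, quant_config=None, prefix=""):
        super().__init__()
        # quant_config / prefix thread down into every attention + FFN linear.
        self.ffn = WanFeedForwardSketch(dim, 4 * dim, quant_config,
                                        prefix=f"{prefix}.ffn")
        # Modulation stays full precision: plain parameter, no quant_config.
        self.scale_shift_table = nn.Parameter(torch.randn(6, dim) / dim**0.5)
```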

The aggressive skip patterns from #2920 (attn1/attn2 `quant_config=None`) are not applied here — that was an online-FP8 quality workaround for numerical drift on long attention sequences. Static calibration sets scales optimally from real activations, so we start with full attention + FFN FP8. If quality demands it, narrow via the calibrator's `_filter_func_wan22` regex.
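
If narrowing does become necessary, the shape of that filter might look like the sketch below. Both the pattern and the return-True-to-skip semantics are assumptions; the actual `_filter_func_wan22` lives in the calibration script.

```python
import re

# Assumed skip list, reconstructed from the layers the calibrator already
# keeps in high precision; add attn1/attn2 only if FP8 attention hurts quality.
WAN22_SKIP_PATTERN = re.compile(
    r"(condition_embedder|patch_embedding|proj_out|scale_shift_table)"
)

def _filter_func_wan22(name: str) -> bool:
    # Assumed semantics: True means "leave this layer in high precision".
    return WAN22_SKIP_PATTERN.search(name) is not None
```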

ModelOpt FP8 helpers

  • `examples/quantization/quantize_wan2_2_modelopt_fp8.py`: offline calibration script that exports a ModelOpt FP8 diffusers checkpoint for Wan2.2 TI2V-5B, patches `quant_algo: FP8` into `config.json`, and hides quantizers during save. Skips Wan2.2's precision-sensitive layers (`condition_embedder`, `patch_embedding`, `proj_out`, `scale_shift_table`, SP helpers); MHA quantizers off by default.
  • `vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml`: stage config for serving the calibrated checkpoint.

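At its core the calibration step presumably follows the standard ModelOpt flow sketched below; `calibrate_fp8` and `calib_loop` are illustrative names, and the real script adds the Wan2.2-specific skips and export patching described above.

```python
import modelopt.torch.quantization as mtq

def calibrate_fp8(transformer, calib_loop):
    # calib_loop should run a handful of real denoising steps so activation
    # quantizers observe representative ranges before scales are frozen.
    return mtq.quantize(transformer, mtq.FP8_DEFAULT_CFG, calib_loop)
```
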
How to use

```bash
# 1. Install
pip install 'nvidia-modelopt[all]'

# 2. Offline calibration (one-time, ~10-15 min on 1×H100)
python examples/quantization/quantize_wan2_2_modelopt_fp8.py \
    --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --output ./wan22-ti2v-modelopt-fp8 \
    --overwrite

# 3. (optional) Verify export
python examples/quantization/check_modelopt_fp8_export.py \
    --output ./wan22-ti2v-modelopt-fp8

# 4. Serve — auto-detect upgrades --quantization fp8 to ModelOpt FP8
python examples/offline_inference/text_to_video/text_to_video.py \
    --model ./wan22-ti2v-modelopt-fp8 \
    --quantization fp8 \
    --prompt "A dog running across a field of golden wheat." \
    --height 704 --width 1280 --num-frames 49 \
    --num-inference-steps 30 --seed 42 --guidance-scale 5.0 \
    --output outputs/wan22_modelopt_fp8.mp4
```
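
The auto-detect in step 4 presumably keys off the `quant_algo: FP8` block that calibration patches into `config.json`. A rough sketch, with a hypothetical helper name (the real logic lives in #2913's adapter):

```python
import json
import os

def detect_modelopt_fp8(checkpoint_dir: str) -> bool:
    # Hypothetical check mirroring the described auto-detect: a ModelOpt
    # checkpoint carries quant_algo: FP8 in its quantization_config.
    with open(os.path.join(checkpoint_dir, "config.json")) as f:
        cfg = json.load(f)
    return cfg.get("quantization_config", {}).get("quant_algo") == "FP8"
```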

Validation

(To be filled in after H100 calibration — same procedure as #2924. Expected: BF16-equivalent visual quality, ~15% weight memory reduction, ~10% wall-clock speedup.)

Test Plan

  • Calibration script completes on 1×H100 80GB (target: Wan-AI/Wan2.2-TI2V-5B-Diffusers)
  • `check_modelopt_fp8_export.py` reports `quant_algo: FP8` and FP8 weights on disk (see the dtype-scan sketch after this list)
  • Resulting checkpoint loads via #2913's ModelOpt FP8 adapter (`Auto-detected quantization 'modelopt'`)
  • `Selected ...Fp8...LinearMethod` in serving log — correct kernel
  • End-to-end inference produces valid video on 704×1280, 49 frames
  • Visual comparison BF16 vs ModelOpt FP8 — output indistinguishable
  • Memory + speed numbers measured vs BF16 baseline
  • Pre-commit (ruff, format, typos) — passing locally
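
A minimal sketch of the "FP8 weights on disk" check referenced above; this is illustrative, not the actual `check_modelopt_fp8_export.py`, and the shard path assumes a diffusers-style layout.

```python
import glob

import torch
from safetensors import safe_open

def count_fp8_tensors(ckpt_dir: str) -> int:
    # Scan safetensors shards and count float8_e4m3fn weights; a healthy
    # export should show FP8 tensors for every quantized linear.
    n = 0
    for shard in glob.glob(f"{ckpt_dir}/transformer/*.safetensors"):
        with safe_open(shard, framework="pt") as f:
            for key in f.keys():
                if f.get_tensor(key).dtype == torch.float8_e4m3fn:
                    n += 1
    return n
```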

Follow-ups (still Phase 1, other video/variant coverage)

Depends on #2913. References #2920 (online-FP8 ablation) and #2924 (HV-1.5 sibling).

cc @baonudesifeizhai @hsliuustc0106 @ArtificialRay

roG0d and others added 9 commits April 20, 2026 05:30
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
…V-5B

examples/quantization/quantize_wan2_2_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB BF16). Same design
  as the HunyuanVideo-1.5 calibrator (vllm-project#2924): force-export FP8 weights, patch
  quant_algo: FP8 into config.json, hide quantizers during save.
  Skips Wan2.2's precision-sensitive layers (condition_embedder, patch_embedding,
  proj_out, scale_shift_table, SP helpers). MHA quantizers off by default.

vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni.

Signed-off-by: lishunyang <lishunyang12@163.com>
@lishunyang12 (Collaborator, Author)

Folding Wan2.2 into #2924 — that PR is titled 'Phase 1 (video-gen)' so it should cover both HV-1.5 and Wan2.2. The two commits from this branch have been cherry-picked onto modelopt-fp8-hv15.
