[Quant] Phase 1 (video-gen): ModelOpt FP8 for Wan2.2 TI2V-5B #2927

Closed

lishunyang12 wants to merge 9 commits into vllm-project:main from lishunyang12:modelopt-fp8-wan22

Conversation

@lishunyang12 (Collaborator)

Purpose

Phase 1 of #2709, same structure as #2924 — extends ModelOpt FP8 support to Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB H100). Uses the same loader infrastructure as #2913 (image-gen) and #2924 (HV-1.5).

Builds on: #2913 (ModelOpt FP8 adapter), #2920 (online-FP8 ablation), #2924 (HV-1.5 sibling).

Changes

DiT wiring (extracted from #2920, narrowed to Wan2.2)

  • `vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py` — `WanSelfAttention`, `WanCrossAttention`, `WanFeedForward` (+ `ColumnParallelGELU`), `WanTransformerBlock`, and `WanTransformer3DModel` thread `quant_config` / `prefix` to all attention + FFN linears (pattern sketched after this list). Modulation (`scale_shift_table`), patch embedding (Conv3d), time/text/image embedders, and `proj_out` stay full precision.
  • `wan2_2_vace_transformer.py` — threading inherited via `super().__init__()`.
  • `create_transformer_from_config` (`pipeline_wan2_2.py`) and `create_vace_transformer_from_config` (`pipeline_wan2_2_vace.py`) accept an optional `quant_config`.
  • All four pipelines (T2V / I2V / TI2V / VACE) pass `od_config.quantization_config`.
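
For illustration, a minimal sketch of the threading pattern, assuming a stand-in `linear_cls` factory (the real vllm-omni quantization-aware linear layers have different names and signatures): every quantizable linear receives `quant_config` plus a dotted `prefix` so the loader can bind per-layer FP8 scales, while modulation parameters stay full precision.

```python
import torch
import torch.nn as nn

def linear_cls(in_features, out_features, quant_config=None, prefix=""):
    # Stand-in for a quantization-aware linear: a real implementation would
    # consult quant_config to select an FP8 LinearMethod keyed by `prefix`.
    return nn.Linear(in_features, out_features)

class WanFeedForwardSketch(nn.Module):
    def __init__(self, dim, inner_dim, quant_config=None, prefix=""):
        super().__init__()
        # Both FFN projections are quantizable and get unique prefixes.
        self.fc_in = linear_cls(dim, inner_dim, quant_config, f"{prefix}.fc_in")
        self.fc_out = linear_cls(inner_dim, dim, quant_config, f"{prefix}.fc_out")
        self.act = nn.GELU()

    def forward(self, x):
        return self.fc_out(self.act(self.fc_in(x)))

class WanTransformerBlockSketch(nn.Module):
    def __init__(self, dim, quant_config=None, prefix=""):
        super().__init__()
        # quant_config / prefix thread down into every attention + FFN linear.
        self.ffn = WanFeedForwardSketch(dim, 4 * dim, quant_config,
                                        prefix=f"{prefix}.ffn")
        # Modulation stays full precision: plain parameter, no quant_config.
        self.scale_shift_table = nn.Parameter(torch.randn(6, dim) / dim**0.5)
```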

The aggressive skip patterns from #2920 (attn1/attn2 `quant_config=None`) are not applied here — that was an online-FP8 quality workaround for numerical drift on long attention sequences. Static calibration sets scales optimally from real activations, so we start with full attention + FFN FP8. If quality demands it, narrow via the calibrator's `_filter_func_wan22` regex.
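
If narrowing does become necessary, the shape of that filter might look like the sketch below. Both the pattern and the return-True-to-skip semantics are assumptions; the actual `_filter_func_wan22` lives in the calibration script.

```python
import re

# Assumed skip list, reconstructed from the layers the calibrator already
# keeps in high precision; add attn1/attn2 only if FP8 attention hurts quality.
WAN22_SKIP_PATTERN = re.compile(
    r"(condition_embedder|patch_embedding|proj_out|scale_shift_table)"
)

def _filter_func_wan22(name: str) -> bool:
    # Assumed semantics: True means "leave this layer in high precision".
    return WAN22_SKIP_PATTERN.search(name) is not None
```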

ModelOpt FP8 helpers

  • `examples/quantization/quantize_wan2_2_modelopt_fp8.py`: offline calibration script that exports a ModelOpt FP8 diffusers checkpoint for Wan2.2 TI2V-5B, patches `quant_algo: FP8` into `config.json`, and hides quantizers during save. Skips Wan2.2's precision-sensitive layers (`condition_embedder`, `patch_embedding`, `proj_out`, `scale_shift_table`, SP helpers); MHA quantizers off by default.
  • `vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml`: stage config for serving the calibrated checkpoint.

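At its core the calibration step presumably follows the standard ModelOpt flow sketched below; `calibrate_fp8` and `calib_loop` are illustrative names, and the real script adds the Wan2.2-specific skips and export patching described above.

```python
import modelopt.torch.quantization as mtq

def calibrate_fp8(transformer, calib_loop):
    # calib_loop should run a handful of real denoising steps so activation
    # quantizers observe representative ranges before scales are frozen.
    return mtq.quantize(transformer, mtq.FP8_DEFAULT_CFG, calib_loop)
```
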
How to use

```bash
# 1. Install
pip install 'nvidia-modelopt[all]'

# 2. Offline calibration (one-time, ~10-15 min on 1×H100)
python examples/quantization/quantize_wan2_2_modelopt_fp8.py \
    --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --output ./wan22-ti2v-modelopt-fp8 \
    --overwrite

# 3. (optional) Verify export
python examples/quantization/check_modelopt_fp8_export.py \
    --output ./wan22-ti2v-modelopt-fp8

# 4. Serve — auto-detect upgrades --quantization fp8 to ModelOpt FP8
python examples/offline_inference/text_to_video/text_to_video.py \
    --model ./wan22-ti2v-modelopt-fp8 \
    --quantization fp8 \
    --prompt "A dog running across a field of golden wheat." \
    --height 704 --width 1280 --num-frames 49 \
    --num-inference-steps 30 --seed 42 --guidance-scale 5.0 \
    --output outputs/wan22_modelopt_fp8.mp4
```
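
The auto-detect in step 4 presumably keys off the `quant_algo: FP8` block that calibration patches into `config.json`. A rough sketch, with a hypothetical helper name (the real logic lives in #2913's adapter):

```python
import json
import os

def detect_modelopt_fp8(checkpoint_dir: str) -> bool:
    # Hypothetical check mirroring the described auto-detect: a ModelOpt
    # checkpoint carries quant_algo: FP8 in its quantization_config.
    with open(os.path.join(checkpoint_dir, "config.json")) as f:
        cfg = json.load(f)
    return cfg.get("quantization_config", {}).get("quant_algo") == "FP8"
```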

Validation

(To be filled in after H100 calibration — same procedure as #2924. Expected: BF16-equivalent visual quality, ~15% weight memory reduction, ~10% wall-clock speedup.)

Test Plan

  • Calibration script completes on 1×H100 80GB (target: Wan-AI/Wan2.2-TI2V-5B-Diffusers)
  • `check_modelopt_fp8_export.py` reports `quant_algo: FP8` and FP8 weights on disk (see the dtype-scan sketch after this list)
  • Resulting checkpoint loads via #2913's ModelOpt FP8 adapter (`Auto-detected quantization 'modelopt'`)
  • `Selected ...Fp8...LinearMethod` in serving log — correct kernel
  • End-to-end inference produces valid video on 704×1280, 49 frames
  • Visual comparison BF16 vs ModelOpt FP8 — output indistinguishable
  • Memory + speed numbers measured vs BF16 baseline
  • Pre-commit (ruff, format, typos) — passing locally
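
A minimal sketch of the "FP8 weights on disk" check referenced above; this is illustrative, not the actual `check_modelopt_fp8_export.py`, and the shard path assumes a diffusers-style layout.

```python
import glob

import torch
from safetensors import safe_open

def count_fp8_tensors(ckpt_dir: str) -> int:
    # Scan safetensors shards and count float8_e4m3fn weights; a healthy
    # export should show FP8 tensors for every quantized linear.
    n = 0
    for shard in glob.glob(f"{ckpt_dir}/transformer/*.safetensors"):
        with safe_open(shard, framework="pt") as f:
            for key in f.keys():
                if f.get_tensor(key).dtype == torch.float8_e4m3fn:
                    n += 1
    return n
```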

Follow-ups (still Phase 1, other video/variant coverage)

Depends on #2913. References #2920 (online-FP8 ablation) and #2924 (HV-1.5 sibling).

cc @baonudesifeizhai @hsliuustc0106 @ArtificialRay

roG0d and others added 9 commits April 20, 2026 05:30
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
…V-5B

examples/quantization/quantize_wan2_2_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB BF16). Same design
  as the HunyuanVideo-1.5 calibrator (vllm-project#2924): force-export FP8 weights, patch
  quant_algo: FP8 into config.json, hide quantizers during save.
  Skips Wan2.2's precision-sensitive layers (condition_embedder, patch_embedding,
  proj_out, scale_shift_table, SP helpers). MHA quantizers off by default.

vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni.

Signed-off-by: lishunyang <lishunyang12@163.com>
@lishunyang12 (Collaborator, Author)

Folding Wan2.2 into #2924 — that PR is titled 'Phase 1 (video-gen)' so it should cover both HV-1.5 and Wan2.2. The two commits from this branch have been cherry-picked onto modelopt-fp8-hv15.
