[Quant] Phase 1 (video-gen): ModelOpt FP8 for Wan2.2 TI2V-5B #2927
Closed
lishunyang12 wants to merge 9 commits
Conversation
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
… (vllm-project#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here; that was an online-FP8 quality workaround, and static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
… TI2V-5B

examples/quantization/quantize_wan2_2_modelopt_fp8.py: offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for Wan2.2 TI2V-5B (the dense 5B variant that fits in 80 GB at BF16). Same design as the HunyuanVideo-1.5 calibrator (vllm-project#2924): force-export FP8 weights, patch quant_algo: FP8 into config.json, hide quantizers during save. Skips Wan2.2's precision-sensitive layers (condition_embedder, patch_embedding, proj_out, scale_shift_table, SP helpers). MHA quantizers off by default.

vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml: stage config for serving the calibrated checkpoint via vllm-omni.

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 (Collaborator, Author):
Folding Wan2.2 into #2924: that PR is titled "Phase 1 (video-gen)", so it should cover both HV-1.5 and Wan2.2. The two commits from this branch have been cherry-picked onto it.
Purpose
Phase 1 of #2709, with the same structure as #2924: extends ModelOpt FP8 support to Wan2.2 TI2V-5B (the dense 5B variant that fits on an 80 GB H100 in BF16). Uses the same loader infrastructure as #2913 (image-gen) and #2924 (HV-1.5).
Builds on:
- #2913: ModelOpt FP8 loader infrastructure (image-gen)
- #2924: HunyuanVideo-1.5 sibling PR (Phase 1, video-gen)
- #2920: online-FP8 ablation the DiT wiring was extracted from
Changes
DiT wiring (extracted from #2920, narrowed to Wan2.2)
`vllm_omni/diffusion/models/wan2_2/wan2_2_transformer.py`: `WanSelfAttention`, `WanCrossAttention`, `WanFeedForward` (+ `ColumnParallelGELU`), `WanTransformerBlock`, and `WanTransformer3DModel` thread `quant_config` / `prefix` to all attention + FFN linears. Modulation (`scale_shift_table`), patch embedding (Conv3d), time/text/image embedders, and `proj_out` stay full precision.

The aggressive skip patterns from #2920 (attn1/attn2 `quant_config=None`) are not applied here; that was an online-FP8 quality workaround for numerical drift on long attention sequences. Static calibration sets scales from real activations, so we start with full attention + FFN FP8. If quality demands it, narrow coverage via the calibrator's `_filter_func_wan22` regex.
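For reference, a minimal sketch of that threading pattern, assuming vLLM's parallel linear layers and illustrative shapes and names (the real `WanFeedForward` and its `ColumnParallelGELU` in this PR may differ):

```python
# Minimal sketch of the quant_config / prefix threading pattern.
# Shapes, activation, and constructor details are assumptions; the real
# WanFeedForward in wan2_2_transformer.py may differ.
from typing import Optional

import torch
import torch.nn as nn
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               RowParallelLinear)
from vllm.model_executor.layers.quantization import QuantizationConfig


class WanFeedForwardSketch(nn.Module):
    def __init__(self, dim: int, inner_dim: int,
                 quant_config: Optional[QuantizationConfig] = None,
                 prefix: str = ""):
        super().__init__()
        # quant_config plus a unique per-layer prefix is what lets the
        # ModelOpt FP8 adapter (#2913) bind this layer's static scales
        # at load time; quant_config=None keeps the layer full precision.
        self.up_proj = ColumnParallelLinear(dim, inner_dim,
                                            quant_config=quant_config,
                                            prefix=f"{prefix}.up_proj")
        self.act = nn.GELU(approximate="tanh")
        self.down_proj = RowParallelLinear(inner_dim, dim,
                                           quant_config=quant_config,
                                           prefix=f"{prefix}.down_proj")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x, _ = self.up_proj(x)    # vLLM parallel linears return (out, bias)
        x = self.act(x)
        x, _ = self.down_proj(x)
        return x
```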
ModelOpt FP8 helpers
- `examples/quantization/quantize_wan2_2_modelopt_fp8.py`: offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint (force-exports FP8 weights, patches `quant_algo: FP8` into config.json, hides quantizers during save; skips the precision-sensitive layers listed above, MHA quantizers off by default).
- `vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml`: stage config for serving the calibrated checkpoint via vllm-omni.
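The calibration itself follows the standard ModelOpt PTQ recipe. A minimal sketch, assuming the diffusers `WanPipeline` API and illustrative wildcard skip patterns (the script's `_filter_func_wan22` regex is the source of truth, and the real helper additionally handles the export and config.json steps above):

```python
# Minimal sketch of the offline FP8 calibration step. The real
# quantize_wan2_2_modelopt_fp8.py also force-exports FP8 weights,
# patches quant_algo: FP8 into config.json, and hides quantizers on save.
import copy

import modelopt.torch.quantization as mtq
import torch
from diffusers import WanPipeline  # assumed pipeline class for Wan2.2

pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Start from ModelOpt's default FP8 recipe, then disable quantizers on
# Wan2.2's precision-sensitive layers (these wildcard patterns are
# assumptions; the PR's _filter_func_wan22 regex is authoritative).
quant_cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
for pattern in ("*condition_embedder*", "*patch_embedding*",
                "*proj_out*", "*scale_shift_table*"):
    quant_cfg["quant_cfg"][pattern] = {"enable": False}

def forward_loop(model):
    # `model` is pipe.transformer; running the pipeline on a real prompt
    # lets activation quantizers observe representative ranges
    # (prompt/step/frame counts here are illustrative).
    pipe(prompt="A dog running across a field of golden wheat.",
         num_inference_steps=4, num_frames=9)

pipe.transformer = mtq.quantize(pipe.transformer, quant_cfg, forward_loop)
```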
How to use
```bash
# 1. Install ModelOpt
pip install 'nvidia-modelopt[all]'

# 2. Offline calibration (one-time, ~10-15 min on 1×H100)
python examples/quantization/quantize_wan2_2_modelopt_fp8.py \
    --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --output ./wan22-ti2v-modelopt-fp8 \
    --overwrite

# 3. (optional) Verify the export
python examples/quantization/check_modelopt_fp8_export.py \
    --output ./wan22-ti2v-modelopt-fp8

# 4. Serve: auto-detect upgrades --quantization fp8 to ModelOpt FP8
python examples/offline_inference/text_to_video/text_to_video.py \
    --model ./wan22-ti2v-modelopt-fp8 \
    --quantization fp8 \
    --prompt "A dog running across a field of golden wheat." \
    --height 704 --width 1280 --num-frames 49 \
    --num-inference-steps 30 --seed 42 --guidance-scale 5.0 \
    --output outputs/wan22_modelopt_fp8.mp4
```
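The step-4 auto-detect keys off the `quant_algo` field the calibrator patches into the checkpoint's config.json. A rough sketch of that decision, assuming the field lands under `quantization_config` (the actual logic lives in the #2913 loader; the function below is hypothetical):

```python
# Hypothetical illustration of the step-4 auto-detect; the real logic
# is in the ModelOpt FP8 loader from #2913. The calibrator writes
# quant_algo: FP8 into config.json, which lets a plain
# --quantization fp8 request be upgraded to the static ModelOpt path.
import json
from pathlib import Path

def resolve_quantization(model_path: str, requested: str) -> str:
    config = json.loads((Path(model_path) / "config.json").read_text())
    quant_algo = (config.get("quantization_config") or {}).get("quant_algo")
    if requested == "fp8" and quant_algo == "FP8":
        return "modelopt"   # pre-calibrated checkpoint: use static scales
    return requested        # otherwise fall back to online FP8
```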
Validation
(To be filled in after H100 calibration — same procedure as #2924. Expected: BF16-equivalent visual quality, ~15% weight memory reduction, ~10% wall-clock speedup.)
Test Plan
Follow-ups (still Phase 1, other video/variant coverage)
Depends on #2913. References #2920 (online-FP8 ablation) and #2924 (HV-1.5 sibling).
cc @baonudesifeizhai @hsliuustc0106 @ArtificialRay