
[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5 and Wan2.2 #2924

Draft — lishunyang12 wants to merge 23 commits into vllm-project:main from lishunyang12:modelopt-fp8-hv15

Conversation


@lishunyang12 (Collaborator) commented Apr 19, 2026

Purpose

Phase 1 of #2709 extends ModelOpt FP8 support to video-gen models. #2913 covers Phase 1 for image-gen (Flux, Flux2-Klein, Qwen-Image, HunyuanImage-3); this PR adds the video-gen counterpart for both HunyuanVideo-1.5 and Wan2.2 TI2V-5B, using the same loader infrastructure.

Builds on: #2913 (ModelOpt FP8 auto-detect adapter for diffusion checkpoints) and the DiT wiring extracted from #2920.

Changes

DiT wiring (extracted from #2920)

  • hunyuan_video_15_transformer.py + pipelines — HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, HunyuanVideo15Transformer3DModel accept quant_config / prefix; threaded to to_qkv, to_out[0], add_kv_proj, to_add_out, ff, ff_context.
  • wan2_2_transformer.py + wan2_2_vace_transformer.py + 4 pipelines — WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, WanTransformer3DModel, VACE variant. Factories (create_transformer_from_config, create_vace_transformer_from_config) accept optional quant_config.
  • Modulation (raw nn.Linear / scale_shift_table), patch embedders (Conv3d), time/text/image embedders, proj_out, and the HV-1.5 token refiner stay full precision.
  • The aggressive skip patterns from #2920 ([Quant] Wire quant_config through HunyuanVideo-1.5 and Wan2.2 DiT for online FP8), i.e. attn1/attn2 quant_config=None on Wan2.2, are not applied here — that was an online-FP8 workaround; static calibration handles it. A minimal wiring sketch follows this list.
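For orientation, a minimal sketch of the wiring pattern — assuming vLLM-style parallel linear layers that accept quant_config and prefix; the actual vllm-omni class names and layer layout may differ:

```python
# Illustrative only: names follow upstream vLLM's parallel linear layers;
# the actual vllm-omni attention module differs in detail.
import torch.nn as nn
from vllm.model_executor.layers.linear import ColumnParallelLinear, RowParallelLinear


class HunyuanVideo15Attention(nn.Module):
    def __init__(self, dim: int, quant_config=None, prefix: str = ""):
        super().__init__()
        # Quantizable projections: quant_config plus a unique prefix lets the
        # ModelOpt adapter bind the per-layer FP8 scales at load time.
        self.to_qkv = ColumnParallelLinear(dim, 3 * dim, bias=True,
                                           quant_config=quant_config,
                                           prefix=f"{prefix}.to_qkv")
        self.to_out = RowParallelLinear(dim, dim, bias=True,
                                        quant_config=quant_config,
                                        prefix=f"{prefix}.to_out.0")
        # Precision-sensitive modulation stays a raw nn.Linear: no
        # quant_config, so it is never replaced by an FP8 linear method.
        self.modulation = nn.Linear(dim, 6 * dim)
```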

ModelOpt FP8 helpers

  • examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py — HV-1.5 calibrator. Force-exports FP8 weights, patches quant_algo: FP8, hides quantizers during save. MHA quantizers off by default.
  • examples/quantization/quantize_wan2_2_modelopt_fp8.py — Wan2.2 TI2V-5B calibrator. Same design.
  • examples/quantization/check_modelopt_fp8_export.py — verifier. Reads safetensors header dtypes, checks quant_algo: FP8, classifies scale granularity (per-tensor / per-channel / per-block).
  • vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml + wan2_2_ti2v_dit_fp8.yaml — serving stage configs with auto-detect.

Adapter (this PR also fixes a general-purpose bug in #2913's adapter):

  • modelopt_fp8.py:_get_weights_mapper now walks submodules to aggregate hf_to_vllm_mapper from whichever sub-module defines it. The adapter is instantiated with the whole Pipeline, so model-specific remaps (like Wan2.2's ffn.net.0. → ffn.net_0.) must be discovered on the transformer submodule, not the top-level Pipeline. Fixes the silent-noise output that occurred on Wan2.2 ModelOpt FP8 before this change (see the sketch after this list).
  • WanTransformer3DModel.hf_to_vllm_mapper added with that remap.
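A rough sketch of that aggregation, assuming upstream vLLM's WeightsMapper with substring maps (function body and import path are illustrative, not the exact adapter code):

```python
# Sketch only — import path as in upstream vLLM; the adapter's actual
# helper may differ in structure.
from vllm.model_executor.models.utils import WeightsMapper


def _get_weights_mapper(model) -> WeightsMapper:
    """Aggregate hf_to_vllm_mapper from the model and all of its submodules."""
    substr_map: dict[str, str] = {}
    candidates = [model]
    if hasattr(model, "named_modules"):
        candidates += [m for _, m in model.named_modules()]
    for module in candidates:
        mapper = getattr(module, "hf_to_vllm_mapper", None)
        if mapper is not None:
            # e.g. WanTransformer3DModel remaps ".ffn.net.0." -> ".ffn.net_0."
            substr_map.update(mapper.orig_to_new_substr or {})
    return WeightsMapper(orig_to_new_substr=substr_map)
```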

Both calibrators share --weight-block-size 'M,N' for block-wise FP8, and the same fallback pattern: _force_export_quantized_weights + _patch_quant_config + hide_quantizers_from_state_dict — because ModelOpt's export_hf_checkpoint doesn't handle diffusers-video checkpoints natively.
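Of the three helpers, the config patch is the easiest to show. A sketch of what it writes — only quant_algo, strategy, and block_structure are confirmed by this PR's description; the surrounding nesting is an assumption:

```python
import json
from pathlib import Path


def _patch_quant_config(output_dir: str, block_structure: str | None = None) -> None:
    """Inject ModelOpt FP8 metadata into transformer/config.json (sketch)."""
    cfg_path = Path(output_dir) / "transformer" / "config.json"
    cfg = json.loads(cfg_path.read_text())
    qcfg = cfg.setdefault("quantization_config", {})
    qcfg["quant_algo"] = "FP8"  # the key vllm-omni's auto-detect looks for
    weights_cfg = {
        "num_bits": 8,
        "type": "float",
        # per-tensor by default; --weight-block-size switches to block-wise
        "strategy": "block" if block_structure else "tensor",
    }
    if block_structure:  # e.g. "128x128"
        weights_cfg["block_structure"] = block_structure
    qcfg["config_groups"] = {"group_0": {"weights": weights_cfg}}
    cfg_path.write_text(json.dumps(cfg, indent=2))
```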

Validation — HunyuanVideo-1.5 (1×H100 80GB, T2V 480×832, 33 frames, 30 steps, seed=42)

torch.compile enabled (default).

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Model load | 33.81 GiB | 28.74 GiB | −15% |
| Peak GPU memory (allocated) | 72.42 GiB | 67.36 GiB | −7% |
| Total wall time | 24.05 s | 20.79 s | −14% |
| Throughput | 1.44 it/s | 1.67 it/s | +16% |
| On-disk transformer weights | 31.02 GiB | 10.45 GiB | −66% |

Engine signals confirming the path is wired correctly:

  • factory.py: Building quantization config: fp8 → Building quantization config: modelopt — auto-detect upgraded the user's --quantization fp8 flag to ModelOpt based on quant_algo: FP8 in transformer/config.json
  • data.py: Auto-detected quantization 'modelopt' from model config
  • __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod — the ModelOpt FP8 kernel selected

Visual comparison — HunyuanVideo-1.5

BF16 baseline:

hv15_bf16_compiled.mp4

ModelOpt FP8 (this PR):

hv15_modelopt_fp8_compiled.mp4

Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is BF16-equivalent — no detail collapse or composition drift like the online FP8 path showed in #2920.

Validation — Wan2.2 TI2V-5B (1×H100 80GB, T2V 704×1280, 49 frames, 30 steps, seed=42)

torch.compile enabled (default).

| Metric | BF16 baseline | ModelOpt FP8 (this PR) | Delta |
| --- | --- | --- | --- |
| Model load | 21.22 GiB | 16.68 GiB | −21% |
| Peak GPU memory (allocated) | 38.28 GiB | 33.74 GiB | −12% |
| Total wall time | 19.75 s | 16.63 s | −16% |
| Throughput | 1.96 it/s | 2.45 it/s | +25% |
| On-disk transformer weights | 21.22 GiB | 4.76 GiB | −77% |

Engine signals:

  • factory.py: Building quantization config: fp8 → Building quantization config: modelopt (auto-detect fired)
  • data.py: Auto-detected quantization 'modelopt' from model config
  • __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
  • Zero unloaded weight_scale warnings after the hf_to_vllm_mapper fix for Wan2.2's ffn.net.0. → ffn.net_0. diffusers↔vllm-omni name remap.

Visual comparison — Wan2.2 TI2V-5B

BF16 baseline:

wan22_bf16_v4.mp4

ModelOpt FP8 (this PR):

wan22_modelopt_fp8_v4.mp4

Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is BF16-equivalent.

How to use

Pre-calibrated checkpoints are published on HF Hub so reviewers can test without recalibrating:

Option A: use the published checkpoints (no calibration needed)

# HunyuanVideo-1.5
python examples/offline_inference/text_to_video/text_to_video.py \
    --model shunyang90/HunyuanVideo-1.5-480p-ModelOpt-FP8 \
    --quantization fp8 \
    --prompt "A dog running across a field of golden wheat." \
    --height 480 --width 832 --num-frames 33 \
    --num-inference-steps 30 --seed 42 --guidance-scale 6.0 \
    --output outputs/hv15_modelopt_fp8.mp4

# Wan2.2 TI2V-5B
python examples/offline_inference/text_to_video/text_to_video.py \
    --model shunyang90/Wan2.2-TI2V-5B-ModelOpt-FP8 \
    --quantization fp8 \
    --prompt "A dog running across a field of golden wheat." \
    --height 704 --width 1280 --num-frames 49 \
    --num-inference-steps 30 --seed 42 --guidance-scale 5.0 \
    --output outputs/wan22_modelopt_fp8.mp4

Option B: calibrate from BF16 yourself (reproducibility / custom prompts)

# 1. Install
pip install 'nvidia-modelopt[all]'

# 2a. Calibrate HV-1.5 (~10–15 min on 1×H100)
python examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py \
    --model hunyuanvideo-community/HunyuanVideo-1.5-Diffusers-480p_t2v \
    --output ./hv15-480p-modelopt-fp8 --overwrite

# 2b. Calibrate Wan2.2 TI2V-5B (~10 min on 1×H100)
python examples/quantization/quantize_wan2_2_modelopt_fp8.py \
    --model Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --output ./wan22-ti2v-modelopt-fp8 --overwrite

# 3. (optional) Verify
python examples/quantization/check_modelopt_fp8_export.py --output ./hv15-480p-modelopt-fp8
python examples/quantization/check_modelopt_fp8_export.py --output ./wan22-ti2v-modelopt-fp8

# 4. Serve — auto-detect upgrades --quantization fp8 to ModelOpt FP8
# (same invocation as Option A, just pass the local output path as --model)

Test Plan

HunyuanVideo-1.5

  • Calibration script completes on 1×H100 — 648 weights converted to FP8
  • Checker reports quant_algo: FP8, 648 F8_E4M3 tensors, per-tensor scale granularity
  • On-disk transformer 10.45 GiB (−66% vs 31.02 GiB BF16)
  • Loads via #2913's ModelOpt FP8 auto-detect adapter for diffusion checkpoints (Phase 1 of #2709) — Auto-detected quantization 'modelopt'
  • End-to-end inference produces valid video; visual parity with BF16
  • Memory −15%, wall-clock −14% vs BF16

Wan2.2 TI2V-5B

  • Calibration script completes on 1×H100 — 300 weights converted to FP8
  • Checker reports quant_algo: FP8, 300 F8_E4M3 tensors, per-tensor scale granularity
  • Loads cleanly via #2913's ModelOpt FP8 auto-detect adapter for diffusion checkpoints (after the hf_to_vllm_mapper fix — see the adapter change above)
  • End-to-end inference produces valid video; visual parity with BF16
  • Memory −21%, wall-clock −16% vs BF16
  • On-disk transformer 4.76 GiB (−77% vs 21.22 GiB BF16)

Both

  • Pre-commit (ruff, format, typos) — passing
  • torch.compile enabled (default) on both BF16 and FP8 for fair comparison
  • HV-1.5 I2V variant + Wan2.2 I2V / T2V-A14B / VACE — wiring threaded, calibration untested

Known limitations

  • Per-block static FP8 calibrates correctly but is not yet servable. Upstream vLLM's ModelOptFp8Config / ModelOptFp8LinearMethod only dispatches per-tensor scales — a block-wise checkpoint crashes at load with a shape-mismatch assertion in parameter.py:_assert_and_load. Per-tensor serving is the shippable path; --weight-block-size is kept in the calibrator for when upstream gains block-wise dispatch.
  • HV-1.5 and Wan2.2 aren't in ModelOpt's recognized-model registry — QKV fusion is skipped and we hand-roll the weight-export path. Works, but means fewer of ModelOpt's standard diffusion optimizations.
  • MHA quantizers (K/V/softmax) off by default — attention numerics on long video sequences were sensitive even with static scales (shown empirically in the #2920 ablation).

Follow-ups (still Phase 1, other video/variant coverage)

  • Wan2.2 T2V-A14B / I2V-A14B MoE variants (need 2×H100)
  • Wan2.2 VACE variant (wiring threaded; calibration helper needs VACE-specific prompts)
  • HunyuanVideo-1.5 720p + I2V variants
  • Block-wise static FP8 serving once upstream vLLM dispatches on strategy: block
  • Publish calibrated checkpoints to HF Hub under vllm-project-org/

Depends on #2913. References #2920 (online-FP8 ablation; will not merge).

cc @baonudesifeizhai @hsliuustc0106 @ArtificialRay

roG0d and others added 9 commits April 20, 2026 03:03
…vllm-project#2920)

Threads quant_config / prefix through HunyuanVideo15Attention,
HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so
the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales.
Modulation, embeddings, proj_out stay raw nn.Linear (full precision).

Signed-off-by: lishunyang <lishunyang12@163.com>
…eo-1.5

examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps,
  skips precision-sensitive layers (modulation, embeddings, output proj,
  token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers
  by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920).

vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects
  ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter).

Signed-off-by: lishunyang <lishunyang12@163.com>
@lishunyang12 lishunyang12 changed the title from "[Quant] ModelOpt FP8 for HunyuanVideo-1.5 (Phase 2 of #2709)" to "[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5" on Apr 19, 2026
…ce_scale kwarg unsupported)

HV-1.5's diffusers pipeline uses the new Guider abstraction (guider_config.json
in the checkpoint) rather than a guidance_scale kwarg. Try setting it on the
guider object once up front; in the per-prompt call, try with guidance_scale
first and fall back without it on TypeError. Calibration only needs amax stats,
so the exact CFG value isn't critical.
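Roughly (a hypothetical helper sketching the calibration loop's fallback, not the exact script):

```python
def run_calibration(pipe, prompts, steps: int = 10, guidance_scale: float = 6.0):
    """Drive the pipeline so ModelOpt can collect amax stats (sketch)."""
    guider = getattr(pipe, "guider", None)
    if guider is not None and hasattr(guider, "guidance_scale"):
        # New diffusers Guider abstraction: set CFG once, up front.
        guider.guidance_scale = guidance_scale
    for prompt in prompts:
        try:
            pipe(prompt=prompt, num_inference_steps=steps,
                 guidance_scale=guidance_scale)
        except TypeError:
            # The pipeline rejects the kwarg (guider-config checkpoints).
            # Calibration only needs activation stats, so just drop it.
            pipe(prompt=prompt, num_inference_steps=steps)
```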

Signed-off-by: lishunyang <lishunyang12@163.com>
Three checks: (A) transformer/config.json has sane quantization_config,
(B) safetensors contain FP8 tensors, (C) optional disk-size delta vs BF16.
Run after the quantize_*_modelopt_fp8.py scripts to spot issues before
attempting to serve.

Signed-off-by: lishunyang <lishunyang12@163.com>
…or view)

torch's get_tensor() returns FP8 storage as bf16 views on some safetensors
versions, giving false negatives. Read the on-disk dtype from the header
directly — that's what actually determines whether the checkpoint is FP8.
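For reference, a minimal header read looks like this (the safetensors format is an 8-byte little-endian length followed by a JSON header; the shard path below is illustrative):

```python
import json
import struct


def read_safetensors_dtypes(path: str) -> dict[str, str]:
    """Return {tensor_name: on-disk dtype} straight from the file header."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte LE length
        header = json.loads(f.read(header_len))
    return {name: meta["dtype"] for name, meta in header.items()
            if name != "__metadata__"}


# Example: count FP8 tensors in one shard (path is illustrative).
dtypes = read_safetensors_dtypes("transformer/diffusion_pytorch_model.safetensors")
num_fp8 = sum(1 for dt in dtypes.values() if dt == "F8_E4M3")
```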

Signed-off-by: lishunyang <lishunyang12@163.com>
The default export_hf_checkpoint() doesn't actually serialize weights as FP8
for unknown model types like HunyuanVideo15Transformer3DModel — it saves
BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug.

Three changes:
- Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight
  per-module to convert in-memory tensors to actual FP8.
- Save the pipeline by hand (copy source minus transformer/, then save the
  quantized transformer with hide_quantizers_from_state_dict).
- Patch transformer/config.json to inject quant_algo: FP8 + config_groups so
  vllm-omni's adapter (vllm-project#2913) auto-detects it.

Signed-off-by: lishunyang <lishunyang12@163.com>
…, not pipeline

Diffusers pipelines are ConfigMixin, not nn.Module — they don't have
.named_modules(). Pass pipe.transformer directly.
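i.e., roughly — the real calibrators use a custom quantization config with the skip patterns described above rather than mtq.FP8_DEFAULT_CFG, which is shown here only to keep the sketch self-contained:

```python
import torch
import modelopt.torch.quantization as mtq
from diffusers import DiffusionPipeline


def calibrate(model_id: str, forward_loop) -> None:
    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # Wrong: the pipeline is a ConfigMixin, not an nn.Module, so ModelOpt
    # has no .named_modules() to walk:
    #   mtq.quantize(pipe, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)

    # Right: hand ModelOpt the DiT itself.
    mtq.quantize(pipe.transformer, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```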

Signed-off-by: lishunyang <lishunyang12@163.com>
…ation fp8, not --stage-configs-path

Signed-off-by: lishunyang <lishunyang12@163.com>
@lishunyang12 lishunyang12 marked this pull request as ready for review April 19, 2026 20:59

…block

When --weight-block-size 'M,N' is given, override the weight quantizer with
block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor
instead of a scalar. Patched config_groups advertises strategy='block' +
block_structure='MxN' so consumers know what to expect.

Static FP8 is exempt from upstream vLLM's online block-wise gate, so this
just works at serving time via vllm-project#2913's adapter.

Default behavior unchanged (per-tensor) — pass --weight-block-size 128,128
to opt in.
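Concretely, for a linear with weight shape (out_features, in_features), the expected weight_scale shape per granularity is (a sketch of the arithmetic, not the quantizer code):

```python
def expected_scale_shape(out_features: int, in_features: int,
                         block_m: int | None = None,
                         block_n: int | None = None) -> tuple[int, ...]:
    """Weight-scale shape for per-tensor vs block-wise FP8 (illustrative)."""
    if block_m is None or block_n is None:
        return ()  # per-tensor: a single scalar scale
    # block_sizes={-1: N, -2: M} -> one scale per (M x N) weight tile
    return (out_features // block_m, in_features // block_n)


# e.g. a 3072x3072 linear with --weight-block-size 128,128:
assert expected_scale_shape(3072, 3072, 128, 128) == (24, 24)
assert expected_scale_shape(3072, 3072) == ()
```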

Signed-off-by: lishunyang <lishunyang12@163.com>
@lishunyang12 lishunyang12 changed the title from "[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5" to "[Quant] Phase 1 (video-gen): ModelOpt FP8" on Apr 19, 2026
…s per-block)

Reads shape info from safetensors header and classifies the checkpoint as
per-tensor / per-channel / per-block based on whether weight_scale tensors
are scalar, 1-D, or N-D. Helps verify --weight-block-size actually took
effect (or if ModelOpt silently flattened to per-tensor).

Signed-off-by: lishunyang <lishunyang12@163.com>
… granularity

ModelOpt block-wise produces shapes like [16, 1, 16, 1] where size-1 dims are
broadcasting axes. Classify by non-unity dim count: 0=per-tensor, 1=per-channel,
2+=per-block.
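As a sketch, the rule the checker applies:

```python
def classify_scale_granularity(shape: list[int]) -> str:
    """Classify a weight_scale tensor by its non-unity dims (illustrative)."""
    # ModelOpt block-wise scales look like [16, 1, 16, 1]: size-1 dims are
    # broadcasting axes, so count only dims larger than 1.
    non_unity = sum(1 for d in shape if d > 1)
    if non_unity == 0:
        return "per-tensor"
    if non_unity == 1:
        return "per-channel"
    return "per-block"


assert classify_scale_granularity([]) == "per-tensor"
assert classify_scale_granularity([3072]) == "per-channel"
assert classify_scale_granularity([16, 1, 16, 1]) == "per-block"
```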

Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 19, 2026
…V-5B

examples/quantization/quantize_wan2_2_modelopt_fp8.py:
  Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint
  for Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB BF16). Same design
  as the HunyuanVideo-1.5 calibrator (vllm-project#2924): force-export FP8 weights, patch
  quant_algo: FP8 into config.json, hide quantizers during save.
  Skips Wan2.2's precision-sensitive layers (condition_embedder, patch_embedding,
  proj_out, scale_shift_table, SP helpers). MHA quantizers off by default.

vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml:
  Stage config for serving the calibrated checkpoint via vllm-omni.

Signed-off-by: lishunyang <lishunyang12@163.com>
…ject#2920)

Threads quant_config / prefix through WanSelfAttention, WanCrossAttention,
WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and
WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines
(T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding
(Conv3d), time/text/image embedders, and proj_out stay full precision.

All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter
from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip
patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here —
that was an online-FP8 quality workaround; static calibration handles it.

Signed-off-by: lishunyang <lishunyang12@163.com>
@lishunyang12 lishunyang12 changed the title from "[Quant] Phase 1 (video-gen): ModelOpt FP8" to "[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5 and Wan2.2" on Apr 19, 2026
…n.net_0)

Wan2.2 ModelOpt FP8 checkpoint has diffusers-style dotted FFN names
(ffn.net.0.proj, ffn.net.2) but vllm-omni's WanFeedForward uses underscored
names (ffn.net_0.proj, ffn.net_2). The transformer's load_weights remaps
these for .weight tensors, but the ModelOpt adapter resolves scale tensor
names independently via WeightsMapper and was missing the remap — all 120
FFN scale tensors (30 blocks x 2 linears x 2 scales) silently fell through,
leaving FP8 weights with no valid scales at serving time (visible as pure
noise output).

Fix:
- Add hf_to_vllm_mapper class attribute on WanTransformer3DModel with the
  ffn remap.
- Extend ModelOptFp8CheckpointAdapter._get_weights_mapper to merge a model's
  hf_to_vllm_mapper (if present) into the resolution map. Models can now
  register arbitrary substring remaps via this standard vLLM attribute.

Signed-off-by: lishunyang <lishunyang12@163.com>
@hsliuustc0106 (Collaborator) left a comment

This PR is substantial (>1000 LOC / >10 files). Could you please run the L3 tests locally and paste the results here?

Once L3 test results are available, I will proceed with a full review of the ModelOpt FP8 video-gen implementation.

Helps diagnose name-mismatch between checkpoint keys and model parameters
(e.g. diffusers .ffn.net.0. vs vllm-omni .ffn.net_0.).

Signed-off-by: lishunyang <lishunyang12@163.com>
…t FP8 adapter

The adapter is instantiated with the whole Pipeline, not just the DiT. Only
checking the top-level model means hf_to_vllm_mapper defined on a sub-module
(e.g. WanTransformer3DModel inside Wan22TI2VPipeline) was invisible. Walk
named_modules() and aggregate any mappers found.

Signed-off-by: lishunyang <lishunyang12@163.com>