[Quant] Phase 1 (video-gen): ModelOpt FP8 for HunyuanVideo-1.5 and Wan2.2 #2924
Draft
lishunyang12 wants to merge 23 commits into vllm-project:main from
Conversation
Signed-off-by: roG0d <baonudesifeizhai@gmail.com>
…vllm-project#2920) Threads quant_config / prefix through HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales. Modulation, embeddings, proj_out stay raw nn.Linear (full precision). Signed-off-by: lishunyang <lishunyang12@163.com>
…eo-1.5 examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920). vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter). Signed-off-by: lishunyang <lishunyang12@163.com>
…ce_scale kwarg unsupported) HV-1.5's diffusers pipeline uses the new Guider abstraction (guider_config.json in the checkpoint) rather than a guidance_scale kwarg. Try setting it on the guider object once up front; in the per-prompt call, try with guidance_scale first and fall back without it on TypeError. Calibration only needs amax stats, so the exact CFG value isn't critical. Signed-off-by: lishunyang <lishunyang12@163.com>
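A minimal sketch of the fallback described in this commit; the helper name and the pipeline kwargs (`pipe`, `prompt`, `num_inference_steps`) are illustrative, not taken from the script:

```python
# Illustrative sketch only: helper name and pipeline kwargs are assumptions,
# not the calibration script's actual code.
def run_calibration_prompt(pipe, prompt, steps=10, cfg=5.0):
    # HV-1.5's pipeline configures CFG via a Guider object; set it once up front.
    guider = getattr(pipe, "guider", None)
    if guider is not None and hasattr(guider, "guidance_scale"):
        guider.guidance_scale = cfg
    try:
        # Older pipelines still accept the kwarg directly.
        return pipe(prompt=prompt, num_inference_steps=steps, guidance_scale=cfg)
    except TypeError:
        # Guider-based pipelines reject the kwarg; calibration only needs amax
        # stats, so the exact CFG value is not critical.
        return pipe(prompt=prompt, num_inference_steps=steps)
```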
Three checks: (A) transformer/config.json has sane quantization_config, (B) safetensors contain FP8 tensors, (C) optional disk-size delta vs BF16. Run after the quantize_*_modelopt_fp8.py scripts to spot issues before attempting to serve. Signed-off-by: lishunyang <lishunyang12@163.com>
…or view) torch's get_tensor() returns FP8 storage as bf16 views on some safetensors versions, giving false negatives. Read the on-disk dtype from the header directly — that's what actually determines whether the checkpoint is FP8. Signed-off-by: lishunyang <lishunyang12@163.com>
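A small sketch of the header-based check. The safetensors on-disk format is an 8-byte little-endian length prefix followed by a JSON header, and F8_E4M3 is the dtype string FP8 weights show up as; the function name is illustrative:

```python
import json
import struct

def count_fp8_tensors(path: str) -> int:
    """Count tensors whose on-disk dtype is FP8, reading only the header."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # little-endian u64 prefix
        header = json.loads(f.read(header_len))
    return sum(
        1
        for name, meta in header.items()
        if name != "__metadata__" and meta.get("dtype") == "F8_E4M3"
    )
```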
The default export_hf_checkpoint() doesn't actually serialize weights as FP8 for unknown model types like HunyuanVideo15Transformer3DModel — it saves BF16 placeholders. The HunyuanImage-3 calibration helper hit the same bug. Three changes: - Manually call modelopt.torch.export.unified_export_hf._export_quantized_weight per-module to convert in-memory tensors to actual FP8. - Save the pipeline by hand (copy source minus transformer/, then save the quantized transformer with hide_quantizers_from_state_dict). - Patch transformer/config.json to inject quant_algo: FP8 + config_groups so vllm-omni's adapter (vllm-project#2913) auto-detects it. Signed-off-by: lishunyang <lishunyang12@163.com>
…, not pipeline Diffusers pipelines are ConfigMixin, not nn.Module — they don't have .named_modules(). Pass pipe.transformer directly. Signed-off-by: lishunyang <lishunyang12@163.com>
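In ModelOpt terms the fix amounts to handing `mtq.quantize` the transformer module rather than the pipeline wrapper. A sketch, assuming a pre-loaded `pipe`; the calibration loop and config here are placeholders, the real script builds its own:

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Placeholder calibration loop: the real script drives the full pipeline
    # over 8 video prompts x 10 denoising steps so amax stats are collected.
    for prompt in ["A dog running across a field of golden wheat."]:
        pipe(prompt=prompt, num_inference_steps=10)

# pipe is a diffusers pipeline (ConfigMixin, no .named_modules());
# pipe.transformer is the nn.Module that ModelOpt can actually walk.
mtq.quantize(pipe.transformer, mtq.FP8_DEFAULT_CFG, forward_loop)
```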
…ation fp8, not --stage-configs-path Signed-off-by: lishunyang <lishunyang12@163.com>
…block When --weight-block-size 'M,N' is given, override the weight quantizer with block_sizes={-1: N, -2: M} so each linear gets a (out//M, in//N) scale tensor instead of a scalar. Patched config_groups advertises strategy='block' + block_structure='MxN' so consumers know what to expect. Static FP8 is exempt from upstream vLLM's online block-wise gate, so this just works at serving time via vllm-project#2913's adapter. Default behavior unchanged (per-tensor) — pass --weight-block-size 128,128 to opt in. Signed-off-by: lishunyang <lishunyang12@163.com>
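A hedged sketch of the opt-in override; the config-group key names are assumed, and the grounded part is `block_sizes` keyed by dim (-2 for output rows of size M, -1 for input columns of size N):

```python
import copy
import modelopt.torch.quantization as mtq

def make_fp8_cfg(weight_block_size=None):
    cfg = copy.deepcopy(mtq.FP8_DEFAULT_CFG)
    if weight_block_size is not None:
        m, n = (int(x) for x in weight_block_size.split(","))
        # Each linear weight then gets an (out//M, in//N) scale tensor
        # instead of a single scalar amax.
        cfg["quant_cfg"]["*weight_quantizer"]["block_sizes"] = {-1: n, -2: m}
    return cfg

cfg = make_fp8_cfg("128,128")  # equivalent of --weight-block-size 128,128
```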
…s per-block) Reads shape info from safetensors header and classifies the checkpoint as per-tensor / per-channel / per-block based on whether weight_scale tensors are scalar, 1-D, or N-D. Helps verify --weight-block-size actually took effect (or if ModelOpt silently flattened to per-tensor). Signed-off-by: lishunyang <lishunyang12@163.com>
… granularity ModelOpt block-wise produces shapes like [16, 1, 16, 1] where size-1 dims are broadcasting axes. Classify by non-unity dim count: 0=per-tensor, 1=per-channel, 2+=per-block. Signed-off-by: lishunyang <lishunyang12@163.com>
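The rule in code, as a self-contained sketch of the classifier used by the check script; the function name is illustrative:

```python
def classify_scale_granularity(shape) -> str:
    # Size-1 dims are broadcasting axes in ModelOpt block-wise scales
    # (e.g. [16, 1, 16, 1]); only non-unity dims carry granularity.
    non_unity = sum(1 for d in shape if d != 1)
    if non_unity == 0:
        return "per-tensor"   # scalar scale
    if non_unity == 1:
        return "per-channel"  # one scale per output channel
    return "per-block"        # 2-D (or higher) tiling of the weight

assert classify_scale_granularity([]) == "per-tensor"
assert classify_scale_granularity([4096, 1]) == "per-channel"
assert classify_scale_granularity([16, 1, 16, 1]) == "per-block"
```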
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request on Apr 19, 2026
…V-5B examples/quantization/quantize_wan2_2_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for Wan2.2 TI2V-5B (the dense 5B variant that fits 80GB BF16). Same design as the HunyuanVideo-1.5 calibrator (vllm-project#2924): force-export FP8 weights, patch quant_algo: FP8 into config.json, hide quantizers during save. Skips Wan2.2's precision-sensitive layers (condition_embedder, patch_embedding, proj_out, scale_shift_table, SP helpers). MHA quantizers off by default. vllm_omni/model_executor/stage_configs/wan2_2_ti2v_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Signed-off-by: lishunyang <lishunyang12@163.com>
…ject#2920) Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here — that was an online-FP8 quality workaround; static calibration handles it. Signed-off-by: lishunyang <lishunyang12@163.com>
…n.net_0) Wan2.2 ModelOpt FP8 checkpoint has diffusers-style dotted FFN names (ffn.net.0.proj, ffn.net.2) but vllm-omni's WanFeedForward uses underscored names (ffn.net_0.proj, ffn.net_2). The transformer's load_weights remaps these for .weight tensors, but the ModelOpt adapter resolves scale tensor names independently via WeightsMapper and was missing the remap — all 120 FFN scale tensors (30 blocks x 2 linears x 2 scales) silently fell through, leaving FP8 weights with no valid scales at serving time (visible as pure noise output). Fix: - Add hf_to_vllm_mapper class attribute on WanTransformer3DModel with the ffn remap. - Extend ModelOptFp8CheckpointAdapter._get_weights_mapper to merge a model's hf_to_vllm_mapper (if present) into the resolution map. Models can now register arbitrary substring remaps via this standard vLLM attribute. Signed-off-by: lishunyang <lishunyang12@163.com>
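Roughly what the class-attribute side of the fix looks like, assuming vllm-omni reuses vLLM's WeightsMapper; the substring pairs are the ones named above, and the rest of the class body is elided:

```python
import torch.nn as nn
from vllm.model_executor.models.utils import WeightsMapper

class WanTransformer3DModel(nn.Module):
    # Standard vLLM hook: checkpoint-name -> module-name substring remaps,
    # here bridging diffusers FFN names to vllm-omni's underscored ones.
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_substr={
            ".ffn.net.0.": ".ffn.net_0.",  # GELU projection linear
            ".ffn.net.2.": ".ffn.net_2.",  # output linear
        }
    )
    ...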
Collaborator hsliuustc0106 left a comment
This PR is substantial (>1000 LOC / >10 files). Could you please run the L3 tests locally and paste the results here?
Once L3 test results are available, I will proceed with a full review of the ModelOpt FP8 video-gen implementation.
Helps diagnose name-mismatch between checkpoint keys and model parameters (e.g. diffusers .ffn.net.0. vs vllm-omni .ffn.net_0.). Signed-off-by: lishunyang <lishunyang12@163.com>
…t FP8 adapter The adapter is instantiated with the whole Pipeline, not just the DiT. Only checking the top-level model means hf_to_vllm_mapper defined on a sub-module (e.g. WanTransformer3DModel inside Wan22TI2VPipeline) was invisible. Walk named_modules() and aggregate any mappers found. Signed-off-by: lishunyang <lishunyang12@163.com>
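An illustrative version of the aggregation walk; method and attribute names on the adapter side are simplified, and the grounded part is iterating named_modules() and merging every hf_to_vllm_mapper found:

```python
def collect_substr_remaps(top_level_model) -> dict:
    """Merge hf_to_vllm_mapper substring remaps from every submodule."""
    remaps = {}
    for _, module in top_level_model.named_modules():
        mapper = getattr(module, "hf_to_vllm_mapper", None)
        if mapper is None:
            continue
        remaps.update(dict(getattr(mapper, "orig_to_new_substr", {}) or {}))
    return remaps
```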
Purpose
Phase 1 of #2709 — extends ModelOpt FP8 support to video-gen models. #2913 covers Phase 1 for image-gen (Flux, Flux2-Klein, Qwen-Image, HunyuanImage-3); this PR adds the video-gen counterpart for both HunyuanVideo-1.5 and Wan2.2 TI2V-5B using the same loader infrastructure.
Builds on:
- quant_config wiring for HV-1.5 + Wan2.2 (extracted into this PR; [Quant] Wire quant_config through HunyuanVideo-1.5 and Wan2.2 DiT for online FP8 #2920 stays as online-FP8 ablation reference)

Changes
DiT wiring (extracted from #2920)
- hunyuan_video_15_transformer.py + pipelines — HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, HunyuanVideo15Transformer3DModel accept quant_config / prefix; threaded to to_qkv, to_out[0], add_kv_proj, to_add_out, ff, ff_context.
- wan2_2_transformer.py + wan2_2_vace_transformer.py + 4 pipelines — WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, WanTransformer3DModel, VACE variant. Factories (create_transformer_from_config, create_vace_transformer_from_config) accept optional quant_config.
- Modulation (nn.Linear / scale_shift_table), patch embedders (Conv3d), time/text/image embedders, proj_out, and the HV-1.5 token refiner stay full precision.
- The aggressive skip patterns from #2920 (attn1/attn2 quant_config=None on Wan2.2) are not applied here — that was an online-FP8 workaround; static calibration handles it.

ModelOpt FP8 helpers
- examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py — HV-1.5 calibrator. Force-exports FP8 weights, patches quant_algo: FP8, hides quantizers during save. MHA quantizers off by default.
- examples/quantization/quantize_wan2_2_modelopt_fp8.py — Wan2.2 TI2V-5B calibrator. Same design.
- examples/quantization/check_modelopt_fp8_export.py — verifier. Reads safetensors header dtypes, checks quant_algo: FP8, classifies scale granularity (per-tensor / per-channel / per-block).
- vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml + wan2_2_ti2v_dit_fp8.yaml — serving stage configs with auto-detect.

Adapter (this PR also fixes a general-purpose bug in #2913's adapter):
- modelopt_fp8.py: _get_weights_mapper now walks submodules to aggregate hf_to_vllm_mapper from whichever sub-module defines it. The adapter is instantiated with the whole Pipeline, so model-specific remaps (like Wan2.2's ffn.net.0. → ffn.net_0.) must be discovered on the transformer submodule, not the top-level Pipeline. Fixes silent-noise output that occurred on Wan2.2 ModelOpt FP8 before this change.
- WanTransformer3DModel.hf_to_vllm_mapper added with that remap.

Both calibrators share --weight-block-size 'M,N' for block-wise FP8, and the same fallback pattern: _force_export_quantized_weights + _patch_quant_config + hide_quantizers_from_state_dict — because ModelOpt's export_hf_checkpoint doesn't handle diffusers-video checkpoints natively.

Validation — HunyuanVideo-1.5 (1×H100 80GB, T2V 480×832, 33 frames, 30 steps, seed=42)
torch.compile enabled (default).
- factory.py: Building quantization config: fp8 → Building quantization config: modelopt — auto-detect upgraded the user's --quantization fp8 flag to ModelOpt based on quant_algo: FP8 in transformer/config.json
- data.py: Auto-detected quantization 'modelopt' from model config
- __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod — the ModelOpt FP8 kernel selected

Visual comparison — HunyuanVideo-1.5
BF16 baseline:
hv15_bf16_compiled.mp4
ModelOpt FP8 (this PR):
hv15_modelopt_fp8_compiled.mp4
Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is BF16-equivalent — no detail collapse or composition drift like the online FP8 path showed in #2920.

Validation — Wan2.2 TI2V-5B (1×H100 80GB, T2V 704×1280, 49 frames, 30 steps, seed=42)
torch.compile enabled (default).
- factory.py: Building quantization config: fp8 → modelopt (auto-detect fired)
- data.py: Auto-detected quantization 'modelopt' from model config
- __init__.py: Selected CutlassFP8ScaledMMLinearKernel for ModelOptFp8LinearMethod
- No weight_scale warnings after the hf_to_vllm_mapper fix for Wan2.2's ffn.net.0. → ffn.net_0. diffusers ↔ vllm-omni name remap.
BF16 baseline:
wan22_bf16_v4.mp4
ModelOpt FP8 (this PR):
wan22_modelopt_fp8_v4.mp4
Same prompt ("A dog running across a field of golden wheat."), same seed, same sampling params. Output is BF16-equivalent.

How to use
Pre-calibrated checkpoints are published on HF Hub so reviewers can test without recalibrating:
- shunyang90/HunyuanVideo-1.5-480p-ModelOpt-FP8
- shunyang90/Wan2.2-TI2V-5B-ModelOpt-FP8

Option A: use the published checkpoints (no calibration needed)
Option B: calibrate from BF16 yourself (reproducibility / custom prompts)
Test Plan
HunyuanVideo-1.5
- quant_algo: FP8, 648 F8_E4M3 tensors, per-tensor scale granularity
- Serving: Auto-detected quantization 'modelopt'

Wan2.2 TI2V-5B

- quant_algo: FP8, 300 F8_E4M3 tensors, per-tensor scale granularity
- Serving verified after the hf_to_vllm_mapper fix (see adapter change below)

Both

- torch.compile enabled (default) on both BF16 and FP8 for fair comparison

Known limitations
ModelOptFp8Config / ModelOptFp8LinearMethod only dispatches per-tensor scales — a block-wise checkpoint crashes at load with a shape-mismatch assertion in parameter.py:_assert_and_load. Per-tensor serving is the shippable path; --weight-block-size is kept in the calibrator for when upstream gains block-wise dispatch.

Follow-ups (still Phase 1, other video/variant coverage)
- strategy: block serving (pending upstream block-wise dispatch)
- vllm-project-org/

Depends on #2913. References #2920 (online-FP8 ablation reference, will not merge).
cc @baonudesifeizhai @hsliuustc0106 @ArtificialRay