[Quant] Wire quant_config through HunyuanVideo-1.5 and Wan2.2 DiT for online FP8#2920
Open
lishunyang12 wants to merge 11 commits into
Signed-off-by: lishunyang <lishunyang12@163.com>
1 task
Signed-off-by: lishunyang <lishunyang12@163.com>
torch.Generator handles don't cross process boundaries; the worker subprocess was using fresh RNG per generate call, producing unrelated BF16/FP8 outputs (LPIPS ~0.59). Switch to integer seed= via sampling params (pickles correctly). Use pynvml polling for peak VRAM since torch.cuda.max_memory_allocated reads 0 in the caller. Signed-off-by: lishunyang <lishunyang12@163.com>
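A minimal sketch of the peak-VRAM polling idea from the commit above, not the bench script's actual code. `PeakMemoryPoller` and `nvml_reader` are hypothetical names; the only real library calls are the pynvml ones inside `nvml_reader`, which assume the `pynvml` package and an NVIDIA GPU are present. The poller takes any zero-argument reader, so it can be exercised without a GPU.

```python
import threading
from typing import Callable


class PeakMemoryPoller:
    """Sample a memory reader on a background thread and keep the maximum."""

    def __init__(self, read_used_bytes: Callable[[], int], interval_s: float = 0.05):
        self._read = read_used_bytes
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_bytes = 0

    def _run(self) -> None:
        # Poll until the context manager signals a stop.
        while not self._stop.is_set():
            self.peak_bytes = max(self.peak_bytes, self._read())
            self._stop.wait(self._interval_s)

    def __enter__(self) -> "PeakMemoryPoller":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()
        # Take one final sample after the polling thread has stopped.
        self.peak_bytes = max(self.peak_bytes, self._read())


def nvml_reader(device_index: int = 0) -> Callable[[], int]:
    """Build a reader backed by NVML (assumes pynvml and a GPU are available)."""
    import pynvml  # lazy import so the poller itself has no GPU dependency

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    return lambda: int(pynvml.nvmlDeviceGetMemoryInfo(handle).used)
```

Wrapping the generate call in `with PeakMemoryPoller(nvml_reader(0)) as poller:` and reading `poller.peak_bytes` afterwards gives a device-wide peak. NVML reports memory used by every process on the device, so the number is only meaningful on an otherwise idle GPU, and spikes shorter than the polling interval can be missed.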
Signed-off-by: lishunyang <lishunyang12@163.com>
…n't lose earlier results Signed-off-by: lishunyang <lishunyang12@163.com>
Handle all common dim orders ([T,H,W,C], [C,T,H,W], [T,C,H,W]) and print raw/normalized shapes so anyone hitting the ValueError can see what came back. Signed-off-by: lishunyang <lishunyang12@163.com>
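As an illustration of the dim-order handling described in that commit, the sketch below infers which axis holds channels from the shape alone and returns the permutation that yields [T, H, W, C]. The function names and the rule "a channel axis has size 1, 3, or 4" are assumptions made for this example, not the bench script's actual logic.

```python
from typing import Sequence, Tuple

CHANNEL_SIZES = (1, 3, 4)  # grayscale, RGB, RGBA (assumed heuristic)


def permutation_to_thwc(shape: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return the axis permutation that reorders a 4-D video to [T, H, W, C]."""
    if len(shape) != 4:
        raise ValueError(f"expected a 4-D video, got shape {tuple(shape)}")
    if shape[3] in CHANNEL_SIZES:
        return (0, 1, 2, 3)  # already [T, H, W, C]
    if shape[1] in CHANNEL_SIZES:
        return (0, 2, 3, 1)  # [T, C, H, W]
    if shape[0] in CHANNEL_SIZES:
        return (1, 2, 3, 0)  # [C, T, H, W]
    raise ValueError(f"cannot locate a channel axis in shape {tuple(shape)}")


def normalized_shape(shape: Sequence[int]) -> Tuple[int, ...]:
    """Shape after applying the permutation; handy for raw/normalized logging."""
    perm = permutation_to_thwc(shape)
    return tuple(shape[axis] for axis in perm)
```

With a NumPy array the permutation can be applied as `video.transpose(permutation_to_thwc(video.shape))`. The heuristic is ambiguous for clips whose frame count is itself 1, 3, or 4, which is one reason printing both raw and normalized shapes is useful.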
FP8 on the text-conditioning joint attention collapses output to noise. Mirror FLUX dual-stream fix (vllm-project#2728): keep cross-attn BF16, keep self-attn and FFN quantized. Signed-off-by: lishunyang <lishunyang12@163.com>
Signed-off-by: lishunyang <lishunyang12@163.com>
Cross-attn skip alone still caused visible quality loss on long sequences (astronaut-on-Mars test at 121 frames: composition shift, detail loss). Keep attn1 full precision and quantize only FFN — same pattern as FLUX keeping dual-stream BF16 and FP8'ing single-stream only (vllm-project#2728). Memory gain is smaller but output quality matches BF16. Signed-off-by: lishunyang <lishunyang12@163.com>
Transformer gains env-var-driven preset resolution for per-role FP8 selection:
BF16 - nothing quantized
S1 - FFN only (ff + ff_context)
S2 - video stream only (to_qkv + to_out[0] + ff)
S3 - all FP8 except encoder cross-attn (keeps add_kv_proj/to_add_out BF16)
S4 - everything FP8 (default, pre-sweep behavior)
Bench script gains --presets and --frames-list to sweep the matrix in one run, caches BF16 per frame count, emits combined markdown table to results.md. Signed-off-by: lishunyang <lishunyang12@163.com>
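The preset resolution in this commit could look roughly like the sketch below. The preset contents follow the commit message and the env-var name `HV15_FP8_PRESET` comes from the PR description; the role strings, `resolve_preset`, and `quant_config_for` are hypothetical names chosen for illustration, and `to_out` stands in for `to_out[0]`.

```python
import os
from typing import FrozenSet, Mapping, Optional

# Linear-layer roles inside one HunyuanVideo-1.5 block (names assumed).
ALL_ROLES = frozenset(
    {"to_qkv", "to_out", "add_kv_proj", "to_add_out", "ff", "ff_context"}
)

PRESETS = {
    "BF16": frozenset(),                              # nothing quantized
    "S1": frozenset({"ff", "ff_context"}),            # FFN only
    "S2": frozenset({"to_qkv", "to_out", "ff"}),      # video stream only
    "S3": ALL_ROLES - {"add_kv_proj", "to_add_out"},  # skip encoder cross-attn
    "S4": ALL_ROLES,                                  # everything FP8 (default)
}


def resolve_preset(env: Optional[Mapping[str, str]] = None) -> FrozenSet[str]:
    """Return the set of roles that should be quantized to FP8."""
    env = os.environ if env is None else env
    name = env.get("HV15_FP8_PRESET", "S4").strip().upper()
    if name not in PRESETS:
        raise ValueError(f"unknown preset {name!r}; expected one of {sorted(PRESETS)}")
    return PRESETS[name]


def quant_config_for(role: str, quant_config, env: Optional[Mapping[str, str]] = None):
    """Pass quant_config through for quantized roles, None for the rest."""
    return quant_config if role in resolve_preset(env) else None
```

Each linear constructor would then receive `quant_config_for(role, quant_config)` instead of the raw config, so a role outside the active preset falls back to the unquantized path.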
Passes a dict spec ({method: fp8, weight_block_size: [M, N]}) to Omni when
specified, falls back to the 'fp8' string for per-tensor. Block-wise scales
typically recover most of the BF16-vs-FP8 quality gap at a small perf cost.
Signed-off-by: lishunyang <lishunyang12@163.com>
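A small sketch of the spec selection this commit describes: a dict when a block size is given, the plain 'fp8' string otherwise. The helper name `build_quant_spec` and the "M,N" string format for the command-line value are assumptions; only the dict keys and the string fallback come from the commit message.

```python
from typing import Dict, Optional, Union

QuantSpec = Union[str, Dict[str, object]]


def build_quant_spec(weight_block_size: Optional[str]) -> QuantSpec:
    """Map an optional 'M,N' block-size string to a quantization spec."""
    if not weight_block_size:
        return "fp8"  # per-tensor scales
    parts = [p.strip() for p in weight_block_size.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected 'M,N', got {weight_block_size!r}")
    m, n = (int(p) for p in parts)
    if m <= 0 or n <= 0:
        raise ValueError("block dimensions must be positive")
    return {"method": "fp8", "weight_block_size": [m, n]}
```

For example, `build_quant_spec("128,128")` yields `{"method": "fp8", "weight_block_size": [128, 128]}`, while `build_quant_spec(None)` yields `"fp8"`.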
lishunyang12
added a commit
to lishunyang12/vllm-omni
that referenced
this pull request
Apr 19, 2026
…vllm-project#2920) Threads quant_config / prefix through HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales. Modulation, embeddings, proj_out stay raw nn.Linear (full precision). Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12
added a commit
to lishunyang12/vllm-omni
that referenced
this pull request
Apr 19, 2026
…eo-1.5 examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920). vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter). Signed-off-by: lishunyang <lishunyang12@163.com>
15 tasks
lishunyang12
added a commit
to lishunyang12/vllm-omni
that referenced
this pull request
Apr 19, 2026
…ject#2920) Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here — that was an online-FP8 quality workaround; static calibration handles it. Signed-off-by: lishunyang <lishunyang12@163.com>
8 tasks
lishunyang12
added a commit
to lishunyang12/vllm-omni
that referenced
this pull request
May 2, 2026
…vllm-project#2920) Threads quant_config / prefix through HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, and HunyuanVideo15Transformer3DModel so the modelopt FP8 adapter from vllm-project#2913 has somewhere to bind per-layer scales. Modulation, embeddings, proj_out stay raw nn.Linear (full precision). Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12
added a commit
to lishunyang12/vllm-omni
that referenced
this pull request
May 2, 2026
…eo-1.5 examples/quantization/quantize_hunyuanvideo_15_modelopt_fp8.py: Offline calibration helper that produces a ModelOpt FP8 diffusers checkpoint for HunyuanVideo-1.5. Calibrates with 8 video prompts x 10 denoising steps, skips precision-sensitive layers (modulation, embeddings, output proj, token refiner) matching the vllm-project#2728 / vllm-project#2795 pattern, disables MHA quantizers by default (HV-1.5 self-attention degrades visibly under FP8 - see vllm-project#2920). vllm_omni/model_executor/stage_configs/hunyuan_video_15_dit_fp8.yaml: Stage config for serving the calibrated checkpoint via vllm-omni. Auto-detects ModelOpt metadata from the checkpoint (uses vllm-project#2913's adapter). Signed-off-by: lishunyang <lishunyang12@163.com>
lishunyang12
added a commit
to lishunyang12/vllm-omni
that referenced
this pull request
May 2, 2026
…ject#2920) Threads quant_config / prefix through WanSelfAttention, WanCrossAttention, WanFeedForward (+ ColumnParallelGELU), WanTransformerBlock, and WanTransformer3DModel / WanVACETransformer3DModel, plus the four pipelines (T2V / I2V / TI2V / VACE). Modulation (scale_shift_table), patch_embedding (Conv3d), time/text/image embedders, and proj_out stay full precision. All attention + FFN linears receive quant_config so the ModelOpt FP8 adapter from vllm-project#2913 can bind per-layer scales at load time. The aggressive skip patterns from vllm-project#2920 (attn1/attn2 quant_config=None) are NOT applied here — that was an online-FP8 quality workaround; static calibration handles it. Signed-off-by: lishunyang <lishunyang12@163.com>
8 tasks
Collaborator
@lishunyang12 Hello, any updates?
Collaborator
please resolve conflicts, thx
Purpose
Thread quant_config through the HunyuanVideo-1.5 and Wan2.2 DiT transformers so that --quantization fp8 actually activates the FP8 kernel. Before this PR the flag populated od_config.quantization_config but never reached the transformer's QKVParallelLinear / RowParallelLinear / FeedForward constructors: every linear picked up UnquantizedLinearMethod and the loader's _process_weights_after_loading had nothing to quantize.

Proof the original #1516 FP8 claim was a silent no-op: FP8 vs BF16 rows in that PR's latency table are within 0.1 s of each other, and model-load memory was identical. After this PR the engine log shows "Selected CutlassFP8ScaledMMLinearKernel for Fp8OnlineLinearMethod" on FP8 runs, and model-load memory drops measurably.

Pattern follows the #2728 / #2795 fix for Z-Image / Qwen-Image / FLUX.1-dev:
- quant_config= + prefix= passed to the linear constructors
- .contiguous() guard at attention entry (FP8 kernels require it)

Fixes #2912.
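To make the silent no-op concrete, here is a toy model of the threading, written without torch or vLLM. Every class here (`ToyQuantConfig`, `ToyLinear`, `ToyAttention`, `ToyBlock`, `ToyTransformer`) is a hypothetical stand-in, not the real vllm-omni code; the point is only that a linear layer which never receives `quant_config` falls back to the unquantized method, and that `prefix` gives each layer a unique name for scales to bind to.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyQuantConfig:
    method: str = "fp8"


class ToyLinear:
    """Stand-in for a parallel linear layer that picks its method at init."""

    def __init__(self, quant_config: Optional[ToyQuantConfig] = None, prefix: str = ""):
        self.prefix = prefix
        self.method = "unquantized" if quant_config is None else quant_config.method


class ToyAttention:
    def __init__(self, quant_config: Optional[ToyQuantConfig] = None, prefix: str = ""):
        self.to_qkv = ToyLinear(quant_config, prefix=f"{prefix}.to_qkv")
        self.to_out = ToyLinear(quant_config, prefix=f"{prefix}.to_out")


class ToyBlock:
    def __init__(self, quant_config: Optional[ToyQuantConfig] = None, prefix: str = ""):
        self.attn = ToyAttention(quant_config, prefix=f"{prefix}.attn")
        # Modulation deliberately never sees quant_config (stays full precision).
        self.mod = ToyLinear(None, prefix=f"{prefix}.mod")


class ToyTransformer:
    def __init__(self, num_blocks: int, quant_config: Optional[ToyQuantConfig] = None,
                 prefix: str = "transformer"):
        self.blocks = [
            ToyBlock(quant_config, prefix=f"{prefix}.blocks.{i}")
            for i in range(num_blocks)
        ]


def linear_methods(model: ToyTransformer) -> Dict[str, str]:
    """Map each linear's prefix to the method it ended up with."""
    out: Dict[str, str] = {}
    for block in model.blocks:
        for layer in (block.attn.to_qkv, block.attn.to_out, block.mod):
            out[layer.prefix] = layer.method
    return out
```

Building `ToyTransformer(1)` without a config leaves every entry of `linear_methods(...)` at "unquantized", which mirrors the pre-PR behavior. Passing `ToyQuantConfig()` flips the attention linears to "fp8" while the modulation layer stays unquantized.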
Benchmark (1×H100 80GB)
HunyuanVideo-1.5 (480p T2V, 33 frames, 30 steps)
Selected CutlassFP8ScaledMMLinearKernel ✅
Wan2.2 (TI2V-5B, 704×1280 T2V)
Seed 42, guidance 6.0 (HV-1.5) / 5.0 (Wan2.2). Tested with TI2V-5B (fits in 80 GiB BF16); A14B MoE wiring is identical but needs TP=2 and is out of scope here.
HunyuanVideo-1.5 FP8 preset ablation
Visual comparison of BF16 baseline against four per-layer FP8 presets (same prompt, same seed, 480×832, 33 frames, 30 steps). All FP8 configs reduce model-load memory; quality differences are primarily in fine detail / atmospheric texture.
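BF16 - nothing quantized
S1 - ff + ff_context
S2 - to_qkv, to_out[0], ff (video stream only); ff_context + modulation stay BF16
S3 - all FP8 except encoder cross-attn (add_kv_proj, to_add_out BF16)
S4 - everything FP8
BF16 baseline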
hv15_bf16_seed42_f33_ablation.mp4
S1 — FFN only
hv15_fp8_S1_seed42_f33_ablation.mp4
S2 — video stream only
hv15_fp8_S2_seed42_f33_ablation.mp4
S3 — all FP8 except encoder cross-attn
hv15_fp8_S3_seed42_f33_ablation.mp4
S4 — everything FP8
hv15_fp8_S4_seed42_f33_ablation.mp4
Observation: All four presets (S1 → S4) show similar quality reduction relative to BF16, with no visually decisive winner. This suggests the FFN path (common to every preset) is the primary source of FP8 drift — not attention. The shipped default is S4 (maximum memory savings, no measurable quality penalty over S1).
Known limitations
Online FP8 has visible quality reduction on video DiTs. Output stays coherent but fine detail (atmospheric depth, high-frequency texture, distant ridges) softens vs BF16. Same profile as #2795 shipped for Qwen-Image (LPIPS 0.32, threshold 0.35).

Block-wise FP8 is not available for online quantization. Fp8Config.__init__ in upstream vLLM (fp8.py:120-125) currently requires is_checkpoint_fp8_serialized=True for any weight_block_size. Block-wise would recover most of the quality gap but needs pre-quantized checkpoints. Tracked as follow-up.

Recommended use: memory-constrained workflows where the ~15% memory / ~10% speed tradeoff is worth the detail softening. For quality-critical rendering, leave --quantization none.

Model layers that remain FP8 (shipped config)
HunyuanVideo-1.5
- HunyuanVideo15Attention.to_qkv · to_out[0] · add_kv_proj · to_add_out
- FeedForward.net[0].proj (GELU) · net[2], for both ff and ff_context

Kept full precision: modulation (raw nn.Linear), AdaLayerNormZero, patch embed, proj_out, VAE, text encoders (Qwen2.5-VL + ByT5 + SigLIP), token refiner path.

Matches Qwen-Image pattern from #2795. An env-var preset mechanism (HV15_FP8_PRESET) is also added for research sweeps (see ablation above); default S4 matches the behavior above.

Wan2.2 (T2V / I2V / TI2V / VACE)
- WanFeedForward.net_0 (GELU via ColumnParallelGELU) · net_2

Kept full precision:
- WanSelfAttention.to_qkv · to_out: self-attention over long video tokens (87K+ at 704×1280×121) accumulates visible FP8 drift
- WanCrossAttention.to_q · to_k · to_v · to_out · add_k_proj · add_v_proj: text/image joint attention; mirrors FLUX dual-stream (#2728)
- proj_out, VAE, text encoder (UMT5)

Why FFN-only for Wan2.2: with full attention+FFN FP8, long-sequence outputs collapsed to noise (LPIPS 0.93). Skipping cross-attn alone reduced the regression but left composition drift (LPIPS 0.52). FFN-only is the most aggressive setting that produced stable output on long videos.
Changes
HunyuanVideo-1.5
- hunyuan_video_15_transformer.py: HunyuanVideo15Attention, HunyuanVideo15TransformerBlock, HunyuanVideo15Transformer3DModel thread quant_config / prefix to all attention + FFN linears; modulation layers stay raw nn.Linear.
- pipeline_hunyuan_video_1_5.py + pipeline_hunyuan_video_1_5_i2v.py: pass quant_config=od_config.quantization_config to transformer.

Wan2.2 (all four pipelines)
- wan2_2_transformer.py: WanFeedForward (+ ColumnParallelGELU) receives quant_config; WanSelfAttention and WanCrossAttention get quant_config=None (documented with #2728 reference).
- wan2_2_vace_transformer.py: threading inherited via super().__init__.
- create_transformer_from_config (pipeline_wan2_2.py) and create_vace_transformer_from_config (pipeline_wan2_2_vace.py) accept optional quant_config.
- Pipelines (pipeline_wan2_2.py, ..._i2v.py, ..._ti2v.py, ..._vace.py) pass od_config.quantization_config.

Bench infrastructure
- benchmarks/diffusion/bench_video_fp8.py: new; BF16 vs FP8 perf / peak-VRAM / LPIPS / PSNR / SSIM with --presets, --frames-list, --weight-block-size, MP4 output. Used to produce the tables and ablation above.

Test plan
- Selected CutlassFP8ScaledMMLinearKernel confirmed in engine log on FP8 runs

Follow-ups
- OmniDiffusionSamplingParams.seed doesn't propagate to HV-1.5's noise generator across the worker subprocess boundary, which blocks reliable LPIPS measurement. Separate bugfix PR.

cc @ArtificialRay @DarkLight1337