[Quantization] Add FP8 support for Wan 2.2 transformer and Qwen Image VAE/text encoder #1412

Closed
lishunyang12 wants to merge 1 commit into vllm-project:main from lishunyang12:feat/fp8-quant-wan22

@lishunyang12
Collaborator

@lishunyang12 lishunyang12 commented Feb 20, 2026

Summary

This PR extends FP8 quantization support to two additional model families:

  1. Wan 2.2 transformer — Thread quant_config through all parallel linear layers (same pattern as Z-Image)
  2. Qwen Image VAE & text encoder — Hook-based FP8 weight storage for HF-native layers (nn.Linear, Conv2d, Conv3d)

Subsumes #1414 (FP8 for Qwen Image VAE/encoder).

Wan 2.2 Changes

Wire quant_config from pipelines through all parallel linear layers in the Wan 2.2 video transformer, following the same pattern established by Z-Image (commit b7604ae).

| File | Change |
|---|---|
| wan2_2_transformer.py | Add quant_config param to 6 classes (ColumnParallelGELU, WanFeedForward, WanSelfAttention, WanCrossAttention, WanTransformerBlock, WanTransformer3DModel) and pass to all ColumnParallelLinear, RowParallelLinear, QKVParallelLinear layers |
| pipeline_wan2_2.py | Extract quant_config via get_vllm_quant_config_for_layers and pass to both transformer and transformer_2 |
| pipeline_wan2_2_i2v.py | Same wiring for I2V pipeline |
| pipeline_wan2_2_ti2v.py | Same wiring for TI2V pipeline |
| text_to_video.py | Add --quantization and --ignored-layers args |
| image_to_video.py | Add --quantization and --ignored-layers args |

Not quantized (same as Z-Image pattern): DistributedRMSNorm, Attention, Conv3dLayer, nn.Linear (proj_out), FP32LayerNorm, embedding layers.
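
The threading pattern described above can be sketched in plain Python. This is only an illustration under stated assumptions: `FakeQuantConfig`, `FakeParallelLinear`, `FeedForward`, `Block`, and `Model` are hypothetical stand-ins, not the classes in wan2_2_transformer.py, and no vLLM code is imported. The point is that every parent module accepts `quant_config` and forwards it unchanged to its linear children, so `quant_config=None` leaves behavior untouched.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class FakeQuantConfig:
    """Hypothetical stand-in for a real quantization config."""
    method: str = "fp8"


class FakeParallelLinear:
    """Hypothetical stand-in for a parallel linear layer."""

    def __init__(self, in_dim: int, out_dim: int,
                 quant_config: FakeQuantConfig | None = None):
        self.in_dim = in_dim
        self.out_dim = out_dim
        # A real layer would choose a quantized or plain kernel here.
        self.quant_config = quant_config


class FeedForward:
    def __init__(self, dim: int, ffn_dim: int,
                 quant_config: FakeQuantConfig | None = None):
        self.up = FakeParallelLinear(dim, ffn_dim, quant_config=quant_config)
        self.down = FakeParallelLinear(ffn_dim, dim, quant_config=quant_config)


class Block:
    def __init__(self, dim: int, ffn_dim: int,
                 quant_config: FakeQuantConfig | None = None):
        self.to_qkv = FakeParallelLinear(dim, 3 * dim, quant_config=quant_config)
        self.to_out = FakeParallelLinear(dim, dim, quant_config=quant_config)
        self.ffn = FeedForward(dim, ffn_dim, quant_config=quant_config)
        # Norm-like state never receives quant_config, so it stays unquantized.
        self.norm_eps = 1e-6


class Model:
    def __init__(self, dim: int, ffn_dim: int, num_blocks: int,
                 quant_config: FakeQuantConfig | None = None):
        self.blocks = [Block(dim, ffn_dim, quant_config) for _ in range(num_blocks)]

    def linear_layers(self) -> Iterator[FakeParallelLinear]:
        for block in self.blocks:
            yield from (block.to_qkv, block.to_out, block.ffn.up, block.ffn.down)


if __name__ == "__main__":
    cfg = FakeQuantConfig()
    model = Model(dim=8, ffn_dim=32, num_blocks=2, quant_config=cfg)
    print(all(layer.quant_config is cfg for layer in model.linear_layers()))
```

Passing the config explicitly through each constructor keeps the default path identical to the pre-change code, which is what the "quant_config=None is no-op" item in the test plan relies on.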

Qwen Image VAE/Encoder Changes

Add FP8 weight-only storage for Linear/Conv2d/Conv3d layers in the Qwen Image VAE and text encoder. Weights are stored in float8_e4m3fn with per-tensor scales and dequantized to BF16 before each forward pass — saving ~50% memory for these components.

| File | Change |
|---|---|
| models/utils.py | New apply_fp8_weight_storage() utility: quantizes weights, registers forward pre/post hooks for dequant |
| pipeline_qwen_image.py | Apply FP8 storage after VAE/text_encoder load, mark params as loaded |
| pipeline_qwen_image_edit.py | Same pattern |
| pipeline_qwen_image_edit_plus.py | Same pattern |
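
The per-tensor scaling described above can be illustrated numerically without PyTorch. The sketch below is not the PR's apply_fp8_weight_storage() (its body is not shown in this thread); `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical helpers that simulate rounding to the float8_e4m3fn grid (3 mantissa bits, largest finite value 448) and assume the scale is chosen as max|w| / 448. Saturating at 448 is a choice made for this sketch.

```python
import math

E4M3_MAX = 448.0  # largest finite value in float8_e4m3fn


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest float8_e4m3fn value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)       # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_per_tensor(weights: list[float]) -> tuple[list[float], float]:
    """Map the largest magnitude to 448 and round every element to FP8."""
    amax = max(abs(w) for w in weights)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(w / scale) for w in weights], scale


def dequantize(quantized: list[float], scale: float) -> list[float]:
    """Recover approximate weights, as a pre-forward dequant step would."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.25, 0.03, 2.0]
    q, scale = quantize_per_tensor(weights)
    for w, r in zip(weights, dequantize(q, scale)):
        print(f"{w:+.4f} -> {r:+.4f}")
```

Each stored element takes one byte instead of the two bytes BF16 needs, which is where the roughly 50% saving comes from; the single scale per tensor adds negligible overhead.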

Wan 2.2 Test Results

T2V Pipeline (Wan2.2-T2V-A14B-Diffusers)

Environment: 1x GPU, 1280×720, 81 frames, 40 steps, seed=42

| Config | Model Memory (GiB) | Generation Time (s) |
|---|---|---|
| BF16 (baseline) | 64.46 | 892.8 |
| FP8 | 38.18 | 828.0 |
| FP8 + ignored_layers=proj_out | 38.18 | 826.0 |

I2V Pipeline (Wan2.2-I2V-A14B-Diffusers)

Environment: 1x GPU, auto-resolution, 81 frames, 50 steps, seed=42

| Config | Model Memory (GiB) | Generation Time (s) |
|---|---|---|
| BF16 (baseline) | 64.46 | 301.1 |
| FP8 | 38.18 | 264.6 |

  • FP8 reduces model memory by ~26 GiB (~41%) across both pipelines
  • T2V: ~7% faster, I2V: ~12% faster
  • Visual quality is comparable
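
The percentages above follow directly from the reported numbers. A small check, using only the values from the two result tables and reading "faster" as a reduction in generation time (the helper name `percent_reduction` is just for this sketch):

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` versus `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


# Numbers copied from the T2V and I2V result tables.
MEMORY_GIB = {"bf16": 64.46, "fp8": 38.18}
GEN_TIME_S = {
    "t2v": {"bf16": 892.8, "fp8": 828.0},
    "i2v": {"bf16": 301.1, "fp8": 264.6},
}

saved_gib = MEMORY_GIB["bf16"] - MEMORY_GIB["fp8"]
memory_pct = percent_reduction(MEMORY_GIB["bf16"], MEMORY_GIB["fp8"])
time_pct = {
    name: percent_reduction(t["bf16"], t["fp8"]) for name, t in GEN_TIME_S.items()
}

print(f"memory saved: {saved_gib:.2f} GiB ({memory_pct:.1f}%)")
for name, pct in time_pct.items():
    print(f"{name}: {pct:.1f}% less generation time")
```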

Test plan

  • Lint/type check passes
  • Wan 2.2 T2V/I2V with --quantization fp8 works end-to-end
  • Without --quantization, behavior is identical (quant_config=None is no-op)
  • Qwen Image VAE/encoder FP8 weight storage works correctly
  • Pre-commit passing
wan22_fp8_quantized.mp4
wan22_fp8_ignored_layers.mp4
wan22_bf16_baseline.mp4
i2v_bf16_baseline.mp4
i2v_fp8_quantized.mp4

Collaborator Author

@lishunyang12 lishunyang12 left a comment

I will submit test results later.

Collaborator

@SamitHuang SamitHuang left a comment

This PR is clear. It should be ready to merge once the visual quality of the quantized output has been checked.

type=str,
default=None,
choices=["fp8"],
help="Quantization method for the transformer. "
Collaborator

How about the text encoder?

Collaborator Author

@lishunyang12 lishunyang12 Feb 24, 2026

Hi, thanks for the review.
The text encoder (UMT5) is not quantized here, the same as in Z-Image. Only the diffusion transformer layers get FP8. The text encoder is relatively small compared to the transformer, so quantizing it saves less memory while potentially hurting prompt embedding quality. We could add text encoder quantization as a follow-up.

@hsliuustc0106
Collaborator

@vllm-omni-reviewer

@github-actions

🤖 VLLM-Omni PR Review

Code Review: FP8 Quantization Support for Wan 2.2 Transformer

1. Overview

This PR adds FP8 quantization support for the Wan 2.2 video transformer by threading quant_config through all parallel linear layers. The implementation follows the established pattern from Z-Image (commit b7604ae), making it a well-structured, consistent addition to the codebase.

Overall Assessment: Positive - The changes are clean, consistent, and follow existing patterns. A few minor suggestions for robustness.


2. Code Quality

Strengths

  • Consistent pattern: The quant_config threading follows the exact same pattern across all 6 classes in the transformer
  • Good use of TYPE_CHECKING: Properly avoids circular imports with QuantizationConfig
  • Clear CLI help text: The argument descriptions are informative and include examples

Potential Issues

1. Inconsistent API usage in example scripts (text_to_video.py:167-173, image_to_video.py:151-157):

if args.quantization and ignored_layers:
    quant_kwargs["quantization_config"] = {
        "method": args.quantization,
        "ignored_layers": ignored_layers,
    }
elif args.quantization:
    quant_kwargs["quantization"] = args.quantization

This uses different keys (quantization_config vs quantization) depending on whether ignored_layers is provided. This could lead to confusion or bugs if Omni/OmniDiffusionConfig doesn't handle both cases identically. Consider unifying to always use one format.

2. Variable reference in print statement (image_to_video.py:189-190):

if ignored_layers:
    print(f"  Ignored layers: {ignored_layers}")

This correctly references ignored_layers which is defined at function scope, but the variable is computed before the quant_kwargs logic. This is fine, but the ordering could be clearer.


3. Architecture & Design

Strengths

  • Clean separation: Quantization config is extracted at the pipeline level and passed down through the model hierarchy
  • Non-invasive: When quant_config=None, the behavior is identical to before (no-op)
  • Comprehensive coverage: All three Wan 2.2 pipelines (T2V, I2V, TI2V) are updated consistently

Design Considerations

1. Layer exclusion pattern: The PR correctly notes that DistributedRMSNorm, Attention, Conv3dLayer, nn.Linear (proj_out), FP32LayerNorm, and embedding layers are not quantized. This matches the Z-Image pattern and is appropriate for maintaining numerical stability.

2. Missing proj_out quantization: The final output projection (proj_out) uses nn.Linear instead of RowParallelLinear. This is intentional per the PR description but worth verifying this doesn't create a bottleneck.


4. Security & Safety

No significant security concerns. The changes are purely additive and don't introduce new attack vectors.

Minor consideration: The ignored_layers argument accepts arbitrary strings without validation. Malformed patterns could lead to unexpected behavior, but this is a power-user feature and the risk is acceptable.
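
One way to address the validation concern is to flag patterns that match no module before the model is built. The sketch below is hypothetical: `parse_ignored_layers` and `unmatched_patterns` are not part of the PR, the comma-separated format is inferred from the help text example ('to_qkv,to_out'), and substring matching against module names is an assumption about how ignored layers are resolved.

```python
from __future__ import annotations


def parse_ignored_layers(raw: str | None) -> list[str]:
    """Split a comma-separated --ignored-layers value into clean names."""
    if not raw:
        return []
    return [name.strip() for name in raw.split(",") if name.strip()]


def unmatched_patterns(patterns: list[str], module_names: list[str]) -> list[str]:
    """Return patterns that match no module name (substring match assumed)."""
    return [p for p in patterns if not any(p in name for name in module_names)]


if __name__ == "__main__":
    # Module names are illustrative, not taken from the real model.
    names = ["blocks.0.attn1.to_qkv", "blocks.0.attn1.to_out", "proj_out"]
    patterns = parse_ignored_layers("to_qkv, proj_out, typo_layer")
    for pattern in unmatched_patterns(patterns, names):
        print(f"warning: ignored layer pattern {pattern!r} matches nothing")
```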


5. Testing & Documentation

Test Plan Assessment

The test plan in the PR description is adequate but could be more comprehensive:

Suggested additions:

  • Verify memory reduction with FP8 enabled (compare GPU memory usage)
  • Test with various --ignored-layers patterns
  • Verify output quality/consistency between FP8 and BF16

Documentation

  • Missing: No documentation updates for the new feature
  • Suggested: Add a brief section to any existing quantization docs or a docstring in the transformer classes explaining FP8 support

6. Specific Suggestions

wan2_2_transformer.py

Line 100-113: Consider adding a docstring to ColumnParallelGELU.__init__ explaining the quant_config parameter:

def __init__(
    self,
    dim_in: int,
    dim_out: int,
    *,
    approximate: str = "tanh",
    bias: bool = True,
    quant_config: "QuantizationConfig | None" = None,
):
    """Column parallel linear with GELU activation.
    
    Args:
        quant_config: Optional quantization config for FP8/other quantization methods.
    """

text_to_video.py and image_to_video.py

Lines 167-173 / 151-157: Consider unifying the API:

# Option A: Always use quantization_config dict
if args.quantization:
    quant_kwargs["quantization_config"] = {
        "method": args.quantization,
        **({"ignored_layers": ignored_layers} if ignored_layers else {}),
    }

# Option B: Always use separate kwargs
if args.quantization:
    quant_kwargs["quantization"] = args.quantization
    if ignored_layers:
        quant_kwargs["ignored_layers"] = ignored_layers

pipeline_wan2_2.py, pipeline_wan2_2_i2v.py, pipeline_wan2_2_ti2v.py

Line 275 / 218 / 181: The comment is good, but consider adding a log message for visibility:

quant_config = get_vllm_quant_config_for_layers(od_config.quantization_config)
if quant_config is not None:
    logger.info("Enabling quantization for transformer: %s", quant_config)

7. Approval Status

LGTM with suggestions

The PR is well-structured and follows established patterns. The core implementation is correct and consistent across all files. The suggestions above are minor improvements for robustness and maintainability, not blockers.

Recommended before merge:

  1. Verify the quantization vs quantization_config API behavior in Omni class
  2. Consider adding a log message when quantization is enabled
  3. Add basic docstrings for the new quant_config parameters

The PR achieves its stated goal of enabling FP8 quantization for Wan 2.2 transformers with minimal, focused changes.


This review was generated automatically by the VLLM-Omni PR Reviewer Bot using glm-5.

@lishunyang12
Collaborator Author

@hsliuustc0106 @SamitHuang Please take a look; I have uploaded the test results. Thanks.

@lishunyang12
Collaborator Author

lishunyang12 commented Feb 24, 2026

Regarding GLM-5's suggestions:

  1. Both quantization and quantization_config are valid — OmniDiffusionConfig.__post_init__ handles both paths and logs a warning on conflicts. The dict form is needed when passing ignored_layers.
  2. There's already a log message at config creation time (__init__.py:85): Creating diffusion quantization config: fp8
  3. The quant_config param follows the same pattern established in Z-Image and vLLM's existing parallel layers, so I'll skip the docstrings to keep the diff minimal.
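
To make the first point concrete, here is a minimal sketch of how two accepted input forms could be folded into one dict. It is not the code in OmniDiffusionConfig.__post_init__, which is not shown in this thread; `normalize_quantization` is a hypothetical helper, and letting the dict form win on a conflict is an assumption made for the example.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def normalize_quantization(
    quantization: str | None = None,
    quantization_config: dict | None = None,
) -> dict | None:
    """Fold the string form and the dict form into a single dict."""
    if quantization_config is None:
        # String-only form: wrap it so downstream code sees one shape.
        return {"method": quantization} if quantization else None

    config = dict(quantization_config)  # copy, never mutate the caller's dict
    if config.get("method") is None:
        config["method"] = quantization
    elif quantization and config["method"] != quantization:
        # Assumed policy for this sketch: the dict form takes precedence.
        logger.warning(
            "quantization=%r conflicts with quantization_config method=%r",
            quantization,
            config["method"],
        )
    return config


if __name__ == "__main__":
    print(normalize_quantization("fp8"))
    print(normalize_quantization(
        quantization_config={"method": "fp8", "ignored_layers": ["proj_out"]}
    ))
```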

@hsliuustc0106
Collaborator

This PR adds FP8 W8A8 quantization support for Wan 2.2 video transformer, enabling significant memory reduction on Ada/Hopper GPUs. The implementation follows the established Z-Image pattern consistently, threading quant_config through all 6 transformer classes and their parallel linear layers. The changes are well-structured, properly scoped (excluding text encoder and normalization layers as expected), and include comprehensive CLI support. The author has provided test results and addressed review feedback thoroughly.

 import math
 from collections.abc import Iterable
-from typing import Any
+from typing import TYPE_CHECKING, Any
Collaborator

Good use of TYPE_CHECKING to avoid runtime import overhead while maintaining type hints for QuantizationConfig. This keeps the quantization dependency optional at runtime.
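
For readers unfamiliar with the idiom, a minimal sketch of the TYPE_CHECKING pattern follows. The import path for QuantizationConfig is an assumption about vLLM's layout, and `create_layer` is a hypothetical function; because the import sits under `if TYPE_CHECKING:`, it is never executed at runtime, so the snippet runs even without vLLM installed.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Seen only by type checkers; skipped at runtime.
    # The module path below is an assumption, not confirmed by this thread.
    from vllm.model_executor.layers.quantization.base_config import (
        QuantizationConfig,
    )


def create_layer(
    dim: int, quant_config: "QuantizationConfig | None" = None
) -> dict[str, Any]:
    """Hypothetical factory; the annotation is a string, so no import is needed."""
    return {"dim": dim, "quantized": quant_config is not None}


if __name__ == "__main__":
    print(create_layer(8))
```

The quoted annotation is what lets the name stay unresolved at runtime while still being checked statically.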

from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader
from vllm_omni.diffusion.models.schedulers import FlowUniPCMultistepScheduler
from vllm_omni.diffusion.models.wan2_2.wan2_2_transformer import WanTransformer3DModel
from vllm_omni.diffusion.quantization import get_vllm_quant_config_for_layers
Collaborator

Proper integration with the existing quantization infrastructure. get_vllm_quant_config_for_layers handles the ignored_layers filtering and config validation.

 if load_transformer_2:
     transformer_2_config = load_transformer_config(model, "transformer_2", local_files_only)
-    self.transformer_2 = create_transformer_from_config(transformer_2_config)
+    self.transformer_2 = create_transformer_from_config(transformer_2_config, quant_config=quant_config)
Collaborator

Important: both transformer and transformer_2 receive the same quant_config, ensuring consistent quantization across the dual-transformer architecture.

help="Number of GPUs used for tensor parallelism (TP) inside the DiT.",
)
parser.add_argument(
"--quantization",
Collaborator

The CLI interface is well-designed with clear help text. The --quantization and --ignored-layers args provide flexibility for users to experiment with different quantization strategies.

# Check if profiling is requested via environment variable
profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

# Build quantization kwargs
Collaborator

The quantization_config dict construction properly handles both the quantization method and ignored_layers, matching the OmniDiffusionConfig expectations.



-def create_transformer_from_config(config: dict) -> WanTransformer3DModel:
+def create_transformer_from_config(config: dict, quant_config=None) -> WanTransformer3DModel:
Collaborator

Missing type annotation for quant_config parameter. Should be quant_config: QuantizationConfig | None = None to match the pattern used in the transformer classes and maintain type safety.

Collaborator Author

Fixed in the latest push. Added TYPE_CHECKING import and QuantizationConfig type annotation to match the pattern in wan2_2_transformer.py:

# Before
def create_transformer_from_config(config: dict, quant_config=None) -> WanTransformer3DModel:

# After
def create_transformer_from_config(
    config: dict, quant_config: "QuantizationConfig | None" = None
) -> WanTransformer3DModel:

@hsliuustc0106
Collaborator

examples/offline_inference/text_to_video/text_to_video.py:97

Critical: This PR adds quantization support but provides no test coverage. We need tests to verify:

  1. FP8 quantization actually reduces memory usage
  2. Output quality remains acceptable with quantization
  3. Invalid quantization configs are handled gracefully
  4. The ignored_layers parameter works correctly

Without tests, we can't validate the 'significant memory reduction' claim or prevent regressions.

@hsliuustc0106
Collaborator

Looking more critically at this PR, there are several concerns that should be addressed:

Missing Test Coverage: This adds a significant feature (FP8 quantization) with zero test coverage. We need tests to validate:

  • Memory reduction claims (before/after measurements)
  • Output quality with quantization vs without
  • Error handling for invalid configs
  • The ignored_layers functionality

Missing Performance Data: The PR claims "significant memory reduction" but the test results only show latency metrics, not actual memory usage. We need:

  • Peak memory usage comparison (FP8 vs BF16)
  • VRAM consumption measurements
  • Quality metrics (FID/CLIP scores) to ensure quantization doesn't degrade output

Type Safety: The create_transformer_from_config function has quant_config=None without type annotation, breaking the type safety pattern used elsewhere.

Documentation: No documentation added explaining:

  • When to use FP8 quantization
  • Expected memory savings
  • Quality trade-offs
  • How to use ignored_layers effectively

While the implementation follows the Z-Image pattern correctly, these gaps make it difficult to validate the feature works as intended and prevent future regressions.

@lishunyang12
Collaborator Author

lishunyang12 commented Feb 24, 2026

> examples/offline_inference/text_to_video/text_to_video.py:97
>
> Critical: This PR adds quantization support but provides no test coverage. We need tests to verify:
>
>   1. FP8 quantization actually reduces memory usage
>   2. Output quality remains acceptable with quantization
>   3. Invalid quantization configs are handled gracefully
>   4. The ignored_layers parameter works correctly
>
> Without tests, we can't validate the 'significant memory reduction' claim or prevent regressions.

Hmm, I already added test results that validate the memory reduction and show output quality consistency. False negative.

@lishunyang12
Collaborator Author

@hsliuustc0106 Let's just ignore the AI comments, as they are not valid. Docs should be provided in a separate PR.

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Summary

This PR adds FP8 quantization support to Wan 2.2 transformers by threading quant_config through all parallel linear layers. The implementation follows the established Z-Image pattern and includes comprehensive test results showing ~41% memory reduction with minimal quality impact.

Pros:

  • Clean, consistent implementation across all 6 transformer classes
  • Follows established Z-Image pattern (commit b7604ae)
  • Comprehensive test results with actual memory measurements and video outputs
  • Proper use of TYPE_CHECKING to avoid circular imports
  • Good CLI help text with examples

Cons:

  • Inconsistent API usage in example scripts (two different ways to pass quantization config)
  • Minor code duplication between text_to_video.py and image_to_video.py

Recommendation: Approve with minor suggestions for API consistency.

"Example: --ignored-layers 'to_qkv,to_out'",
)
return parser.parse_args()

Collaborator

Issue: Inconsistent API usage

The code uses two different approaches depending on whether ignored_layers is provided:

  • With ignored_layers: quantization_config dict with method and ignored_layers
  • Without: Simple quantization string

This could be confusing. Consider unifying to always use the same format:

if args.quantization:
    quant_kwargs["quantization_config"] = {
        "method": args.quantization,
        **({"ignored_layers": ignored_layers} if ignored_layers else {}),
    }

Or verify that Omni handles both formats identically.
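
The unified format could also live in one helper shared by both example scripts, which would remove the duplication between text_to_video.py and image_to_video.py. A sketch under stated assumptions: `add_quantization_args` and `build_quant_kwargs` are hypothetical names, the `--quantization` options mirror the diff excerpt shown earlier in this thread (type=str, default=None, choices=["fp8"]), and `--ignored-layers` is assumed to be a comma-separated string as in the help text example.

```python
import argparse


def add_quantization_args(parser: argparse.ArgumentParser) -> None:
    """Register the two quantization flags on an existing parser."""
    parser.add_argument(
        "--quantization",
        type=str,
        default=None,
        choices=["fp8"],
        help="Quantization method for the transformer.",
    )
    parser.add_argument(
        "--ignored-layers",
        type=str,
        default=None,
        help="Comma-separated layer names to skip. "
        "Example: --ignored-layers 'to_qkv,to_out'",
    )


def build_quant_kwargs(args: argparse.Namespace) -> dict:
    """Always emit the dict form, so callers see a single shape."""
    if not args.quantization:
        return {}
    config: dict = {"method": args.quantization}
    raw = args.ignored_layers or ""
    ignored = [name.strip() for name in raw.split(",") if name.strip()]
    if ignored:
        config["ignored_layers"] = ignored
    return {"quantization_config": config}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    add_quantization_args(parser)
    print(build_quant_kwargs(parser.parse_args()))
```

With this shape, omitting `--quantization` yields an empty dict, so unpacking the result into the constructor call changes nothing for unquantized runs.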

"Example: --ignored-layers 'to_qkv,to_out'",
)
return parser.parse_args()

Collaborator

Issue: Same inconsistent API usage

Same concern as in text_to_video.py - consider unifying the quantization config format.

@@ -28,6 +28,11 @@
SequenceParallelOutput,
)
Collaborator

Good practice: TYPE_CHECKING usage

Nice use of TYPE_CHECKING to avoid circular imports while maintaining type safety.

@@ -92,14 +97,23 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
class ColumnParallelGELU(nn.Module):
"""Column parallel linear with GELU activation."""
Collaborator

Suggestion: Add docstring

Consider adding a brief docstring explaining the quant_config parameter:

def __init__(
    self,
    dim_in: int,
    dim_out: int,
    *,
    approximate: str = "tanh",
    bias: bool = True,
    quant_config: "QuantizationConfig | None" = None,
):
    """Column parallel linear with GELU activation.
    
    Args:
        quant_config: Optional quantization config for FP8/other methods.
    """

@@ -23,10 +23,16 @@
from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader
from vllm_omni.diffusion.models.schedulers import FlowUniPCMultistepScheduler
Collaborator

Good: Consistent pattern

The quantization config extraction and threading follows the same pattern as Z-Image. This consistency makes the codebase easier to maintain.

from vllm_omni.diffusion.models.schedulers import FlowUniPCMultistepScheduler
from vllm_omni.diffusion.models.wan2_2.wan2_2_transformer import WanTransformer3DModel
from vllm_omni.diffusion.quantization import get_vllm_quant_config_for_layers
from vllm_omni.diffusion.request import OmniDiffusionRequest
Collaborator

Suggestion: Add logging

Consider adding a log message when quantization is enabled for better visibility:

quant_config = get_vllm_quant_config_for_layers(od_config.quantization_config)
if quant_config is not None:
    logger.info("Enabling quantization for Wan 2.2 transformer: %s", quant_config)

@hsliuustc0106
Collaborator

Hi @lishunyang12 👋

This FP8 quantization PR hasn't been updated for 13 days. Is this still on your radar? Let us know if you need any support.

Thanks!

@lishunyang12 lishunyang12 changed the title from "[Quantization] Add FP8 quantization support for Wan 2.2 transformer" to "[Quantization] Add FP8 support for Wan 2.2 transformer and Qwen Image VAE/text encoder" on Mar 12, 2026
@lishunyang12 lishunyang12 force-pushed the feat/fp8-quant-wan22 branch from ea18a8d to 71a9035 on March 12, 2026 14:55
@lishunyang12 lishunyang12 force-pushed the feat/fp8-quant-wan22 branch 2 times, most recently from ee61360 to b9edf6b on March 13, 2026 14:41
… VAE/text encoder

Signed-off-by: lishunyang <lishunyang12@163.com>

3 participants