3 changes: 3 additions & 0 deletions docs/.nav.yml
@@ -43,6 +43,9 @@ nav:
- Overview: user_guide/diffusion_acceleration.md
- TeaCache: user_guide/diffusion/teacache.md
- Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md
- Quantization:
    - Overview: user_guide/diffusion/quantization/overview.md
    - FP8: user_guide/diffusion/quantization/fp8.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- ComfyUI: features/comfyui.md
77 changes: 77 additions & 0 deletions docs/user_guide/diffusion/quantization/fp8.md
@@ -0,0 +1,77 @@
# FP8 Quantization

## Overview

FP8 quantization converts BF16/FP16 weights to FP8 at model load time. No calibration or pre-quantized checkpoint needed.

Depending on the model, either all layers can be quantized, or some sensitive layers should stay in BF16. See the [per-model table](#supported-models) for which case applies.
Collaborator: It would be better to further explain what the common sensitive layers are. Norm, for example?

Collaborator (Author): Agreed. Examples of sensitive layers have been added.


Common sensitive layers in DiT-based diffusion models include **image-stream MLPs** (`img_mlp`). These are particularly vulnerable to FP8 precision loss because they process denoising latents whose dynamic range shifts significantly across timesteps, and unlike attention projections (which benefit from QK-Norm stabilization), MLPs have no built-in normalization to absorb quantization error. In deep architectures (e.g., 60+ residual blocks), small per-layer errors compound and degrade output quality. Other layers such as **attention projections** (`to_qkv`, `to_out`) and **text-stream MLPs** (`txt_mlp`) are generally more robust due to normalization or more stable input statistics.
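
For intuition, one rough way to gauge how much a given weight tensor loses under FP8 is to round-trip it through `torch.float8_e4m3fn` with per-tensor dynamic scaling and measure the error. The sketch below is a standalone probe using plain PyTorch (2.1+), not part of the vLLM-Omni API:

```python
import torch


def fp8_roundtrip_error(w: torch.Tensor) -> float:
    """Mean absolute error after quantizing w to FP8 (e4m3) with per-tensor dynamic scaling."""
    w = w.float()
    # Scale so the largest magnitude maps near the e4m3fn maximum (~448).
    scale = w.abs().max().clamp(min=1e-12) / 448.0
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # quantize
    w_back = w_fp8.float() * scale               # dequantize
    return (w - w_back).abs().mean().item()


# Probe a random stand-in for a layer weight; in practice one would loop over a
# loaded transformer's state_dict and compare errors across layer names.
print(fp8_roundtrip_error(torch.randn(4096, 4096)))
```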

## Configuration

1. **Python API**: set `quantization="fp8"`. To skip sensitive layers, use `quantization_config` with `ignored_layers`.

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# All layers quantized
omni = Omni(model="<your-model>", quantization="fp8")

# Skip sensitive layers
omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "fp8",
        "ignored_layers": ["<layer-name>"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

Member: It's great that we provide flexibility here. But it means we have to maintain detailed examples for models.

Collaborator (Author): Agreed. The per-model table in the doc serves as the single source of truth. As we add models, we'll update that table with the recommended `ignored_layers`.

2. **CLI**: pass `--quantization fp8` and optionally `--ignored-layers`.

```bash
# All layers
python text_to_image.py --model <your-model> --quantization fp8

# Skip sensitive layers
python text_to_image.py --model <your-model> --quantization fp8 --ignored-layers "img_mlp"

# Online serving
vllm serve <your-model> --omni --quantization fp8
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `method` | str | — | Quantization method (`"fp8"`) |
| `ignored_layers` | list[str] | `[]` | Layer name patterns to keep in BF16 |
| `activation_scheme` | str | `"dynamic"` | `"dynamic"` (no calibration) or `"static"` |
| `weight_block_size` | list[int] \| None | `None` | Block size for block-wise weight quantization |

The available `ignored_layers` names depend on the model architecture (e.g., `to_qkv`, `to_out`, `img_mlp`, `txt_mlp`). Consult the transformer source for your target model.
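
As an illustration of the full configuration surface, here is a sketch that assumes all of the parameters above are passed through `quantization_config` (placeholder model name and example values; see the per-model table below for recommended settings):

```python
from vllm_omni import Omni

omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "fp8",
        "ignored_layers": ["img_mlp"],   # patterns kept in BF16
        "activation_scheme": "dynamic",  # no calibration needed; "static" is the alternative
        "weight_block_size": None,       # or e.g. [128, 128] for block-wise weight quantization
    },
)
```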

## Supported Models

| Model | HF Models | Recommendation | `ignored_layers` |
|-------|-----------|---------------|------------------|
| Z-Image | `Tongyi-MAI/Z-Image-Turbo` | All layers | None |
| Qwen-Image | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | Skip sensitive layers | `img_mlp` |
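
For example, following the table, minimal setups for the supported models look like this (model IDs taken from the table above):

```python
from vllm_omni import Omni

# Z-Image: all layers can be quantized.
omni = Omni(model="Tongyi-MAI/Z-Image-Turbo", quantization="fp8")

# Qwen-Image: keep the image-stream MLPs in BF16, per the recommendation above.
omni = Omni(
    model="Qwen/Qwen-Image",
    quantization_config={"method": "fp8", "ignored_layers": ["img_mlp"]},
)
```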

## Combining with Other Features

Member: We'd better add a dev doc for quantization support. It can be done in a follow-up PR.

Collaborator (Author): Agreed, will add a developer guide for adding quantization support to new models in a follow-up PR.

FP8 quantization can be combined with cache acceleration:

```python
omni = Omni(
    model="<your-model>",
    quantization="fp8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
```
17 changes: 17 additions & 0 deletions docs/user_guide/diffusion/quantization/overview.md
@@ -0,0 +1,17 @@
# Quantization for Diffusion Transformers

vLLM-Omni supports quantization of DiT linear layers to reduce memory usage and accelerate inference.

## Supported Methods

| Method | Guide |
|--------|-------|
| FP8 | [FP8](fp8.md) |

## Device Compatibility

| GPU Generation | Example GPUs | FP8 Mode |
|---------------|-------------------|----------|
| Ada/Hopper (SM 89+) | RTX 4090, H100, H200 | Full W8A8 with native hardware |

Kernel selection is automatic.
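
As a quick start, a minimal sketch (the model name is a placeholder; see the FP8 guide for per-model recommendations):

```python
from vllm_omni import Omni

# Quantize the DiT linear layers to FP8 at load time; the kernel is selected
# automatically for the detected GPU generation.
omni = Omni(model="<your-model>", quantization="fp8")
```
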
40 changes: 38 additions & 2 deletions docs/user_guide/diffusion_acceleration.md
@@ -1,6 +1,6 @@
# Diffusion Acceleration Overview

vLLM-Omni supports various acceleration methods to speed up diffusion model inference with minimal quality degradation. These include **cache methods** that intelligently cache intermediate computations to avoid redundant work across diffusion timesteps, **parallelism methods** that distribute the computation across multiple devices, and **quantization methods** that reduce memory footprint while preserving accuracy.

## Supported Acceleration Methods

@@ -14,6 +14,10 @@ vLLM-Omni currently supports two main cache acceleration backends:

Both methods can provide significant speedups (typically **1.5x-2.0x**) while maintaining high output quality.

vLLM-Omni also supports quantization methods:

3. **[FP8 Quantization](diffusion/quantization/overview.md)** - Reduces DiT linear layers from BF16 to FP8, providing ~1.28x speedup with minimal quality loss. Sensitive layers can be skipped per layer and kept in BF16.

vLLM-Omni also supports parallelism methods for diffusion models, including:

1. [Ulysses-SP](diffusion/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
@@ -35,6 +39,12 @@ vLLM-Omni also supports parallelism methods for diffusion models, including:
| **TeaCache** | `cache_backend="tea_cache"` | Simple, adaptive caching with minimal configuration | Quick setup, balanced speed/quality |
| **Cache-DiT** | `cache_backend="cache_dit"` | Advanced caching with multiple techniques (DBCache, TaylorSeer, SCM) | Maximum acceleration, fine-grained control |

### Quantization Methods

| Method | Configuration | Description | Best For |
|--------|--------------|-------------|----------|
| **FP8** | `quantization="fp8"` | FP8 W8A8 on Ada/Hopper, weight-only on older GPUs | Memory reduction, inference speedup |

## Supported Models

The following table shows which models are currently supported by each acceleration method:
@@ -58,10 +68,18 @@ The following table shows which models are currently supported by each acceleration method:

### VideoGen

| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:----------------:|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ |

### Quantization

| Model | Model Identifier | FP8 |
|-------|------------------|:---:|
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ |
| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ |


## Performance Benchmarks

@@ -272,12 +290,30 @@ outputs = omni.generate(
)
```

### Using FP8 Quantization

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="<your-model>",
    quantization="fp8",
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

## Documentation

For detailed information on each acceleration method:

- **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[FP8 Quantization Guide](diffusion/quantization/overview.md)** - FP8 quantization for DiT models with per-layer control
- **[Tensor Parallelism](diffusion/parallelism_acceleration.md#tensor-parallelism)** - Guidance on how to enable TP for diffusion models.
- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.
35 changes: 35 additions & 0 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -5,6 +5,7 @@
import os
import time
from pathlib import Path
from typing import Any

import torch

@@ -118,6 +119,24 @@ def parse_args() -> argparse.Namespace:
        default=1,
        help="Number of ready layers (blocks) to keep on GPU during generation.",
    )
    parser.add_argument(
        "--quantization",
        type=str,
        default=None,
        choices=["fp8"],
        help="Quantization method for the transformer. "
        "Options: 'fp8' (FP8 W8A8 on Ada/Hopper, weight-only on older GPUs). "
        "Default: None (no quantization, uses BF16).",
    )
    parser.add_argument(
        "--ignored-layers",
        type=str,
        default=None,
        help="Comma-separated list of layer name patterns to skip quantization. "
        "Only used when --quantization is set. "
        "Available layers: to_qkv, to_out, add_kv_proj, to_add_out, img_mlp, txt_mlp, proj_out. "
        "Example: --ignored-layers 'add_kv_proj,to_add_out'",
    )
    parser.add_argument(
        "--vae-use-slicing",
        action="store_true",
@@ -188,6 +207,18 @@ def main():
    # Check if profiling is requested via environment variable
    profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

    # Build quantization kwargs: use quantization_config dict when
    # ignored_layers is specified so the list flows through OmniDiffusionConfig
    quant_kwargs: dict[str, Any] = {}
    ignored_layers = [s.strip() for s in args.ignored_layers.split(",") if s.strip()] if args.ignored_layers else None
    if args.quantization and ignored_layers:
        quant_kwargs["quantization_config"] = {
            "method": args.quantization,
            "ignored_layers": ignored_layers,
        }
    elif args.quantization:
        quant_kwargs["quantization"] = args.quantization

    omni = Omni(
        model=args.model,
        enable_layerwise_offload=args.enable_layerwise_offload,
@@ -200,6 +231,7 @@
        parallel_config=parallel_config,
        enforce_eager=args.enforce_eager,
        enable_cpu_offload=args.enable_cpu_offload,
        **quant_kwargs,
    )

    if profiler_enabled:
@@ -212,6 +244,9 @@
print(f" Model: {args.model}")
print(f" Inference steps: {args.num_inference_steps}")
print(f" Cache backend: {args.cache_backend if args.cache_backend else 'None (no acceleration)'}")
print(f" Quantization: {args.quantization if args.quantization else 'None (BF16)'}")
if ignored_layers:
print(f" Ignored layers: {ignored_layers}")
print(
f" Parallel configuration: tensor_parallel_size={args.tensor_parallel_size}, "
f"ulysses_degree={args.ulysses_degree}, ring_degree={args.ring_degree}, cfg_parallel_size={args.cfg_parallel_size}, "
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -94,6 +94,7 @@ plugins:
exclude:
- "re:vllm_omni\\._.*" # Internal modules
- "vllm_omni.diffusion.models.qwen_image" # avoid importing vllm in mkdocs building
- "vllm_omni.diffusion.quantization" # avoid importing vllm in mkdocs building
- "vllm_omni.entrypoints.async_diffusion" # avoid importing vllm in mkdocs building
- "vllm_omni.entrypoints.openai" # avoid importing vllm in mkdocs building
- "vllm_omni.entrypoints.openai.protocol" # avoid importing vllm in mkdocs building
2 changes: 2 additions & 0 deletions tests/diffusion/quantization/__init__.py
@@ -0,0 +1,2 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project