3 changes: 3 additions & 0 deletions docs/.nav.yml
@@ -43,6 +43,9 @@ nav:
- Overview: user_guide/diffusion_acceleration.md
- TeaCache: user_guide/diffusion/teacache.md
- Cache-DiT: user_guide/diffusion/cache_dit_acceleration.md
- Quantization:
    - Overview: user_guide/diffusion/quantization/overview.md
    - FP8: user_guide/diffusion/quantization/fp8.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- ComfyUI: features/comfyui.md
77 changes: 77 additions & 0 deletions docs/user_guide/diffusion/quantization/fp8.md
@@ -0,0 +1,77 @@
# FP8 Quantization

## Overview

FP8 quantization converts BF16/FP16 weights to FP8 at model load time. No calibration or pre-quantized checkpoint needed.

Depending on the model, either all layers can be quantized, or some sensitive layers should stay in BF16. See the [per-model table](#supported-models) for which case applies.
Collaborator: It would be better to further explain what the common sensitive layers are. Norm, for example?

Collaborator (Author): Agreed. Examples of sensitive layers have been added.


Common sensitive layers in DiT-based diffusion models include **image-stream MLPs** (`img_mlp`). These are particularly vulnerable to FP8 precision loss because they process denoising latents whose dynamic range shifts significantly across timesteps, and unlike attention projections (which benefit from QK-Norm stabilization), MLPs have no built-in normalization to absorb quantization error. In deep architectures (e.g., 60+ residual blocks), small per-layer errors compound and degrade output quality. Other layers such as **attention projections** (`to_qkv`, `to_out`) and **text-stream MLPs** (`txt_mlp`) are generally more robust due to normalization or more stable input statistics.
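
For intuition, one rough way to gauge how much a given weight tensor loses under FP8 is to round-trip it through `torch.float8_e4m3fn` with per-tensor dynamic scaling and measure the error. The sketch below is a standalone probe using plain PyTorch (2.1+), not part of the vLLM-Omni API:

```python
import torch


def fp8_roundtrip_error(w: torch.Tensor) -> float:
    """Mean absolute error after quantizing w to FP8 (e4m3) with per-tensor dynamic scaling."""
    w = w.float()
    # Scale so the largest magnitude maps near the e4m3fn maximum (~448).
    scale = w.abs().max().clamp(min=1e-12) / 448.0
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # quantize
    w_back = w_fp8.float() * scale               # dequantize
    return (w - w_back).abs().mean().item()


# Probe a random stand-in for a layer weight; in practice one would loop over a
# loaded transformer's state_dict and compare errors across layer names.
print(fp8_roundtrip_error(torch.randn(4096, 4096)))
```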

## Configuration

1. **Python API**: set `quantization="fp8"`. To skip sensitive layers, use `quantization_config` with `ignored_layers`.

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# All layers quantized
omni = Omni(model="<your-model>", quantization="fp8")

# Skip sensitive layers
omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "fp8",
        "ignored_layers": ["<layer-name>"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

Member: It's great that we provide flexibility here. But it means we have to maintain detailed examples for models.

Collaborator (Author): Agreed. The per-model table in the doc serves as the single source of truth. As we add models, we'll update that table with the recommended `ignored_layers`.

2. **CLI**: pass `--quantization fp8` and optionally `--ignored-layers`.

```bash
# All layers
python text_to_image.py --model <your-model> --quantization fp8

# Skip sensitive layers
python text_to_image.py --model <your-model> --quantization fp8 --ignored-layers "img_mlp"

# Online serving
vllm serve <your-model> --omni --quantization fp8
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `method` | str | — | Quantization method (`"fp8"`) |
| `ignored_layers` | list[str] | `[]` | Layer name patterns to keep in BF16 |
| `activation_scheme` | str | `"dynamic"` | `"dynamic"` (no calibration) or `"static"` |
| `weight_block_size` | list[int] \| None | `None` | Block size for block-wise weight quantization |

The available `ignored_layers` names depend on the model architecture (e.g., `to_qkv`, `to_out`, `img_mlp`, `txt_mlp`). Consult the transformer source for your target model.
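
As an illustration of the full configuration surface, here is a sketch that assumes all of the parameters above are passed through `quantization_config` (placeholder model name and example values; see the per-model table below for recommended settings):

```python
from vllm_omni import Omni

omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "fp8",
        "ignored_layers": ["img_mlp"],   # patterns kept in BF16
        "activation_scheme": "dynamic",  # no calibration needed; "static" is the alternative
        "weight_block_size": None,       # or e.g. [128, 128] for block-wise weight quantization
    },
)
```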

## Supported Models

| Model | HF Models | Recommendation | `ignored_layers` |
|-------|-----------|---------------|------------------|
| Z-Image | `Tongyi-MAI/Z-Image-Turbo` | All layers | None |
| Qwen-Image | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | Skip sensitive layers | `img_mlp` |
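
For example, following the table, minimal setups for the supported models look like this (model IDs taken from the table above):

```python
from vllm_omni import Omni

# Z-Image: all layers can be quantized.
omni = Omni(model="Tongyi-MAI/Z-Image-Turbo", quantization="fp8")

# Qwen-Image: keep the image-stream MLPs in BF16, per the recommendation above.
omni = Omni(
    model="Qwen/Qwen-Image",
    quantization_config={"method": "fp8", "ignored_layers": ["img_mlp"]},
)
```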

## Combining with Other Features

Member: We'd better add a dev doc for quantization support. It can be done in a follow-up PR.

Collaborator (Author): Agreed, will add a developer guide for adding quantization support to new models in a follow-up PR.

FP8 quantization can be combined with cache acceleration:

```python
omni = Omni(
    model="<your-model>",
    quantization="fp8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
```
17 changes: 17 additions & 0 deletions docs/user_guide/diffusion/quantization/overview.md
@@ -0,0 +1,17 @@
# Quantization for Diffusion Transformers

vLLM-Omni supports quantization of DiT linear layers to reduce memory usage and accelerate inference.

## Supported Methods

| Method | Guide |
|--------|-------|
| FP8 | [FP8](fp8.md) |

## Device Compatibility

| GPU Generation | Example GPUs | FP8 Mode |
|---------------|-------------------|----------|
| Ada/Hopper (SM 89+) | RTX 4090, H100, H200 | Full W8A8 with native hardware |

Kernel selection is automatic.
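
As a quick start, a minimal sketch (the model name is a placeholder; see the FP8 guide for per-model recommendations):

```python
from vllm_omni import Omni

# Quantize the DiT linear layers to FP8 at load time; the kernel is selected
# automatically for the detected GPU generation.
omni = Omni(model="<your-model>", quantization="fp8")
```
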
40 changes: 38 additions & 2 deletions docs/user_guide/diffusion_acceleration.md
@@ -1,6 +1,6 @@
# Diffusion Acceleration Overview

vLLM-Omni supports various acceleration methods to speed up diffusion model inference with minimal quality degradation. These include **cache methods** that intelligently cache intermediate computations to avoid redundant work across diffusion timesteps, **parallelism methods** that distribute the computation across multiple devices, and **quantization methods** that reduce memory footprint while preserving accuracy.

## Supported Acceleration Methods

@@ -14,6 +14,10 @@ vLLM-Omni currently supports two main cache acceleration backends:

Both methods can provide significant speedups (typically **1.5x-2.0x**) while maintaining high output quality.

vLLM-Omni also supports quantization methods:

3. **[FP8 Quantization](diffusion/quantization/overview.md)** - Reduces DiT linear layers from BF16 to FP8, providing ~1.28x speedup with minimal quality loss. Sensitive layers can be skipped per layer and kept in BF16.

vLLM-Omni also supports parallelism methods for diffusion models, including:

1. [Ulysses-SP](diffusion/parallelism_acceleration.md#ulysses-sp) - splits the input along the sequence dimension and uses all-to-all communication to allow each device to compute only a subset of attention heads.
@@ -35,6 +39,12 @@ vLLM-Omni also supports parallelism methods for diffusion models, including:
| **TeaCache** | `cache_backend="tea_cache"` | Simple, adaptive caching with minimal configuration | Quick setup, balanced speed/quality |
| **Cache-DiT** | `cache_backend="cache_dit"` | Advanced caching with multiple techniques (DBCache, TaylorSeer, SCM) | Maximum acceleration, fine-grained control |

### Quantization Methods

| Method | Configuration | Description | Best For |
|--------|--------------|-------------|----------|
| **FP8** | `quantization="fp8"` | FP8 W8A8 on Ada/Hopper, weight-only on older GPUs | Memory reduction, inference speedup |

## Supported Models

The following table shows which models are currently supported by each acceleration method:
@@ -58,10 +68,18 @@ The following table shows which models are currently supported by each acceleration method:

### VideoGen

| Model | Model Identifier | TeaCache | Cache-DiT | Ulysses-SP | Ring-Attention | CFG-Parallel |
|-------|------------------|:--------:|:---------:|:----------:|:--------------:|:----------------:|
| **Wan2.2** | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | ❌ | ✅ | ✅ | ✅ | ✅ |

### Quantization

| Model | Model Identifier | FP8 |
|-------|------------------|:---:|
| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ |
| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ |
| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ |


## Performance Benchmarks

@@ -272,12 +290,30 @@ outputs = omni.generate(
)
```

### Using FP8 Quantization

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="<your-model>",
    quantization="fp8",
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

## Documentation

For detailed information on each acceleration method:

- **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- **[FP8 Quantization Guide](diffusion/quantization/overview.md)** - FP8 quantization for DiT models with per-layer control
- **[Tensor Parallelism](diffusion/parallelism_acceleration.md#tensor-parallelism)** - Guidance on how to enable TP for diffusion models.
- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.
35 changes: 35 additions & 0 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -5,6 +5,7 @@
import os
import time
from pathlib import Path
from typing import Any

import torch

@@ -118,6 +119,24 @@ def parse_args() -> argparse.Namespace:
        default=1,
        help="Number of ready layers (blocks) to keep on GPU during generation.",
    )
    parser.add_argument(
        "--quantization",
        type=str,
        default=None,
        choices=["fp8"],
        help="Quantization method for the transformer. "
        "Options: 'fp8' (FP8 W8A8 on Ada/Hopper, weight-only on older GPUs). "
        "Default: None (no quantization, uses BF16).",
    )
    parser.add_argument(
        "--ignored-layers",
        type=str,
        default=None,
        help="Comma-separated list of layer name patterns to skip quantization. "
        "Only used when --quantization is set. "
        "Available layers: to_qkv, to_out, add_kv_proj, to_add_out, img_mlp, txt_mlp, proj_out. "
        "Example: --ignored-layers 'add_kv_proj,to_add_out'",
    )
    parser.add_argument(
        "--vae-use-slicing",
        action="store_true",
@@ -188,6 +207,18 @@ def main():
    # Check if profiling is requested via environment variable
    profiler_enabled = bool(os.getenv("VLLM_TORCH_PROFILER_DIR"))

    # Build quantization kwargs: use quantization_config dict when
    # ignored_layers is specified so the list flows through OmniDiffusionConfig
    quant_kwargs: dict[str, Any] = {}
    ignored_layers = [s.strip() for s in args.ignored_layers.split(",") if s.strip()] if args.ignored_layers else None
    if args.quantization and ignored_layers:
        quant_kwargs["quantization_config"] = {
            "method": args.quantization,
            "ignored_layers": ignored_layers,
        }
    elif args.quantization:
        quant_kwargs["quantization"] = args.quantization

    omni = Omni(
        model=args.model,
        enable_layerwise_offload=args.enable_layerwise_offload,
@@ -200,6 +231,7 @@
        parallel_config=parallel_config,
        enforce_eager=args.enforce_eager,
        enable_cpu_offload=args.enable_cpu_offload,
        **quant_kwargs,
    )

    if profiler_enabled:
@@ -212,6 +244,9 @@
print(f" Model: {args.model}")
print(f" Inference steps: {args.num_inference_steps}")
print(f" Cache backend: {args.cache_backend if args.cache_backend else 'None (no acceleration)'}")
print(f" Quantization: {args.quantization if args.quantization else 'None (BF16)'}")
if ignored_layers:
print(f" Ignored layers: {ignored_layers}")
print(
f" Parallel configuration: tensor_parallel_size={args.tensor_parallel_size}, "
f"ulysses_degree={args.ulysses_degree}, ring_degree={args.ring_degree}, cfg_parallel_size={args.cfg_parallel_size}, "
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -94,6 +94,7 @@ plugins:
exclude:
- "re:vllm_omni\\._.*" # Internal modules
- "vllm_omni.diffusion.models.qwen_image" # avoid importing vllm in mkdocs building
- "vllm_omni.diffusion.quantization" # avoid importing vllm in mkdocs building
- "vllm_omni.entrypoints.async_diffusion" # avoid importing vllm in mkdocs building
- "vllm_omni.entrypoints.openai" # avoid importing vllm in mkdocs building
- "vllm_omni.entrypoints.openai.protocol" # avoid importing vllm in mkdocs building
2 changes: 2 additions & 0 deletions tests/diffusion/quantization/__init__.py
@@ -0,0 +1,2 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project