Merged
33 commits
61dee65
[Feature] Add Int8 quantization support for Z-Image and Qwen-Image
yjb767868009 Feb 25, 2026
aa3598e
fix process_weights_after_loading
yjb767868009 Feb 25, 2026
e394819
fix DiffusionInt8Config init
yjb767868009 Feb 25, 2026
f4282df
fix Int8Config's function from_config undefine Int8Config
yjb767868009 Feb 25, 2026
b1c29a4
fix format
yjb767868009 Feb 25, 2026
2a95e10
fix format
yjb767868009 Feb 25, 2026
a3fcc33
fix format
yjb767868009 Feb 25, 2026
17aab52
add quant_config_cls and fix import torch_npu
yjb767868009 Feb 26, 2026
9ecc62f
fix format
yjb767868009 Feb 26, 2026
b8cb90d
Merge branch 'main' into int8-quant
yjb767868009 Mar 3, 2026
6173d12
fix invalid character
yjb767868009 Mar 3, 2026
a117002
Merge branch 'main' into int8-quant
yjb767868009 Mar 3, 2026
6ce717a
add int8 for GPU
yjb767868009 Mar 11, 2026
d5ac438
[CI] Add scripts for benchmark collection and email distribution. (#1307)
congw729 Mar 3, 2026
b6b5842
Merge branch 'int8-quant' of https://github.com/yjb767868009/vllm-omn…
yjb767868009 Mar 11, 2026
806285b
fix import
yjb767868009 Mar 12, 2026
23b4252
raise error in int8 unsupported platform
yjb767868009 Mar 12, 2026
03a7e47
fix npu int8 process_weights_after_loading unclear & complete test_in…
yjb767868009 Mar 13, 2026
d6eec8c
fix format
yjb767868009 Mar 13, 2026
a17f9bd
fix format
yjb767868009 Mar 13, 2026
302f0b6
add smoke test & lazy weight loading
yjb767868009 Mar 17, 2026
a380398
fix import torch_npu
yjb767868009 Mar 17, 2026
aed5647
fix pytest.mark.skipif
yjb767868009 Mar 17, 2026
624460f
fix format
yjb767868009 Mar 17, 2026
19dc858
fix format
yjb767868009 Mar 17, 2026
ee91bf6
Merge branch 'main' into int8-quant
yjb767868009 Mar 18, 2026
141f715
fix problem from path updates in the vllm operator
yjb767868009 Mar 18, 2026
871952b
fix format
yjb767868009 Mar 18, 2026
516ded1
Merge branch 'main' into int8-quant
david6666666 Mar 19, 2026
d571770
Merge branch 'main' into int8-quant
david6666666 Mar 19, 2026
084db20
Fix the issue of quantization parameter passing, and add z_image as t…
yjb767868009 Mar 19, 2026
9975e1e
Merge branch 'main' into int8-quant
yjb767868009 Mar 19, 2026
e1cfe8c
fix format
yjb767868009 Mar 19, 2026
1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -58,6 +58,7 @@ nav:
- Quantization:
- Overview: user_guide/diffusion/quantization/overview.md
- FP8: user_guide/diffusion/quantization/fp8.md
- Int8: user_guide/diffusion/quantization/int8.md
- GGUF: user_guide/diffusion/quantization/gguf.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
75 changes: 75 additions & 0 deletions docs/user_guide/diffusion/quantization/int8.md
@@ -0,0 +1,75 @@
# Int8 Quantization

## Overview

Int8 quantization converts BF16/FP16 weights to Int8 at model load time. No calibration or pre-quantized checkpoint is needed.

Depending on the model, either all layers can be quantized, or some sensitive layers should stay in BF16/FP16. See the [per-model table](#supported-models) for which case applies.
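
To make the load-time conversion concrete, here is a minimal pure-Python sketch of symmetric Int8 weight quantization. The per-row scale, the `[-127, 127]` clamp, and the helper names `quantize_rows_int8` and `dequantize_rows` are assumptions made for illustration; the scale granularity and kernels vLLM-Omni actually uses are not described in this guide.

```python
def quantize_rows_int8(weight):
    """Symmetric Int8 quantization with one scale per output row (assumed granularity)."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # An all-zero row has no range to map; any scale reproduces the zeros.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Map Int8 codes back to floating point using the stored row scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Toy "BF16/FP16" weight: two output rows with very different magnitudes.
weight = [[0.30, -1.00, 0.20], [0.01, 0.03, -0.04]]
q_rows, scales = quantize_rows_int8(weight)
restored = dequantize_rows(q_rows, scales)
```

Because the scale is derived from the weights themselves, no calibration data is involved. The rounding error of each element is at most half of its row's scale, which is one way to see why rows with a small dynamic range survive quantization well.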

## Configuration

1. **Python API**: set `quantization="int8"`. To skip sensitive layers, use `quantization_config` with `ignored_layers`.

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# All layers quantized
omni = Omni(model="<your-model>", quantization="int8")

# Skip sensitive layers
omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "int8",
        "ignored_layers": ["<layer-name>"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

2. **CLI**: pass `--quantization int8` and optionally `--ignored-layers`.

```bash
# All layers
python text_to_image.py --model <your-model> --quantization int8

# Skip sensitive layers
python text_to_image.py --model <your-model> --quantization int8 --ignored-layers "img_mlp"

# Online serving
vllm serve <your-model> --omni --quantization int8
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `method` | str | — | Quantization method (`"int8"`) |
| `ignored_layers` | list[str] | `[]` | Layer name patterns to keep in BF16/FP16 |
| `activation_scheme` | str | `"dynamic"` | Activation quantization scheme; `"dynamic"` requires no calibration |
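
As a rough illustration of how the three parameters in the table fit together, the sketch below validates a `quantization_config` dictionary. `Int8ConfigSketch` and `from_dict` are hypothetical names and not the vLLM-Omni API; rejecting every `activation_scheme` other than `"dynamic"` is an assumption based on the table listing only that value.

```python
from dataclasses import dataclass, field


@dataclass
class Int8ConfigSketch:
    """Hypothetical stand-in for the real config class; not the vLLM-Omni API."""

    ignored_layers: list[str] = field(default_factory=list)
    activation_scheme: str = "dynamic"

    @classmethod
    def from_dict(cls, raw: dict) -> "Int8ConfigSketch":
        # `method` has no default in the table, so it must be given explicitly.
        if raw.get("method") != "int8":
            raise ValueError(f"expected method='int8', got {raw.get('method')!r}")
        scheme = raw.get("activation_scheme", "dynamic")
        # Assumption: only the documented "dynamic" scheme is accepted.
        if scheme != "dynamic":
            raise ValueError(f"unsupported activation_scheme: {scheme!r}")
        return cls(
            ignored_layers=list(raw.get("ignored_layers", [])),
            activation_scheme=scheme,
        )


config = Int8ConfigSketch.from_dict({"method": "int8", "ignored_layers": ["img_mlp"]})
default_config = Int8ConfigSketch.from_dict({"method": "int8"})
```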


The available `ignored_layers` names depend on the model architecture (e.g., `to_qkv`, `to_out`, `img_mlp`, `txt_mlp`). Consult the transformer source for your target model.
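
One plausible reading of "layer name patterns" is substring matching against the fully qualified module name. The sketch below assumes that rule; the helper `split_layers` and the example module paths are hypothetical, so check the real matching logic and layer names in the transformer source before relying on it.

```python
def split_layers(layer_names, ignored_layers):
    """Partition layer names into (quantized, skipped) by substring match (assumed rule)."""
    quantized, skipped = [], []
    for name in layer_names:
        if any(pattern in name for pattern in ignored_layers):
            skipped.append(name)  # stays in BF16/FP16
        else:
            quantized.append(name)  # converted to Int8
    return quantized, skipped


# Hypothetical module paths built from the layer names mentioned above.
layer_names = [
    "transformer_blocks.0.attn.to_qkv",
    "transformer_blocks.0.attn.to_out",
    "transformer_blocks.0.img_mlp.net.0",
    "transformer_blocks.0.txt_mlp.net.0",
]
quantized, skipped = split_layers(layer_names, ["img_mlp"])
```

With an empty `ignored_layers` list, every layer lands in `quantized`, which corresponds to the "All layers" recommendation.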

## Supported Models

| Model | HF Models | Recommendation | `ignored_layers` |
|-------|-----------|---------------|------------------|
| Z-Image | `Tongyi-MAI/Z-Image-Turbo` | All layers | None |
| Qwen-Image | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | All layers | None |

## Combining with Other Features

Int8 quantization can be combined with cache acceleration:

```python
omni = Omni(
    model="<your-model>",
    quantization="int8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
```
10 changes: 9 additions & 1 deletion docs/user_guide/diffusion/quantization/overview.md
@@ -7,12 +7,20 @@ vLLM-Omni supports quantization of DiT linear layers to reduce memory usage and
| Method | Guide |
|--------|-------|
| FP8 | [FP8](fp8.md) |
| Int8 | [Int8](int8.md) |
Collaborator
Now that int8 is added, should there be a matching 'Device Compatibility for Int8' section?

| GGUF | [GGUF](gguf.md) |

- ## Device Compatibility
+ ## Device Compatibility for FP8

| GPU Generation | Example GPUs | FP8 Mode |
|---------------|-------------------|----------|
| Ada/Hopper (SM 89+) | RTX 4090, H100, H200 | Full W8A8 with native hardware |

Kernel selection is automatic.

## Device Compatibility for Int8

| Device Type | Generation | Example | Int8 Mode |
|-------------|---------------|-------------------|----------|
| NVIDIA GPU | Ada/Hopper (SM 89+) | RTX 4090, H100, H200 | Full W8A8 with native hardware |
| Ascend NPU | Atlas A2/Atlas A3 | Atlas 800T A2/Atlas 900 A3 | Full W8A8 with native hardware |
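
The Int8 table can be read as a simple lookup, sketched below. The function name `listed_in_int8_table` and the device-type strings `"cuda"` and `"npu"` are hypothetical, and the sketch only reports whether a device appears in the table; how vLLM-Omni behaves on devices outside the table is not stated here.

```python
def listed_in_int8_table(device_type, compute_capability=None):
    """Return True if the device matches a row of the Int8 compatibility table."""
    if device_type == "npu":
        # Assumption: the caller has already confirmed an Atlas A2/A3 device.
        return True
    if device_type == "cuda":
        # SM 89+ corresponds to compute capability (8, 9) or newer.
        return compute_capability is not None and compute_capability >= (8, 9)
    return False


h100 = listed_in_int8_table("cuda", (9, 0))      # Hopper
rtx4090 = listed_in_int8_table("cuda", (8, 9))   # Ada
a100 = listed_in_int8_table("cuda", (8, 0))      # Ampere, not in the table
```

Comparing capability tuples keeps the check correct across major versions, which a comparison of the concatenated number `89` against `90` or `100` would only do by accident.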
32 changes: 25 additions & 7 deletions docs/user_guide/diffusion_acceleration.md
@@ -16,7 +16,7 @@ Both methods can provide significant speedups (typically **1.5x-2.0x**) while ma

vLLM-Omni also supports quantization methods:

- 3. **[FP8 Quantization](diffusion/quantization/overview.md)** - Reduces DiT linear layers from BF16 to FP8, providing ~1.28x speedup with minimal quality loss. Supports per-layer skip for sensitive layers.
+ 3. **[Quantization](diffusion/quantization/overview.md)** - Reduces DiT linear layers from BF16 to FP8 or Int8, providing ~1.28x speedup with minimal quality loss. Supports per-layer skip for sensitive layers.

vLLM-Omni also supports parallelism methods for diffusion models, including:

@@ -46,6 +46,7 @@ vLLM-Omni also supports parallelism methods for diffusion models, including:
| Method | Configuration | Description | Best For |
|--------|--------------|-------------|----------|
| **FP8** | `quantization="fp8"` | FP8 W8A8 on Ada/Hopper, weight-only on older GPUs | Memory reduction, inference speedup |
| **Int8** | `quantization="int8"` | Int8 W8A8 | Memory reduction, inference speedup |
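
"W8A8" means that both weights and activations are 8-bit during the matrix multiply. The pure-Python sketch below illustrates the idea with a dynamic activation scale computed from the live input. The helper names, the per-vector scales, and the `[-127, 127]` range are assumptions for illustration and do not describe the actual GPU or NPU kernels.

```python
def quantize_vector(values):
    """Symmetric Int8 quantization of one vector; returns (codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_linear(activation, weight_rows):
    """y = W @ x with Int8 weights and dynamically quantized Int8 activations."""
    # "Dynamic" scheme: the activation scale comes from the current input,
    # so no calibration pass is required.
    a_q, a_scale = quantize_vector(activation)
    outputs = []
    for row in weight_rows:
        # A real implementation would quantize weights once at load time.
        w_q, w_scale = quantize_vector(row)
        acc = sum(a * w for a, w in zip(a_q, w_q))  # integer accumulation
        outputs.append(acc * a_scale * w_scale)     # rescale to floating point
    return outputs


activation = [0.8, -0.3, 0.1]
weight_rows = [[0.5, -1.0, 0.3], [0.02, 0.01, -0.04]]
approx = w8a8_linear(activation, weight_rows)
exact = [sum(a * w for a, w in zip(activation, row)) for row in weight_rows]
```

Comparing `approx` with `exact` shows the small rounding error that quantization introduces, which is the quality cost traded for lower memory use and faster integer arithmetic.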

## Supported Models

@@ -81,11 +82,11 @@ The following table shows which models are currently supported by each accelerat

### Quantization

- | Model | Model Identifier | FP8 |
- |-------|------------------|:---:|
- | **Qwen-Image** | `Qwen/Qwen-Image` | ✅ |
- | **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ |
- | **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ |
+ | Model | Model Identifier | FP8 | Int8 |
+ |-------|------------------|:---:|:---:|
+ | **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ |
+ | **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ |
+ | **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ |


## Performance Benchmarks
@@ -338,13 +339,30 @@ outputs = omni.generate(
)
```

### Using Int8 Quantization

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

omni = Omni(
    model="<your-model>",
    quantization="int8",
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

## Documentation

For detailed information on each acceleration method:

- **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
- **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
- - **[FP8 Quantization Guide](diffusion/quantization/overview.md)** - FP8 quantization for DiT models with per-layer control
+ - **[Quantization Guide](diffusion/quantization/overview.md)** - Quantization for DiT models with per-layer control
- **[Tensor Parallelism](diffusion/parallelism_acceleration.md#tensor-parallelism)** - Guidance on how to enable TP for diffusion models.
- **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
- **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.
10 changes: 4 additions & 6 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -131,12 +131,10 @@ def parse_args() -> argparse.Namespace:
"--quantization",
type=str,
default=None,
- choices=["fp8", "gguf"],
- help=(
- "Quantization method for the transformer. "
- "Options: 'fp8' (FP8 W8A8), 'gguf' (GGUF quantized weights). "
- "Default: None (no quantization, uses BF16)."
- ),
+ choices=["fp8", "int8", "gguf"],
+ help="Quantization method for the transformer. "
+ "Options: 'fp8' (FP8 W8A8 on Ada/Hopper, weight-only on older GPUs), 'int8' (Int8 W8A8), 'gguf' (GGUF quantized weights). "
+ "Default: None (no quantization, uses BF16).",
Collaborator
Missing space between '...weights).' and 'Default: ...' — the concatenated string reads weights).Default:.

)
parser.add_argument(
"--gguf-model",
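
The review comment on the help string concerns Python's implicit concatenation of adjacent string literals, which inserts no separator between the pieces. The standalone sketch below reproduces the flag with a shortened help text; `QUANT_HELP` and `build_parser` are illustrative names, not part of `text_to_image.py`.

```python
import argparse

# Adjacent literals are joined with no separator, so every piece except the
# last ends with an explicit trailing space.
QUANT_HELP = (
    "Quantization method for the transformer. "
    "Options: 'fp8', 'int8', 'gguf'. "
    "Default: None (no quantization, uses BF16)."
)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser containing only the --quantization flag."""
    parser = argparse.ArgumentParser(description="Sketch of the --quantization flag")
    parser.add_argument(
        "--quantization",
        type=str,
        default=None,
        choices=["fp8", "int8", "gguf"],
        help=QUANT_HELP,
    )
    return parser


parser = build_parser()
args = parser.parse_args(["--quantization", "int8"])
no_quant = parser.parse_args([])
```

Dropping the trailing space from the second literal would produce `'gguf'.Default:` in the `--help` output, which is the kind of defect the reviewer flagged.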