vllm-project · hsliuustc0106 · Mar 19, 2026 · Feb 25, 2026 · Feb 25, 2026 · Feb 25, 2026
@@ -58,6 +58,7 @@ nav:
       - Quantization:
         - Overview: user_guide/diffusion/quantization/overview.md
         - FP8: user_guide/diffusion/quantization/fp8.md
+        - Int8: user_guide/diffusion/quantization/int8.md
         - GGUF: user_guide/diffusion/quantization/gguf.md
       - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
       - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md

@@ -0,0 +1,75 @@
+# Int8 Quantization
+
+## Overview
+
+Int8 quantization converts BF16/FP16 weights to Int8 at model load time. No calibration or pre-quantized checkpoint is needed.
+
+Depending on the model, either all layers can be quantized, or some sensitive layers should stay in BF16/FP16. See the [per-model table](#supported-models) for which case applies.
+
+## Configuration
+
+1. **Python API**: set `quantization="int8"`. To skip sensitive layers, use `quantization_config` with `ignored_layers`.
+
+```python
+from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
+
+# All layers quantized
+omni = Omni(model="<your-model>", quantization="int8")
+
+# Alternative: keep sensitive layers in BF16/FP16
+omni = Omni(
+    model="<your-model>",
+    quantization_config={
+        "method": "int8",
+        "ignored_layers": ["<layer-name>"],
+    },
+)
+
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(num_inference_steps=50),
+)
+```
+
+2. **CLI**: pass `--quantization int8` and optionally `--ignored-layers`.
+
+```bash
+# All layers
+python text_to_image.py --model <your-model> --quantization int8
+
+# Skip sensitive layers
+python text_to_image.py --model <your-model> --quantization int8 --ignored-layers "img_mlp"
+
+# Online serving
+vllm serve <your-model> --omni --quantization int8
+```
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `method` | str | — | Quantization method (`"int8"`) |
+| `ignored_layers` | list[str] | `[]` | Layer name patterns to keep in BF16/FP16 |
+| `activation_scheme` | str | `"dynamic"` | Activation scaling scheme; `"dynamic"` computes scales at runtime, so no calibration is needed |
+
+
+The available `ignored_layers` names depend on the model architecture (e.g., `to_qkv`, `to_out`, `img_mlp`, `txt_mlp`). Consult the transformer source for your target model.
+
+## Supported Models
+
+| Model | HF Models | Recommendation | `ignored_layers` |
+|-------|-----------|---------------|------------------|
+| Z-Image | `Tongyi-MAI/Z-Image-Turbo` | All layers | None |
+| Qwen-Image | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | All layers | None |
+
+## Combining with Other Features
+
+Int8 quantization can be combined with cache acceleration:
+
+```python
+omni = Omni(
+    model="<your-model>",
+    quantization="int8",
+    cache_backend="tea_cache",
+    cache_config={"rel_l1_thresh": 0.2},
+)
+```
@@ -7,12 +7,20 @@ vLLM-Omni supports quantization of DiT linear layers to reduce memory usage and
 | Method | Guide |
 |--------|-------|
 | FP8 | [FP8](fp8.md) |
+| Int8 | [Int8](int8.md) |
 | GGUF | [GGUF](gguf.md) |
 
-## Device Compatibility
+## Device Compatibility for FP8
 
 | GPU Generation | Example GPUs | FP8 Mode |
 |---------------|-------------------|----------|
 | Ada/Hopper (SM 89+) | RTX 4090, H100, H200 | Full W8A8 with native hardware |
 
 Kernel selection is automatic.
+
+## Device Compatibility for Int8
+
+| Device Type | Generation | Example | Int8 Mode |
+|-------------|---------------|-------------------|----------|
+| NVIDIA GPU | Ada/Hopper (SM 89+) | RTX 4090, H100, H200 | Full W8A8 with native hardware support |
+| Ascend NPU | Atlas A2/Atlas A3 | Atlas 800T A2/Atlas 900 A3 | Full W8A8 with native hardware support |
@@ -16,7 +16,7 @@ Both methods can provide significant speedups (typically **1.5x-2.0x**) while ma
 
 vLLM-Omni also supports quantization methods:
 
-3. **[FP8 Quantization](diffusion/quantization/overview.md)** - Reduces DiT linear layers from BF16 to FP8, providing ~1.28x speedup with minimal quality loss. Supports per-layer skip for sensitive layers.
+3. **[Quantization](diffusion/quantization/overview.md)** - Reduces DiT linear layers from BF16 to FP8 or Int8 with minimal quality loss (~1.28x speedup reported for FP8). Supports per-layer skip for sensitive layers.
 
 vLLM-Omni also supports parallelism methods for diffusion models, including:
 
@@ -46,6 +46,7 @@ vLLM-Omni also supports parallelism methods for diffusion models, including:
 | Method | Configuration | Description | Best For |
 |--------|--------------|-------------|----------|
 | **FP8** | `quantization="fp8"` | FP8 W8A8 on Ada/Hopper, weight-only on older GPUs | Memory reduction, inference speedup |
+| **Int8** | `quantization="int8"` | Int8 W8A8 on Ada/Hopper GPUs and Ascend Atlas A2/A3 NPUs | Memory reduction, inference speedup |
 
 ## Supported Models
 
@@ -81,11 +82,11 @@ The following table shows which models are currently supported by each accelerat
 
 ### Quantization
 
-| Model | Model Identifier | FP8 |
-|-------|------------------|:---:|
-| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ |
-| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ |
-| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ |
+| Model | Model Identifier | FP8 | Int8 |
+|-------|------------------|:---:|:---:|
+| **Qwen-Image** | `Qwen/Qwen-Image` | ✅ | ✅ |
+| **Qwen-Image-2512** | `Qwen/Qwen-Image-2512` | ✅ | ✅ |
+| **Z-Image** | `Tongyi-MAI/Z-Image-Turbo` | ✅ | ✅ |
 
 
 ## Performance Benchmarks
@@ -338,13 +339,30 @@ outputs = omni.generate(
 )
 ```
 
+### Using Int8 Quantization
+
+```python
+from vllm_omni import Omni
+from vllm_omni.inputs.data import OmniDiffusionSamplingParams
+
+omni = Omni(
+    model="<your-model>",
+    quantization="int8",
+)
+
+outputs = omni.generate(
+    "A cat sitting on a windowsill",
+    OmniDiffusionSamplingParams(num_inference_steps=50),
+)
+```
+
 ## Documentation
 
 For detailed information on each acceleration method:
 
 - **[TeaCache Guide](diffusion/teacache.md)** - Complete TeaCache documentation, configuration options, and best practices
 - **[Cache-DiT Acceleration Guide](diffusion/cache_dit_acceleration.md)** - Comprehensive Cache-DiT guide covering DBCache, TaylorSeer, SCM, and configuration parameters
-- **[FP8 Quantization Guide](diffusion/quantization/overview.md)** - FP8 quantization for DiT models with per-layer control
+- **[Quantization Guide](diffusion/quantization/overview.md)** - Quantization for DiT models with per-layer control
 - **[Tensor Parallelism](diffusion/parallelism_acceleration.md#tensor-parallelism)** - Guidance on how to enable TP for diffusion models.
 - **[Sequence Parallelism](diffusion/parallelism_acceleration.md#sequence-parallelism)** - Guidance on how to set sequence parallelism with configuration.
 - **[CFG-Parallel](diffusion/parallelism_acceleration.md#cfg-parallel)** - Guidance on how to set CFG-Parallel to run positive/negative branches across ranks.
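
Note on the docs above: they describe Int8 as a load-time weight conversion with a `"dynamic"` activation scheme, but do not show what W8A8 computes. The following is a minimal pure-Python sketch of that idea, not the vLLM-Omni kernels. It assumes symmetric absmax scaling, one scale per weight row, and one scale per input vector; those choices, and the helper names `quantize_row` and `int8_linear`, are illustrative assumptions that do not come from this PR.

```python
from typing import List, Tuple

INT8_MAX = 127  # symmetric range [-127, 127]; an assumption of this sketch


def quantize_row(row: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one vector to int8 values plus a scale."""
    absmax = max(abs(v) for v in row)
    # Guard against an all-zero vector so the scale is never zero.
    scale = absmax / INT8_MAX if absmax > 0 else 1.0
    q = [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in row]
    return q, scale


def int8_linear(x: List[float], weight: List[List[float]]) -> List[float]:
    """Compute y = W @ x with int8 weights and int8 activations (W8A8)."""
    # Weights: one scale per output row. In a real engine this happens once,
    # at model load time, which is why no calibration data is required.
    q_weight = [quantize_row(row) for row in weight]
    # Activations: the scale comes from the live input (the "dynamic" part).
    q_x, x_scale = quantize_row(x)
    out = []
    for q_row, w_scale in q_weight:
        acc = sum(qw * qx for qw, qx in zip(q_row, q_x))  # integer accumulate
        out.append(acc * w_scale * x_scale)  # dequantize back to float
    return out


if __name__ == "__main__":
    weight = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
    x = [1.0, 2.0, -4.0]
    exact = [sum(w * v for w, v in zip(row, x)) for row in weight]
    print("int8 :", int8_linear(x, weight))
    print("exact:", exact)
```

Running the script prints the quantized result next to the full-precision one. The gap between them is the rounding error that `ignored_layers` is meant to avoid for sensitive layers.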

@@ -131,12 +131,10 @@ def parse_args() -> argparse.Namespace:
         "--quantization",
         type=str,
         default=None,
-        choices=["fp8", "gguf"],
-        help=(
-            "Quantization method for the transformer. "
-            "Options: 'fp8' (FP8 W8A8), 'gguf' (GGUF quantized weights). "
-            "Default: None (no quantization, uses BF16)."
-        ),
+        choices=["fp8", "int8", "gguf"],
+        help="Quantization method for the transformer. "
+        "Options: 'fp8' (FP8 W8A8 on Ada/Hopper, weight-only on older GPUs), 'int8' (Int8 W8A8), 'gguf' (GGUF quantized weights). "
+        "Default: None (no quantization, uses BF16).",
     )
     parser.add_argument(
         "--gguf-model",