Merged
Changes from all commits (39 commits)
- 8a55f0a fix (roG0d, Apr 17, 2026)
- 3fb34bf fix (roG0d, Apr 17, 2026)
- 001daa4 refactoring (baonudesifeizhai, Apr 18, 2026)
- 6cdd2fc refactoring (baonudesifeizhai, Apr 18, 2026)
- 98566b4 continue refacoring (baonudesifeizhai, Apr 18, 2026)
- c75d8e2 fix huawei (baonudesifeizhai, Apr 19, 2026)
- 3f0b05a fix zimage rope problem (baonudesifeizhai, Apr 20, 2026)
- 0d08cf4 add e2e test (baonudesifeizhai, Apr 20, 2026)
- c1b470c fix qwen image online server (baonudesifeizhai, Apr 20, 2026)
- 955cc4b fix pytest e2e problem and remove enforce eager true (baonudesifeizhai, Apr 20, 2026)
- eff7b61 refactoring for test (baonudesifeizhai, Apr 20, 2026)
- 9aa53e1 fix conflict (baonudesifeizhai, Apr 20, 2026)
- 3f313e5 fix conflict (baonudesifeizhai, Apr 23, 2026)
- b9596fb fix online running problem (baonudesifeizhai, Apr 23, 2026)
- aff7c6a fix (baonudesifeizhai, Apr 25, 2026)
- 593ac95 rechange to cutlass kernel (baonudesifeizhai, Apr 25, 2026)
- 8f65a87 fix start upfusedmoe problem (baonudesifeizhai, Apr 25, 2026)
- 8aff0b7 refactoring' (baonudesifeizhai, Apr 25, 2026)
- 5dc7e76 add convert example and accuarcy test (baonudesifeizhai, Apr 26, 2026)
- 9fa04fc dd convert example and accuarcy test (baonudesifeizhai, Apr 26, 2026)
- 4b2644a fix reviewer (baonudesifeizhai, Apr 28, 2026)
- e88c277 remove useless test (baonudesifeizhai, Apr 28, 2026)
- d16a2c2 add commets (baonudesifeizhai, Apr 28, 2026)
- b7d58b9 change commit (baonudesifeizhai, Apr 29, 2026)
- 8cd88f8 fix by reviewer change (baonudesifeizhai, May 5, 2026)
- 0cdef6d move to right place (baonudesifeizhai, May 5, 2026)
- ca9b329 change (baonudesifeizhai, May 5, 2026)
- 5c75fe0 remove (baonudesifeizhai, May 5, 2026)
- 0d885ed rename to modelopt.py (baonudesifeizhai, May 5, 2026)
- 818c509 Merge branch 'main' into omni2709 (baonudesifeizhai, May 5, 2026)
- a2ac766 fix (baonudesifeizhai, May 5, 2026)
- 89cf452 change config (baonudesifeizhai, May 5, 2026)
- 22b3f3a Merge remote-tracking branch 'upstream/main' into omni2709 (baonudesifeizhai, May 6, 2026)
- c932b03 fix (baonudesifeizhai, May 8, 2026)
- 677f6a1 remove unrelated change (baonudesifeizhai, May 8, 2026)
- 96f9080 change to cli (baonudesifeizhai, May 8, 2026)
- 9dd0b3b add doc (baonudesifeizhai, May 8, 2026)
- eff5468 add to feizhai12 huggingface (baonudesifeizhai, May 9, 2026)
- 2bf7356 fix ci (baonudesifeizhai, May 9, 2026)
1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -78,6 +78,7 @@ nav:
- Online Quantization: user_guide/quantization/online.md
- FP8 W8A8: user_guide/quantization/fp8.md
- Int8 W8A8: user_guide/quantization/int8.md
- ModelOpt: user_guide/quantization/modelopt.md
- GGUF: user_guide/quantization/gguf.md
- AutoRound: user_guide/quantization/autoround.md
- msModelSlim: user_guide/quantization/msmodelslim.md
12 changes: 6 additions & 6 deletions docs/user_guide/quantization/fp8.md
@@ -2,11 +2,11 @@

## Overview

FP8 quantization converts BF16/FP16 weights to FP8 at model load time, or loads
a checkpoint whose target stage already declares an FP8 quantization config.
Online activation scaling is the default and does not require calibration.
Static activation scaling is supported when calibrated scale information is
available.
FP8 quantization converts BF16/FP16 weights to FP8 at model load time. Online
activation scaling is the default and does not require calibration. Static
activation scaling is supported when calibrated scale information is available.
For ModelOpt-produced pre-quantized checkpoints, see
[ModelOpt Quantization](modelopt.md).

Some architectures can quantize all linear layers. Others have
quality-sensitive layers that should stay in BF16 through `ignored_layers`.
@@ -46,7 +46,7 @@ guide. FP8 on Ampere may use a weight-only path where available.

| Model | Scope | Format | Status |
|-------|-------|--------|--------|
| Qwen3-Omni | Thinker language-model stage | ModelOpt `quant_algo=FP8` | Tested for thinker memory reduction |
| Qwen3-Omni | Thinker language-model stage | [ModelOpt](modelopt.md) `quant_algo=FP8` | Tested for thinker memory reduction |
| Qwen3-TTS | TTS language-model stage | Checkpoint config | Not validated |

Audio encoder, vision encoder, talker, and code2wav stay in BF16 unless a
138 changes: 138 additions & 0 deletions docs/user_guide/quantization/modelopt.md
@@ -0,0 +1,138 @@
# ModelOpt Quantization

## Overview

ModelOpt quantization loads checkpoints produced by NVIDIA ModelOpt. The
quantized weights and scale tensors are generated before serving, so inference
does not run online calibration or convert a BF16 checkpoint at startup.

vLLM-Omni currently validates the ModelOpt FP8 checkpoint path for diffusion
transformers. The loader auto-detects supported ModelOpt FP8 checkpoint configs
and keeps non-transformer components, such as the tokenizer, scheduler, text
encoder, and VAE, on the base checkpoint unless a model-specific guide says
otherwise.

!!! note
    `--force-cutlass-fp8` is an explicit runtime override for diffusion
    checkpoints that already carry a supported ModelOpt FP8 config. It does not
    quantize BF16 checkpoints and it does not apply to online
    `--quantization fp8`. The flag only takes effect for ModelOpt FP8 diffusion
    stages on CUDA SM89+ devices; other platforms and non-ModelOpt FP8 paths
    fall back to the normal vLLM kernel selection.
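
The override the note describes reduces to a small predicate. Below is a
minimal sketch of that gating rule, assuming a hypothetical `use_cutlass_fp8`
helper; the actual kernel-selection code inside vLLM-Omni is internal and may
differ.

```python
# Hypothetical sketch of the override rule in the note above; not the
# actual vLLM-Omni kernel-selection code.
def use_cutlass_fp8(force_flag: bool, stage_is_modelopt_fp8: bool, sm_version: int) -> bool:
    # The flag only applies to ModelOpt FP8 diffusion stages on CUDA SM89+;
    # every other path keeps the normal vLLM kernel selection.
    return force_flag and stage_is_modelopt_fp8 and sm_version >= 89
```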

## Supported ModelOpt Checkpoint Formats

vLLM-Omni treats ModelOpt checkpoints as pre-quantized checkpoints. The
checkpoint config must identify ModelOpt as the quantization method or producer,
and the quantization algorithm must be one of the validated FP8 algorithms.

| Checkpoint field | Supported value |
|------------------|-----------------|
| `method` / `quant_method` | `modelopt` |
| `producer.name` | `modelopt` |
| `quant_algo` | `FP8`, `FP8_PER_CHANNEL_PER_TOKEN` |

Other ModelOpt algorithms, such as NVFP4, are not enabled by this diffusion
FP8 path until they have separate model and quality validation.
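
Putting the config fields and the algorithm restriction together, detection
amounts to a small check on the checkpoint's quantization config. A minimal
sketch, assuming a hypothetical `is_supported_modelopt_fp8` helper and a flat
dict layout (the real adapter API may differ):

```python
# Sketch of the detection rule implied by the table above. The helper name
# and dict layout are illustrative, not the actual loader API.
SUPPORTED_FP8_ALGOS = {"FP8", "FP8_PER_CHANNEL_PER_TOKEN"}

def is_supported_modelopt_fp8(quant_config: dict) -> bool:
    producer = quant_config.get("producer") or {}
    method = (
        quant_config.get("method")
        or quant_config.get("quant_method")
        or producer.get("name")
    )
    return method == "modelopt" and quant_config.get("quant_algo") in SUPPORTED_FP8_ALGOS
```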

## Hardware Support

| Device | Support |
|--------|---------|
| NVIDIA Blackwell GPU (SM 100+) | ✅ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ |
| NVIDIA Ampere GPU (SM 80+) | ⭕ |
| AMD ROCm | ⭕ |
| Intel XPU | ⭕ |
| Ascend NPU | ❌ |

Legend: `✅` supported, `❌` unsupported, `⭕` not verified in this guide.
The optional CUTLASS FP8 runtime override requires CUDA SM89+.

## Model Type Support

### Diffusion Model

| Model | HF checkpoint | Scope | Status |
|-------|---------------|-------|--------|
| Qwen-Image 2512 | `feizhai123/qwen-image-2512-modelopt-fp8-dynamic-all` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| Z-Image | `feizhai123/z-image-modelopt-fp8-conservative` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| FLUX.2-dev | `feizhai123/flux2-dev-modelopt-fp8` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| FLUX.2-klein 4B | `feizhai123/flux2-klein-4b-modelopt-fp8` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| HunyuanImage-3.0 | `feizhai123/hunyuan-image3-modelopt-fp8` | MoE diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| Wan2.2 | Not available | Diffusion transformer | Not validated |

### Multi-Stage Omni/TTS Model

| Model | Scope | Status |
|-------|-------|--------|
| Qwen3-Omni | Thinker language-model stage | ModelOpt FP8 checkpoint path |
| Qwen3-TTS | TTS language-model stage | Not validated |

Audio encoder, vision encoder, talker, and code2wav stages stay in BF16 unless
a model-specific guide documents otherwise.

### Multi-Stage Diffusion Model

ModelOpt checkpoints must be routed to the stage whose checkpoint contains the
ModelOpt `quantization_config`. BAGEL and GLM-Image are not listed as validated
ModelOpt targets yet.

## Configuration

For pre-quantized ModelOpt FP8 checkpoints, no `--quantization fp8` flag is
needed. The checkpoint config selects the ModelOpt path.

Online serving:

```bash
vllm serve <modelopt-fp8-checkpoint> \
    --omni \
    --tensor-parallel-size <N> \
    --force-cutlass-fp8
```

Offline inference:

```bash
python examples/offline_inference/text_to_image/text_to_image.py \
    --model <modelopt-fp8-checkpoint> \
    --tensor-parallel-size <N> \
    --prompt "a red ceramic teapot on a wooden table" \
    --height 1024 \
    --width 1024 \
    --num-inference-steps 20 \
    --seed 42 \
    --output outputs/modelopt_fp8.png
```

Python API:

```python
from vllm_omni import Omni

omni = Omni(
    model="<modelopt-fp8-checkpoint>",
    tensor_parallel_size=2,
    force_cutlass_fp8=True,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `force_cutlass_fp8` / `--force-cutlass-fp8` | bool | `False` | Force CUTLASS FP8 linear kernels for supported ModelOpt FP8 diffusion stages on CUDA SM89+ |

## Validation and Notes

1. Compare the ModelOpt FP8 checkpoint against the BF16 baseline with the same
   prompt, resolution, seed, and inference steps.
2. Use `tests/diffusion/quantization/test_quantization_quality.py` with
   `VLLM_OMNI_QUALITY_CONFIGS` to validate local baseline and quantized model
   paths (see the invocation sketch after this list).
3. Report LPIPS, PSNR, MAE, throughput, latency, and peak memory when adding a
   new validated ModelOpt diffusion checkpoint (an LPIPS sketch follows below).
4. Keep `--quantization fp8` for online FP8 from BF16 checkpoints; use this
   ModelOpt path only when the checkpoint already contains ModelOpt FP8 weights
   and scales.
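
For note 2, a local run could look like the following, assuming
`VLLM_OMNI_QUALITY_CONFIGS` accepts a path to a local config file; check the
test module for the exact format it expects:

```bash
VLLM_OMNI_QUALITY_CONFIGS=/path/to/quality_configs.json \
    pytest tests/diffusion/quantization/test_quantization_quality.py
```

For the LPIPS number in note 3, a standalone sketch using the third-party
`lpips` package (not part of the vLLM-Omni test suite; file paths are
illustrative):

```python
# Compare a BF16 baseline image against the ModelOpt FP8 output with LPIPS.
import lpips
import numpy as np
import torch
from PIL import Image


def load_image(path: str) -> torch.Tensor:
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    return torch.from_numpy(img / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0)


loss_fn = lpips.LPIPS(net="alex")
with torch.no_grad():
    score = loss_fn(
        load_image("outputs/bf16_baseline.png"),
        load_image("outputs/modelopt_fp8.png"),
    )
print(f"LPIPS: {score.item():.4f}")
```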
24 changes: 13 additions & 11 deletions docs/user_guide/quantization/overview.md
@@ -10,18 +10,18 @@ type has a different quantization scope.
| Mode | Guide | Description | Methods |
|------|-------|-------------|---------|
| Online quantization | [Online Quantization](online.md) | vLLM-Omni computes quantized weights and scales while loading the model. | FP8 W8A8, Int8 W8A8 |
| Pre-quantized checkpoints | Method-specific guides | The checkpoint or an offline quantizer provides quantized weights and scales before serving. | GGUF, AutoRound, msModelSlim, serialized Int8 |
| Pre-quantized checkpoints | Method-specific guides | The checkpoint or an offline quantizer provides quantized weights and scales before serving. | ModelOpt, GGUF, AutoRound, msModelSlim, serialized Int8 |

## Hardware Support

| Device | FP8 W8A8 | Int8 W8A8 | GGUF | AutoRound | msModelSlim |
|--------|----------|-----------|------|-----------|-------------|
| NVIDIA Blackwell GPU (SM 100+) | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ampere GPU (SM 80+) | ✅ | ✅ | ✅ | ✅ | ❌ |
| AMD ROCm | ⭕ | ⭕ | ⭕ | ⭕ | ❌ |
| Intel XPU | ⭕ | ⭕ | ⭕ | ✅ | ❌ |
| Ascend NPU | ❌ | ✅ | ❌ | ❌ | ✅ |
| Device | FP8 W8A8 | Int8 W8A8 | ModelOpt | GGUF | AutoRound | msModelSlim |
|--------|----------|-----------|----------|------|-----------|-------------|
| NVIDIA Blackwell GPU (SM 100+) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ampere GPU (SM 80+) | ✅ | ✅ | ⭕ | ✅ | ✅ | ❌ |
| AMD ROCm | ⭕ | ⭕ | ⭕ | ⭕ | ⭕ | ❌ |
| Intel XPU | ⭕ | ⭕ | ⭕ | ⭕ | ✅ | ❌ |
| Ascend NPU | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |

Legend: `✅` supported, `❌` unsupported, `⭕` not verified in this
guide. FP8 on Ampere may use a weight-only path where available.
@@ -39,6 +39,7 @@ otherwise.
|--------|-------|------|----------------|--------|
| FP8 W8A8 | [FP8](fp8.md) | Online W8A8 or checkpoint FP8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image family and other DiT models |
| Int8 W8A8 | [Int8](int8.md) | Online or serialized W8A8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image and Z-Image |
| ModelOpt | [ModelOpt](modelopt.md) | Pre-quantized FP8 checkpoints | Qwen-Image, Z-Image, FLUX.2, HunyuanImage-3.0 | Validated for ModelOpt FP8 diffusion checkpoints |
| GGUF | [GGUF](gguf.md) | Pre-quantized transformer weights | Qwen-Image | Validated where a model-specific GGUF adapter exists |
| AutoRound | [AutoRound](autoround.md) | Pre-quantized W4A16 checkpoints | FLUX.1-dev; Qwen-Image/Wan2.2 not validated | Checkpoint-driven |
| msModelSlim | [msModelSlim](msmodelslim.md) | Pre-quantized Ascend checkpoints | Wan2.2 recipe; HunyuanImage-3.0 inference target | Ascend/NPU path |
@@ -52,7 +53,7 @@ in BF16 unless the model guide explicitly adds support.

| Method | Guide | Scope | Example models | Status |
|--------|-------|-------|----------------|--------|
| FP8 | [FP8](fp8.md) | Thinker or language-model checkpoint config | Qwen3-Omni thinker | ModelOpt checkpoint path |
| ModelOpt | [ModelOpt](modelopt.md) | Thinker or language-model checkpoint config | Qwen3-Omni thinker | ModelOpt checkpoint path |
| Int8 | [Int8](int8.md) | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| GGUF | [GGUF](gguf.md) | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| AutoRound | [AutoRound](autoround.md) | Thinker or language-model checkpoint config | Qwen2.5-Omni, Qwen3-Omni | Supported through AutoRound checkpoints |
@@ -67,6 +68,7 @@ attached to the intended stage rather than applied globally.
|--------|-------|-------|----------------|--------|
| FP8 | [FP8](fp8.md) | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Requires model-specific validation |
| Int8 | [Int8](int8.md) | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Requires model-specific validation |
| ModelOpt | [ModelOpt](modelopt.md) | Checkpoint-defined diffusion stage | BAGEL, GLM-Image | Requires model-specific validation |
| GGUF | [GGUF](gguf.md) | Stage-specific transformer weights | BAGEL, GLM-Image | No validated adapter listed |
| AutoRound | [AutoRound](autoround.md) | Checkpoint-defined stage | BAGEL, GLM-Image | No validated checkpoint listed |
| msModelSlim | [msModelSlim](msmodelslim.md) | Ascend-generated stage weights | GLM-Image | Requires model-specific adaptation |
@@ -94,7 +96,7 @@ config = build_quant_config({

| Component | Default quantized? | Notes |
|-----------|--------------------|-------|
| Diffusion transformer | Yes | Primary target for FP8, Int8, GGUF, AutoRound, and msModelSlim |
| Diffusion transformer | Yes | Primary target for FP8, Int8, ModelOpt, GGUF, AutoRound, and msModelSlim |
| Text encoder | No | Keep BF16 unless a method-specific guide documents support |
| VAE | No | Keep BF16; storage-only paths are method-specific |
| Scheduler/tokenizer | No | Loaded from the base model repository |
94 changes: 94 additions & 0 deletions tests/diffusion/model_loader/test_modelopt_fp8_adapter.py
@@ -0,0 +1,94 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from types import SimpleNamespace

import pytest
import torch
import torch.nn as nn

from vllm_omni.diffusion.model_loader.checkpoint_adapters import (
    ModelOptFp8CheckpointAdapter,
)

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu]


class _PackedModelOptModel(nn.Module):
    # Target model whose packed to_qkv projection stays in full precision.
    def __init__(self) -> None:
        super().__init__()
        self.transformer = nn.Module()
        self.transformer.block = nn.Module()
        self.transformer.block.to_qkv = nn.Linear(2, 2, bias=False)


class _QuantizedPackedModelOptModel(nn.Module):
    # Target model whose packed to_qkv projection keeps FP8 weights and scales.
    def __init__(self) -> None:
        super().__init__()
        self.transformer = nn.Module()
        self.transformer.block = nn.Module()
        self.transformer.block.to_qkv = nn.Module()
        self.transformer.block.to_qkv.register_parameter(
            "weight",
            nn.Parameter(torch.empty(2, 2, dtype=torch.float8_e4m3fn), requires_grad=False),
        )
        self.transformer.block.to_qkv.register_parameter(
            "weight_scale",
            nn.Parameter(torch.empty(1), requires_grad=False),
        )
        self.transformer.block.to_qkv.register_parameter(
            "input_scale",
            nn.Parameter(torch.empty(1), requires_grad=False),
        )


def _make_source() -> SimpleNamespace:
    return SimpleNamespace(
        subfolder="transformer",
        prefix="transformer.",
    )


def test_modelopt_adapter_dequantizes_fp8_weight_for_full_precision_target():
    model = _PackedModelOptModel()
    adapter = ModelOptFp8CheckpointAdapter(model, _make_source())
    fp8_weight = torch.tensor([[2.0, -4.0], [1.0, 3.0]], dtype=torch.float32).to(torch.float8_e4m3fn)
    scale = torch.tensor([0.5], dtype=torch.float32)

    adapted = list(
        adapter.adapt(
            iter(
                [
                    ("transformer.block.to_q.weight_scale", scale),
                    ("transformer.block.to_q.input_scale", torch.tensor([1.0])),
                    ("transformer.block.to_q.weight", fp8_weight),
                ]
            )
        )
    )

    # The scale tensors are consumed; only the dequantized weight is emitted,
    # computed as weight * weight_scale and cast to the target module's dtype.
    assert [name for name, _ in adapted] == ["transformer.block.to_q.weight"]
    assert adapted[0][1].dtype == model.transformer.block.to_qkv.weight.dtype
    assert torch.allclose(adapted[0][1], fp8_weight.to(torch.float32) * scale)


def test_modelopt_adapter_keeps_scale_tensors_for_quantized_target():
    model = _QuantizedPackedModelOptModel()
    adapter = ModelOptFp8CheckpointAdapter(model, _make_source())
    scale = torch.tensor([0.5], dtype=torch.float32)

    adapted = list(
        adapter.adapt(
            iter(
                [
                    ("transformer.block.to_q.weight_scale", scale),
                    ("transformer.block.to_q.input_scale", torch.tensor([1.0])),
                ]
            )
        )
    )

    # The quantized target consumes FP8 weights directly, so the adapter
    # passes the scale tensors through unchanged.
    assert [name for name, _ in adapted] == [
        "transformer.block.to_q.weight_scale",
        "transformer.block.to_q.input_scale",
    ]
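
Both adapter tests carry the `cpu` mark via `pytestmark`, so a quick GPU-free
run should be possible, assuming the repository registers its pytest markers
in the usual way:

```bash
pytest tests/diffusion/model_loader/test_modelopt_fp8_adapter.py -m cpu -v
```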