Merged
Changes from all commits (39 commits)
- 8a55f0a fix (roG0d, Apr 17, 2026)
- 3fb34bf fix (roG0d, Apr 17, 2026)
- 001daa4 refactoring (baonudesifeizhai, Apr 18, 2026)
- 6cdd2fc refactoring (baonudesifeizhai, Apr 18, 2026)
- 98566b4 continue refacoring (baonudesifeizhai, Apr 18, 2026)
- c75d8e2 fix huawei (baonudesifeizhai, Apr 19, 2026)
- 3f0b05a fix zimage rope problem (baonudesifeizhai, Apr 20, 2026)
- 0d08cf4 add e2e test (baonudesifeizhai, Apr 20, 2026)
- c1b470c fix qwen image online server (baonudesifeizhai, Apr 20, 2026)
- 955cc4b fix pytest e2e problem and remove enforce eager true (baonudesifeizhai, Apr 20, 2026)
- eff7b61 refactoring for test (baonudesifeizhai, Apr 20, 2026)
- 9aa53e1 fix conflict (baonudesifeizhai, Apr 20, 2026)
- 3f313e5 fix conflict (baonudesifeizhai, Apr 23, 2026)
- b9596fb fix online running problem (baonudesifeizhai, Apr 23, 2026)
- aff7c6a fix (baonudesifeizhai, Apr 25, 2026)
- 593ac95 rechange to cutlass kernel (baonudesifeizhai, Apr 25, 2026)
- 8f65a87 fix start upfusedmoe problem (baonudesifeizhai, Apr 25, 2026)
- 8aff0b7 refactoring' (baonudesifeizhai, Apr 25, 2026)
- 5dc7e76 add convert example and accuarcy test (baonudesifeizhai, Apr 26, 2026)
- 9fa04fc dd convert example and accuarcy test (baonudesifeizhai, Apr 26, 2026)
- 4b2644a fix reviewer (baonudesifeizhai, Apr 28, 2026)
- e88c277 remove useless test (baonudesifeizhai, Apr 28, 2026)
- d16a2c2 add commets (baonudesifeizhai, Apr 28, 2026)
- b7d58b9 change commit (baonudesifeizhai, Apr 29, 2026)
- 8cd88f8 fix by reviewer change (baonudesifeizhai, May 5, 2026)
- 0cdef6d move to right place (baonudesifeizhai, May 5, 2026)
- ca9b329 change (baonudesifeizhai, May 5, 2026)
- 5c75fe0 remove (baonudesifeizhai, May 5, 2026)
- 0d885ed rename to modelopt.py (baonudesifeizhai, May 5, 2026)
- 818c509 Merge branch 'main' into omni2709 (baonudesifeizhai, May 5, 2026)
- a2ac766 fix (baonudesifeizhai, May 5, 2026)
- 89cf452 change config (baonudesifeizhai, May 5, 2026)
- 22b3f3a Merge remote-tracking branch 'upstream/main' into omni2709 (baonudesifeizhai, May 6, 2026)
- c932b03 fix (baonudesifeizhai, May 8, 2026)
- 677f6a1 remove unrelated change (baonudesifeizhai, May 8, 2026)
- 96f9080 change to cli (baonudesifeizhai, May 8, 2026)
- 9dd0b3b add doc (baonudesifeizhai, May 8, 2026)
- eff5468 add to feizhai12 huggingface (baonudesifeizhai, May 9, 2026)
- 2bf7356 fix ci (baonudesifeizhai, May 9, 2026)
1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -78,6 +78,7 @@ nav:
- Online Quantization: user_guide/quantization/online.md
- FP8 W8A8: user_guide/quantization/fp8.md
- Int8 W8A8: user_guide/quantization/int8.md
- ModelOpt: user_guide/quantization/modelopt.md
- GGUF: user_guide/quantization/gguf.md
- AutoRound: user_guide/quantization/autoround.md
- msModelSlim: user_guide/quantization/msmodelslim.md
12 changes: 6 additions & 6 deletions docs/user_guide/quantization/fp8.md
@@ -2,11 +2,11 @@

## Overview

FP8 quantization converts BF16/FP16 weights to FP8 at model load time, or loads
a checkpoint whose target stage already declares an FP8 quantization config.
Online activation scaling is the default and does not require calibration.
Static activation scaling is supported when calibrated scale information is
available.
FP8 quantization converts BF16/FP16 weights to FP8 at model load time. Online
activation scaling is the default and does not require calibration. Static
activation scaling is supported when calibrated scale information is available.
For ModelOpt-produced pre-quantized checkpoints, see
[ModelOpt Quantization](modelopt.md).

Some architectures can quantize all linear layers. Others have
quality-sensitive layers that should stay in BF16 through `ignored_layers`.
@@ -46,7 +46,7 @@ guide. FP8 on Ampere may use a weight-only path where available.

| Model | Scope | Format | Status |
|-------|-------|--------|--------|
| Qwen3-Omni | Thinker language-model stage | ModelOpt `quant_algo=FP8` | Tested for thinker memory reduction |
| Qwen3-Omni | Thinker language-model stage | [ModelOpt](modelopt.md) `quant_algo=FP8` | Tested for thinker memory reduction |
| Qwen3-TTS | TTS language-model stage | Checkpoint config | Not validated |

Audio encoder, vision encoder, talker, and code2wav stay in BF16 unless a
138 changes: 138 additions & 0 deletions docs/user_guide/quantization/modelopt.md
@@ -0,0 +1,138 @@
# ModelOpt Quantization

## Overview

ModelOpt quantization loads checkpoints produced by NVIDIA ModelOpt. The
quantized weights and scale tensors are generated before serving, so inference
does not run online calibration or convert a BF16 checkpoint at startup.

vLLM-Omni currently validates the ModelOpt FP8 checkpoint path for diffusion
transformers. The loader auto-detects supported ModelOpt FP8 checkpoint configs
and keeps non-transformer components, such as the tokenizer, scheduler, text
encoder, and VAE, on the base checkpoint unless a model-specific guide says
otherwise.

!!! note
    `--force-cutlass-fp8` is an explicit runtime override for diffusion
    checkpoints that already carry a supported ModelOpt FP8 config. It does not
    quantize BF16 checkpoints and it does not apply to online
    `--quantization fp8`. The flag only takes effect for ModelOpt FP8 diffusion
    stages on CUDA SM89+ devices; other platforms and non-ModelOpt FP8 paths
    fall back to the normal vLLM kernel selection.
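
The override the note describes reduces to a small predicate. Below is a
minimal sketch of that gating rule, assuming a hypothetical `use_cutlass_fp8`
helper; the actual kernel-selection code inside vLLM-Omni is internal and may
differ.

```python
# Hypothetical sketch of the override rule in the note above; not the
# actual vLLM-Omni kernel-selection code.
def use_cutlass_fp8(force_flag: bool, stage_is_modelopt_fp8: bool, sm_version: int) -> bool:
    # The flag only applies to ModelOpt FP8 diffusion stages on CUDA SM89+;
    # every other path keeps the normal vLLM kernel selection.
    return force_flag and stage_is_modelopt_fp8 and sm_version >= 89
```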

## Supported ModelOpt Checkpoint Formats

vLLM-Omni treats ModelOpt checkpoints as pre-quantized checkpoints. The
checkpoint config must identify ModelOpt as the quantization method or producer,
and the quantization algorithm must be one of the validated FP8 algorithms.

| Checkpoint field | Supported value |
|------------------|-----------------|
| `method` / `quant_method` | `modelopt` |
| `producer.name` | `modelopt` |
| `quant_algo` | `FP8`, `FP8_PER_CHANNEL_PER_TOKEN` |

Other ModelOpt algorithms, such as NVFP4, are not enabled by this diffusion
FP8 path until they have separate model and quality validation.
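
Putting the config fields and the algorithm restriction together, detection
amounts to a small check on the checkpoint's quantization config. A minimal
sketch, assuming a hypothetical `is_supported_modelopt_fp8` helper and a flat
dict layout (the real adapter API may differ):

```python
# Sketch of the detection rule implied by the table above. The helper name
# and dict layout are illustrative, not the actual loader API.
SUPPORTED_FP8_ALGOS = {"FP8", "FP8_PER_CHANNEL_PER_TOKEN"}

def is_supported_modelopt_fp8(quant_config: dict) -> bool:
    producer = quant_config.get("producer") or {}
    method = (
        quant_config.get("method")
        or quant_config.get("quant_method")
        or producer.get("name")
    )
    return method == "modelopt" and quant_config.get("quant_algo") in SUPPORTED_FP8_ALGOS
```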

## Hardware Support

| Device | Support |
|--------|---------|
| NVIDIA Blackwell GPU (SM 100+) | ✅ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ |
| NVIDIA Ampere GPU (SM 80+) | ⭕ |
| AMD ROCm | ⭕ |
| Intel XPU | ⭕ |
| Ascend NPU | ❌ |

Legend: `✅` supported, `❌` unsupported, `⭕` not verified in this guide.
The optional CUTLASS FP8 runtime override requires CUDA SM89+.

## Model Type Support

### Diffusion Model

| Model | HF checkpoint | Scope | Status |
|-------|---------------|-------|--------|
| Qwen-Image 2512 | `feizhai123/qwen-image-2512-modelopt-fp8-dynamic-all` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| Z-Image | `feizhai123/z-image-modelopt-fp8-conservative` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| FLUX.2-dev | `feizhai123/flux2-dev-modelopt-fp8` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| FLUX.2-klein 4B | `feizhai123/flux2-klein-4b-modelopt-fp8` | Diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| HunyuanImage-3.0 | `feizhai123/hunyuan-image3-modelopt-fp8` | MoE diffusion transformer | Validated for ModelOpt FP8 checkpoints |
| Wan2.2 | Not available | Diffusion transformer | Not validated |

### Multi-Stage Omni/TTS Model

| Model | Scope | Status |
|-------|-------|--------|
| Qwen3-Omni | Thinker language-model stage | ModelOpt FP8 checkpoint path |
| Qwen3-TTS | TTS language-model stage | Not validated |

Audio encoder, vision encoder, talker, and code2wav stages stay in BF16 unless
a model-specific guide documents otherwise.

### Multi-Stage Diffusion Model

ModelOpt checkpoints must be routed to the stage whose checkpoint contains the
ModelOpt `quantization_config`. BAGEL and GLM-Image are not listed as validated
ModelOpt targets yet.

## Configuration

For pre-quantized ModelOpt FP8 checkpoints, no `--quantization fp8` flag is
needed. The checkpoint config selects the ModelOpt path.

Online serving:

```bash
vllm serve <modelopt-fp8-checkpoint> \
    --omni \
    --tensor-parallel-size <N> \
    --force-cutlass-fp8
```

Offline inference:

```bash
python examples/offline_inference/text_to_image/text_to_image.py \
    --model <modelopt-fp8-checkpoint> \
    --tensor-parallel-size <N> \
    --prompt "a red ceramic teapot on a wooden table" \
    --height 1024 \
    --width 1024 \
    --num-inference-steps 20 \
    --seed 42 \
    --output outputs/modelopt_fp8.png
```

Python API:

```python
from vllm_omni import Omni

omni = Omni(
    model="<modelopt-fp8-checkpoint>",
    tensor_parallel_size=2,
    force_cutlass_fp8=True,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `force_cutlass_fp8` / `--force-cutlass-fp8` | bool | `False` | Force CUTLASS FP8 linear kernels for supported ModelOpt FP8 diffusion stages on CUDA SM89+ |

## Validation and Notes

1. Compare the ModelOpt FP8 checkpoint against the BF16 baseline with the same
   prompt, resolution, seed, and inference steps.
2. Use `tests/diffusion/quantization/test_quantization_quality.py` with
   `VLLM_OMNI_QUALITY_CONFIGS` to validate local baseline and quantized model
   paths (see the invocation sketch after this list).
3. Report LPIPS, PSNR, MAE, throughput, latency, and peak memory when adding a
   new validated ModelOpt diffusion checkpoint (an LPIPS sketch follows below).
4. Keep `--quantization fp8` for online FP8 from BF16 checkpoints; use this
   ModelOpt path only when the checkpoint already contains ModelOpt FP8 weights
   and scales.
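
For note 2, a local run could look like the following, assuming
`VLLM_OMNI_QUALITY_CONFIGS` accepts a path to a local config file; check the
test module for the exact format it expects:

```bash
VLLM_OMNI_QUALITY_CONFIGS=/path/to/quality_configs.json \
    pytest tests/diffusion/quantization/test_quantization_quality.py
```

For the LPIPS number in note 3, a standalone sketch using the third-party
`lpips` package (not part of the vLLM-Omni test suite; file paths are
illustrative):

```python
# Compare a BF16 baseline image against the ModelOpt FP8 output with LPIPS.
import lpips
import numpy as np
import torch
from PIL import Image


def load_image(path: str) -> torch.Tensor:
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    return torch.from_numpy(img / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0)


loss_fn = lpips.LPIPS(net="alex")
with torch.no_grad():
    score = loss_fn(
        load_image("outputs/bf16_baseline.png"),
        load_image("outputs/modelopt_fp8.png"),
    )
print(f"LPIPS: {score.item():.4f}")
```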
24 changes: 13 additions & 11 deletions docs/user_guide/quantization/overview.md
@@ -10,18 +10,18 @@ type has a different quantization scope.
| Mode | Guide | Description | Methods |
|------|-------|-------------|---------|
| Online quantization | [Online Quantization](online.md) | vLLM-Omni computes quantized weights and scales while loading the model. | FP8 W8A8, Int8 W8A8 |
| Pre-quantized checkpoints | Method-specific guides | The checkpoint or an offline quantizer provides quantized weights and scales before serving. | GGUF, AutoRound, msModelSlim, serialized Int8 |
| Pre-quantized checkpoints | Method-specific guides | The checkpoint or an offline quantizer provides quantized weights and scales before serving. | ModelOpt, GGUF, AutoRound, msModelSlim, serialized Int8 |

## Hardware Support

| Device | FP8 W8A8 | Int8 W8A8 | GGUF | AutoRound | msModelSlim |
|--------|----------|-----------|------|-----------|-------------|
| NVIDIA Blackwell GPU (SM 100+) | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ampere GPU (SM 80+) | ✅ | ✅ | ✅ | ✅ | ❌ |
| AMD ROCm | ⭕ | ⭕ | ⭕ | ⭕ | ❌ |
| Intel XPU | ⭕ | ⭕ | ⭕ | ✅ | ❌ |
| Ascend NPU | ❌ | ✅ | ❌ | ❌ | ✅ |
| Device | FP8 W8A8 | Int8 W8A8 | ModelOpt | GGUF | AutoRound | msModelSlim |
|--------|----------|-----------|----------|------|-----------|-------------|
| NVIDIA Blackwell GPU (SM 100+) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ada/Hopper GPU (SM 89+) | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| NVIDIA Ampere GPU (SM 80+) | ✅ | ✅ | ⭕ | ✅ | ✅ | ❌ |
| AMD ROCm | ⭕ | ⭕ | ⭕ | ⭕ | ⭕ | ❌ |
| Intel XPU | ⭕ | ⭕ | ⭕ | ⭕ | ✅ | ❌ |
| Ascend NPU | ❌ | ✅ | ❌ | ❌ | ❌ | ✅ |

Legend: `✅` supported, `❌` unsupported, `⭕` not verified in this
guide. FP8 on Ampere may use a weight-only path where available.
@@ -39,6 +39,7 @@ otherwise.
|--------|-------|------|----------------|--------|
| FP8 W8A8 | [FP8](fp8.md) | Online W8A8 or checkpoint FP8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image family and other DiT models |
| Int8 W8A8 | [Int8](int8.md) | Online or serialized W8A8 | Qwen-Image; Wan2.2 is not validated | Validated for Qwen-Image and Z-Image |
| ModelOpt | [ModelOpt](modelopt.md) | Pre-quantized FP8 checkpoints | Qwen-Image, Z-Image, FLUX.2, HunyuanImage-3.0 | Validated for ModelOpt FP8 diffusion checkpoints |
| GGUF | [GGUF](gguf.md) | Pre-quantized transformer weights | Qwen-Image | Validated where a model-specific GGUF adapter exists |
| AutoRound | [AutoRound](autoround.md) | Pre-quantized W4A16 checkpoints | FLUX.1-dev; Qwen-Image/Wan2.2 not validated | Checkpoint-driven |
| msModelSlim | [msModelSlim](msmodelslim.md) | Pre-quantized Ascend checkpoints | Wan2.2 recipe; HunyuanImage-3.0 inference target | Ascend/NPU path |
@@ -52,7 +53,7 @@ in BF16 unless the model guide explicitly adds support.

| Method | Guide | Scope | Example models | Status |
|--------|-------|-------|----------------|--------|
| FP8 | [FP8](fp8.md) | Thinker or language-model checkpoint config | Qwen3-Omni thinker | ModelOpt checkpoint path |
| ModelOpt | [ModelOpt](modelopt.md) | Thinker or language-model checkpoint config | Qwen3-Omni thinker | ModelOpt checkpoint path |
| Int8 | [Int8](int8.md) | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| GGUF | [GGUF](gguf.md) | Not currently validated for omni/TTS stages | Qwen3-Omni, Qwen3-TTS | Not validated |
| AutoRound | [AutoRound](autoround.md) | Thinker or language-model checkpoint config | Qwen2.5-Omni, Qwen3-Omni | Supported through AutoRound checkpoints |
@@ -67,6 +68,7 @@ attached to the intended stage rather than applied globally.
|--------|-------|-------|----------------|--------|
| FP8 | [FP8](fp8.md) | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Requires model-specific validation |
| Int8 | [Int8](int8.md) | Stage-specific DiT or transformer module | BAGEL, GLM-Image | Requires model-specific validation |
| ModelOpt | [ModelOpt](modelopt.md) | Checkpoint-defined diffusion stage | BAGEL, GLM-Image | Requires model-specific validation |
| GGUF | [GGUF](gguf.md) | Stage-specific transformer weights | BAGEL, GLM-Image | No validated adapter listed |
| AutoRound | [AutoRound](autoround.md) | Checkpoint-defined stage | BAGEL, GLM-Image | No validated checkpoint listed |
| msModelSlim | [msModelSlim](msmodelslim.md) | Ascend-generated stage weights | GLM-Image | Requires model-specific adaptation |
@@ -94,7 +96,7 @@ config = build_quant_config({

| Component | Default quantized? | Notes |
|-----------|--------------------|-------|
| Diffusion transformer | Yes | Primary target for FP8, Int8, GGUF, AutoRound, and msModelSlim |
| Diffusion transformer | Yes | Primary target for FP8, Int8, ModelOpt, GGUF, AutoRound, and msModelSlim |
| Text encoder | No | Keep BF16 unless a method-specific guide documents support |
| VAE | No | Keep BF16; storage-only paths are method-specific |
| Scheduler/tokenizer | No | Loaded from the base model repository |
94 changes: 94 additions & 0 deletions tests/diffusion/model_loader/test_modelopt_fp8_adapter.py
@@ -0,0 +1,94 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

from types import SimpleNamespace

import pytest
import torch
import torch.nn as nn

from vllm_omni.diffusion.model_loader.checkpoint_adapters import (
    ModelOptFp8CheckpointAdapter,
)

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu]


class _PackedModelOptModel(nn.Module):
    # Target model whose packed to_qkv projection stays in full precision.
    def __init__(self) -> None:
        super().__init__()
        self.transformer = nn.Module()
        self.transformer.block = nn.Module()
        self.transformer.block.to_qkv = nn.Linear(2, 2, bias=False)


class _QuantizedPackedModelOptModel(nn.Module):
    # Target model whose packed to_qkv projection keeps FP8 weights and scales.
    def __init__(self) -> None:
        super().__init__()
        self.transformer = nn.Module()
        self.transformer.block = nn.Module()
        self.transformer.block.to_qkv = nn.Module()
        self.transformer.block.to_qkv.register_parameter(
            "weight",
            nn.Parameter(torch.empty(2, 2, dtype=torch.float8_e4m3fn), requires_grad=False),
        )
        self.transformer.block.to_qkv.register_parameter(
            "weight_scale",
            nn.Parameter(torch.empty(1), requires_grad=False),
        )
        self.transformer.block.to_qkv.register_parameter(
            "input_scale",
            nn.Parameter(torch.empty(1), requires_grad=False),
        )


def _make_source() -> SimpleNamespace:
    return SimpleNamespace(
        subfolder="transformer",
        prefix="transformer.",
    )


def test_modelopt_adapter_dequantizes_fp8_weight_for_full_precision_target():
    model = _PackedModelOptModel()
    adapter = ModelOptFp8CheckpointAdapter(model, _make_source())
    fp8_weight = torch.tensor([[2.0, -4.0], [1.0, 3.0]], dtype=torch.float32).to(torch.float8_e4m3fn)
    scale = torch.tensor([0.5], dtype=torch.float32)

    adapted = list(
        adapter.adapt(
            iter(
                [
                    ("transformer.block.to_q.weight_scale", scale),
                    ("transformer.block.to_q.input_scale", torch.tensor([1.0])),
                    ("transformer.block.to_q.weight", fp8_weight),
                ]
            )
        )
    )

    # The scale tensors are consumed; only the dequantized weight is emitted,
    # computed as weight * weight_scale and cast to the target module's dtype.
    assert [name for name, _ in adapted] == ["transformer.block.to_q.weight"]
    assert adapted[0][1].dtype == model.transformer.block.to_qkv.weight.dtype
    assert torch.allclose(adapted[0][1], fp8_weight.to(torch.float32) * scale)


def test_modelopt_adapter_keeps_scale_tensors_for_quantized_target():
    model = _QuantizedPackedModelOptModel()
    adapter = ModelOptFp8CheckpointAdapter(model, _make_source())
    scale = torch.tensor([0.5], dtype=torch.float32)

    adapted = list(
        adapter.adapt(
            iter(
                [
                    ("transformer.block.to_q.weight_scale", scale),
                    ("transformer.block.to_q.input_scale", torch.tensor([1.0])),
                ]
            )
        )
    )

    # The quantized target consumes FP8 weights directly, so the adapter
    # passes the scale tensors through unchanged.
    assert [name for name, _ in adapted] == [
        "transformer.block.to_q.weight_scale",
        "transformer.block.to_q.input_scale",
    ]
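
Both adapter tests carry the `cpu` mark via `pytestmark`, so a quick GPU-free
run should be possible, assuming the repository registers its pytest markers
in the usual way:

```bash
pytest tests/diffusion/model_loader/test_modelopt_fp8_adapter.py -m cpu -v
```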