Merged
73 commits
df1a0bc
cherry pick 1034
david6666666 Feb 9, 2026
b880190
support gguf fp8 1
david6666666 Feb 10, 2026
2a208d8
support gguf fp8 2
david6666666 Feb 10, 2026
d81c2a9
support gguf fp8 3
david6666666 Feb 10, 2026
cdd0dfb
support gguf fp8 4
david6666666 Feb 10, 2026
887d1c0
support gguf fp8 5
david6666666 Feb 10, 2026
4d77a92
support gguf fp8 add design doc
david6666666 Feb 11, 2026
2b35f3b
support gguf fp8 add design doc 2
david6666666 Feb 11, 2026
f5c6900
patch
Isotr0py Feb 11, 2026
6fc0f8c
support gguf fp8 6
david6666666 Feb 11, 2026
769a5fb
support gguf fp8 7
david6666666 Feb 11, 2026
ce19e1b
support gguf fp8 8
david6666666 Feb 11, 2026
f535ca1
support gguf fp8 9
david6666666 Feb 11, 2026
11ca22f
support gguf fp8 10
david6666666 Feb 11, 2026
b2916b1
support gguf fp8 11
david6666666 Feb 11, 2026
929f1f7
support gguf fp8 12
david6666666 Feb 11, 2026
d958047
Merge remote-tracking branch 'origin/gguf' into gguf_fp8
Isotr0py Feb 11, 2026
a4cefac
support gguf fp8 add design doc 3
david6666666 Feb 11, 2026
c1af5f1
Merge branch 'main' into gguf_fp8
david6666666 Feb 12, 2026
f39760b
support gguf fp8 add design doc 4
david6666666 Feb 12, 2026
f63d509
support gguf fp8 add design doc 5
david6666666 Feb 12, 2026
ceb8c11
support gguf fp8 add qwen-image
david6666666 Feb 12, 2026
c5b1e75
support gguf fp8 add z-image
david6666666 Feb 12, 2026
599bafb
support gguf only
david6666666 Feb 12, 2026
9f84387
support gguf 1
david6666666 Feb 12, 2026
68e5345
support gguf fp8 add qwen-image
david6666666 Feb 12, 2026
3795dc5
support gguf fp8 add qwen-image 2
david6666666 Feb 12, 2026
2d7a409
support gguf fp8 add note
david6666666 Feb 12, 2026
580a18e
support gguf fp8 fix 1
david6666666 Feb 12, 2026
bcc91f6
support gguf fp8 fix z-image
david6666666 Feb 12, 2026
5dcf32d
support gguf fp8 fix z-image 2
david6666666 Feb 12, 2026
172dcf2
support gguf fp8 fix z-image 3
david6666666 Feb 12, 2026
ae08249
fix pre-commit
david6666666 Feb 12, 2026
03217e0
simple doc
david6666666 Feb 12, 2026
d3ab484
fix pre-commit
david6666666 Feb 12, 2026
43dc33a
fix pre-commit
david6666666 Feb 12, 2026
fb43b4f
fix comment 1
david6666666 Feb 13, 2026
7c7bf79
Merge branch 'main' into gguf_fp8
david6666666 Feb 13, 2026
45fac53
fix pre-commit
david6666666 Feb 13, 2026
e5a70d4
fix comment2
david6666666 Feb 13, 2026
5906200
fix bug
david6666666 Feb 13, 2026
0e53b23
Merge branch 'gguf_fp8' of https://github.com/david6666666/vllm-omni …
david6666666 Feb 13, 2026
20307f6
fix comment 2
david6666666 Feb 13, 2026
b35870e
fix pre-commit
david6666666 Feb 13, 2026
c9fceb9
add doc
david6666666 Feb 13, 2026
d1550cf
draft
Isotr0py Feb 18, 2026
9ab7810
Merge branch 'vllm-project:main' into gguf-fp8-draft1
Isotr0py Feb 20, 2026
02f9c1b
update
Isotr0py Feb 20, 2026
b1091fb
update
Isotr0py Feb 20, 2026
a4e0ffe
clean
Isotr0py Feb 20, 2026
1d9f36f
Merge remote-tracking branch 'upstream/main' into gguf_fp8
Isotr0py Feb 20, 2026
a4c6336
Merge branch 'gguf-fp8-flux2-cleanup' into gguf_fp8
Isotr0py Feb 20, 2026
bd2cce6
draft
Isotr0py Feb 20, 2026
8f7edd6
fix
Isotr0py Feb 21, 2026
16b2dd8
fix pre-commit
david6666666 Feb 25, 2026
de8c3a4
Merge branch 'main' into gguf_fp8
david6666666 Feb 25, 2026
b0808f8
fix pre-commit
david6666666 Feb 25, 2026
f684cd6
fix pre-commit
david6666666 Feb 25, 2026
42a2905
fix comment 1
david6666666 Feb 25, 2026
b4c7f6d
remove qwen-image
david6666666 Feb 25, 2026
9c0a8f0
remove qwen-image
david6666666 Feb 25, 2026
3f08d84
remove qwen-image
david6666666 Feb 25, 2026
5205599
fix flux2
david6666666 Feb 26, 2026
740aaab
fix ci
david6666666 Feb 26, 2026
11de53e
fix comment 1
david6666666 Feb 26, 2026
3b281e8
fix comment 2
david6666666 Feb 26, 2026
69a010d
Merge branch 'main' into gguf_fp8
david6666666 Feb 26, 2026
ea69176
fix ci
david6666666 Feb 26, 2026
e996f82
fix ci 2
david6666666 Feb 26, 2026
0b6338f
Merge branch 'main' into gguf_fp8
hsliuustc0106 Feb 26, 2026
fe9457c
fix pre-commit
david6666666 Feb 27, 2026
ad543fe
fix wrong merge
david6666666 Feb 27, 2026
ecb442c
Merge branch 'main' into gguf_fp8
david6666666 Feb 27, 2026
1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -46,6 +46,7 @@ nav:
- Quantization:
- Overview: user_guide/diffusion/quantization/overview.md
- FP8: user_guide/diffusion/quantization/fp8.md
- GGUF: user_guide/diffusion/quantization/gguf.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- LoRA: user_guide/diffusion/lora.md
1 change: 1 addition & 0 deletions docs/mkdocs/hooks/generate_argparse.py
@@ -124,6 +124,7 @@ def add_parser(self, name, **kwargs):
"logger": logger,
"DummySubparsers": DummySubparsers,
"argparse": __import__("argparse"),
"json": __import__("json"),
"DESCRIPTION": DESCRIPTION,
}
exec(code, exec_globals, local_vars)
185 changes: 185 additions & 0 deletions docs/user_guide/diffusion/quantization/gguf.md
@@ -0,0 +1,185 @@
# GGUF Quantization

## Goals
1. Reuse vLLM quantization configs and weight loaders as much as possible.
2. Add native GGUF support to diffusion transformers without changing model definitions.
3. Keep user-facing knobs minimal and consistent across offline and online flows.

## Scope
1. Models: Z-Image and Flux2-Klein.
2. Components: diffusion transformer weights, loader paths, and quantization configs.
3. Modes: native GGUF (transformer-only weights).

## Architecture Overview
1. `OmniDiffusionConfig` accepts `quantization` or `quantization_config` (a configuration sketch follows this list).
2. The diffusion quantization wrapper (`DiffusionGgufConfig`) produces vLLM `QuantizationConfig` objects for linear layers.
3. `DiffusersPipelineLoader` branches on quantization method and loads either HF weights or GGUF weights for the transformer.
4. GGUF transformer loading is routed through model-specific adapters (e.g., Flux2Klein).
5. vLLM GGUF path uses `GGUFConfig` and `GGUFLinearMethod` for matmul.
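
As a rough illustration of how these pieces are wired together from the user side, the snippet below builds the same `quantization_config` dictionary that the offline example script forwards into `OmniDiffusionConfig`. The `Omni(...)` call at the end is only an assumption about the entrypoint signature; see `examples/offline_inference/text_to_image/text_to_image.py` for the real invocation.

```python
# Minimal sketch: GGUF quantization settings expressed as a plain dict before
# they reach OmniDiffusionConfig. The Omni(...) call is illustrative only.
from typing import Any

quant_kwargs: dict[str, Any] = {
    "quantization_config": {
        "method": "gguf",
        # Accepts a local file, "repo/file.gguf", or "repo:quant_type".
        "gguf_model": "/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf",
    }
}

# Hypothetical usage (argument names are assumptions):
# omni = Omni(model="black-forest-labs/FLUX.2-klein-4B", **quant_kwargs)
```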

## Call Chain (Offline)
```
CLI (examples/offline_inference/text_to_image/text_to_image.py)
|
v
Omni (vllm_omni/entrypoints/omni.py)
|
v
OmniStage (diffusion)
|
v
DiffusionWorker
|
v
DiffusionModelRunner
|
v
DiffusersPipelineLoader
|
v
Pipeline.forward (Flux2/Qwen/Z-Image)
|
v
DiffusionEngine
|
v
OmniRequestOutput
|
v
Client (saved PNG)
```

## Call Chain (Online)
```
Client
|
| POST /v1/images/generations
v
APIServer (vllm_omni/entrypoints/openai/api_server.py)
|
v
_generate_with_async_omni
|
v
AsyncOmni
|
v
DiffusionEngine
|
v
OmniRequestOutput
|
v
encode_image_base64
|
v
ImageGenerationResponse
|
v
Client
```

## Call Chain (GGUF Operator Path)
```
Pipeline.forward (Flux2/Qwen/Z-Image)
|
v
Transformer blocks
|
v
QKVParallelLinear / ColumnParallelLinear / RowParallelLinear
|
v
LinearBase.forward
|
v
QuantMethod.apply (GGUFLinearMethod.apply)
|
v
fused_mul_mat_gguf
|
v
_fused_mul_mat_gguf (custom op)
|
v
ops.ggml_dequantize
|
v
x @ weight.T
```
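
A minimal sketch of what the tail of this path computes on the simple fallback (dequantize, then matmul): `dequantize` stands in for `ops.ggml_dequantize`, and fused kernels such as `fused_mul_mat_gguf` produce the same result without materializing the full-precision weight.

```python
# Simplified sketch of the GGUF linear fallback path: dequantize the
# GGML-packed weight, then compute x @ W^T.
import torch


def gguf_linear_apply(x: torch.Tensor, packed_weight: torch.Tensor, dequantize) -> torch.Tensor:
    # dequantize: callable mapping packed GGML blocks -> [out_features, in_features] tensor
    weight = dequantize(packed_weight).to(dtype=x.dtype)
    return x @ weight.T
```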

## GGUF Weight Loading Path (Transformer-Only)
1. `DiffusersPipelineLoader.load_model` detects `quantization_config.method == "gguf"`.
2. `gguf_model` is resolved as one of: a local file, `repo/file.gguf`, or `repo:quant_type` (a resolution sketch follows this list).
3. GGUF weights are routed through adapters in `vllm_omni/diffusion/model_loader/gguf_adapters/`.
4. Name mapping is applied per architecture (Z-Image, Flux2-Klein).
5. GGUF weights are loaded into the transformer modules; the remaining non-transformer weights come from the HF checkpoint.
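
The reference forms in step 2 can be told apart with simple string handling. The helper below is only a sketch of that classification; the real loader additionally downloads the file from the Hub for the repo-based forms.

```python
# Sketch of how a gguf_model reference could be classified. The actual loader
# also resolves Hub references to a local .gguf file before loading.
from pathlib import Path


def classify_gguf_reference(ref: str) -> tuple[str, str]:
    if Path(ref).is_file():
        return ("local_file", ref)    # e.g. /models/flux-2-klein-4b-Q8_0.gguf
    if ref.endswith(".gguf"):
        return ("repo_file", ref)     # e.g. unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf
    if ":" in ref:
        repo, quant_type = ref.rsplit(":", 1)
        return ("repo_quant_type", f"{repo} @ {quant_type}")  # e.g. unsloth/FLUX.2-klein-4B-GGUF:Q8_0
    raise ValueError(f"Unrecognized gguf_model reference: {ref!r}")
```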

## GGUF Adapter Design
1. `GGUFAdapter` is the abstract base class for model-specific adapters (a condensed sketch of the contract follows the path list below).
2. `Flux2KleinGGUFAdapter` implements Flux2-Klein name remapping, the qkv split, and the adaLN swap.
3. `ZImageGGUFAdapter` implements Z-Image qkv/ffn shard handling and linear qweight routing.
4. `get_gguf_adapter(...)` selects strictly by model class/config; unsupported models raise an error (no fallback adapter).

Adapter paths:
- Base: `vllm_omni/diffusion/model_loader/gguf_adapters/base.py`
- Z-Image: `vllm_omni/diffusion/model_loader/gguf_adapters/z_image.py`
- Flux2-Klein: `vllm_omni/diffusion/model_loader/gguf_adapters/flux2_klein.py`
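
Taken together, the contract is: translate GGUF tensor names into transformer parameter names, split fused tensors where the target modules expect separate shards, and select adapters strictly per model. The class below is a condensed sketch with assumed method names, not the actual ABC in `base.py`.

```python
# Condensed sketch of the adapter contract used for GGUF weight loading.
# Method names and signatures are assumptions; see
# vllm_omni/diffusion/model_loader/gguf_adapters/base.py for the real ABC.
from abc import ABC, abstractmethod
from collections.abc import Iterable

import torch


class GGUFAdapterSketch(ABC):
    @abstractmethod
    def map_name(self, gguf_name: str) -> str | None:
        """Translate a GGUF tensor name into a transformer parameter name
        (return None for tensors this architecture does not consume)."""

    @abstractmethod
    def split_fused(
        self, name: str, tensor: torch.Tensor
    ) -> Iterable[tuple[str, torch.Tensor]]:
        """Split fused tensors (e.g. a packed qkv weight) into the per-shard
        weights that the parallel linear layers expect to load."""


# Strict selection with no fallback, mirroring get_gguf_adapter's behavior.
_REGISTRY: dict[str, type[GGUFAdapterSketch]] = {}  # populated per supported model


def select_adapter(model_class_name: str) -> GGUFAdapterSketch:
    if model_class_name not in _REGISTRY:
        raise ValueError(f"No GGUF adapter registered for {model_class_name!r}")
    return _REGISTRY[model_class_name]()
```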

## User Usage (Offline)

### Baseline BF16
```bash
python examples/offline_inference/text_to_image/text_to_image.py \
--model /workspace/models/black-forest-labs/FLUX.2-klein-4B \
--prompt "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture" \
--height 768 \
--width 1360 \
--seed 42 \
--cfg_scale 4.0 \
--num_images_per_prompt 1 \
--num_inference_steps 4 \
--output outputs/flux2_klein_4b.png
```

### Native GGUF (Transformer Only)
```bash
python examples/offline_inference/text_to_image/text_to_image.py \
--model /workspace/models/black-forest-labs/FLUX.2-klein-4B \
--gguf-model "/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf" \
--quantization gguf \
--prompt "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture" \
--height 768 \
--width 1360 \
--seed 42 \
--cfg_scale 4.0 \
--num_images_per_prompt 1 \
--num_inference_steps 4 \
--output outputs/flux2_klein_4b_gguf.png
```

Notes for GGUF:
1. Many GGUF repos do not ship `model_index.json` or the component configs. Use the base repo for `--model` and pass only the GGUF file via `--gguf-model`.
2. `--gguf-model` accepts a local path, `repo/file.gguf`, or `repo:quant_type`.

## User Usage (Online)

### Start Server (Native GGUF via CLI)
```bash
vllm serve /workspace/models/black-forest-labs/FLUX.2-klein-4B \
--omni \
--port 8000 \
--quantization-config '{"method":"gguf","gguf_model":"/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf"}'
```

### Online Request (Images API)
```bash
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "a dragon laying over the spine of the Green Mountains of Vermont",
"size": "1024x1024",
"seed": 42,
"num_inference_steps": 4
}'
```
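
The same request can be issued from a scripted client. The response layout assumed below (`data[0]["b64_json"]`) follows the OpenAI Images API convention; check the server's actual `ImageGenerationResponse` schema before relying on it.

```python
# Minimal online client sketch for the images endpoint. The response field
# names are an assumption based on the OpenAI Images API convention.
import base64
import json
from pathlib import Path
from urllib.request import Request, urlopen

payload = {
    "prompt": "a dragon laying over the spine of the Green Mountains of Vermont",
    "size": "1024x1024",
    "seed": 42,
    "num_inference_steps": 4,
}
req = Request(
    "http://localhost:8000/v1/images/generations",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    body = json.load(resp)

Path("outputs").mkdir(exist_ok=True)
Path("outputs/flux2_klein_4b_online.png").write_bytes(
    base64.b64decode(body["data"][0]["b64_json"])
)
```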
1 change: 1 addition & 0 deletions docs/user_guide/diffusion/quantization/overview.md
@@ -7,6 +7,7 @@ vLLM-Omni supports quantization of DiT linear layers to reduce memory usage and
| Method | Guide |
|--------|-------|
| FP8 | [FP8](fp8.md) |
| GGUF | [GGUF](gguf.md) |

## Device Compatibility

25 changes: 20 additions & 5 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -132,10 +132,18 @@ def parse_args() -> argparse.Namespace:
"--quantization",
type=str,
default=None,
choices=["fp8"],
help="Quantization method for the transformer. "
"Options: 'fp8' (FP8 W8A8 on Ada/Hopper, weight-only on older GPUs). "
"Default: None (no quantization, uses BF16).",
choices=["fp8", "gguf"],
help=(
"Quantization method for the transformer. "
"Options: 'fp8' (FP8 W8A8), 'gguf' (GGUF quantized weights). "
"Default: None (no quantization, uses BF16)."
),
)
parser.add_argument(
"--gguf-model",
type=str,
default=None,
help=("GGUF file path or HF reference for transformer weights. Required when --quantization gguf is set."),
)
parser.add_argument(
"--ignored-layers",
@@ -265,7 +273,14 @@ def main():
# ignored_layers is specified so the list flows through OmniDiffusionConfig
quant_kwargs: dict[str, Any] = {}
ignored_layers = [s.strip() for s in args.ignored_layers.split(",") if s.strip()] if args.ignored_layers else None
if args.quantization and ignored_layers:
if args.quantization == "gguf":
if not args.gguf_model:
raise ValueError("--gguf-model is required when --quantization gguf is set.")
quant_kwargs["quantization_config"] = {
"method": "gguf",
"gguf_model": args.gguf_model,
}
elif args.quantization and ignored_layers:
quant_kwargs["quantization_config"] = {
"method": args.quantization,
"ignored_layers": ignored_layers,
73 changes: 73 additions & 0 deletions tests/diffusion/test_diffusers_loader.py
@@ -0,0 +1,73 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import pytest
import torch
import torch.nn as nn

from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu]


class _DummyPipelineModel(nn.Module):
def __init__(self, *, source_prefix: str):
super().__init__()
self.transformer = nn.Linear(2, 2, bias=False)
self.vae = nn.Linear(2, 2, bias=False)
self.weights_sources = [
DiffusersPipelineLoader.ComponentSource(
model_or_path="dummy",
subfolder="transformer",
revision=None,
prefix=source_prefix,
fall_back_to_pt=True,
)
]

def load_weights(self, weights):
params = dict(self.named_parameters())
loaded: set[str] = set()
for name, tensor in weights:
if name not in params:
continue
params[name].data.copy_(tensor.to(dtype=params[name].dtype))
loaded.add(name)
return loaded


def _make_loader_with_weights(weight_names: list[str]) -> DiffusersPipelineLoader:
loader = object.__new__(DiffusersPipelineLoader)
loader.counter_before_loading_weights = 0.0
loader.counter_after_loading_weights = 0.0

def _iter_weights(_model):
for name in weight_names:
yield name, torch.zeros((2, 2))

loader.get_all_weights = _iter_weights # type: ignore[assignment]
return loader


def test_strict_check_only_validates_source_prefix_parameters():
model = _DummyPipelineModel(source_prefix="transformer.")
loader = _make_loader_with_weights(["transformer.weight"])

# Should not require VAE parameters because they are outside weights_sources.
loader.load_weights(model)


def test_strict_check_raises_when_source_parameters_are_missing():
model = _DummyPipelineModel(source_prefix="transformer.")
loader = _make_loader_with_weights([])

with pytest.raises(ValueError, match="transformer.weight"):
loader.load_weights(model)


def test_empty_source_prefix_keeps_full_model_strict_check():
model = _DummyPipelineModel(source_prefix="")
loader = _make_loader_with_weights(["transformer.weight"])

with pytest.raises(ValueError, match="vae.weight"):
loader.load_weights(model)
14 changes: 12 additions & 2 deletions vllm_omni/diffusion/data.py
@@ -6,7 +6,7 @@
import random
from collections.abc import Callable, Mapping
from dataclasses import dataclass, field, fields
from typing import Any
from typing import TYPE_CHECKING, Any

import torch
from pydantic import model_validator
@@ -20,6 +20,12 @@
)
from vllm_omni.diffusion.utils.network_utils import is_port_available

if TYPE_CHECKING:
from vllm_omni.diffusion.quantization import DiffusionQuantizationConfig

# Import after TYPE_CHECKING to avoid circular imports at runtime
# The actual import is deferred to __post_init__ to avoid import order issues

logger = init_logger(__name__)


@@ -527,8 +533,12 @@ def __post_init__(self):
# If it's neither dict nor DiffusionCacheConfig, convert to empty config
self.cache_config = DiffusionCacheConfig()

# Convert quantization config
# Convert quantization config (deferred import to avoid circular imports)
if self.quantization is not None or self.quantization_config is not None:
from vllm_omni.diffusion.quantization import (
DiffusionQuantizationConfig,
)

# Handle dict or DictConfig (from OmegaConf) - use Mapping for broader compatibility
if isinstance(self.quantization_config, Mapping):
# Convert DictConfig to dict if needed (OmegaConf compatibility)