Merged
73 commits
df1a0bc
cherry pick 1034
david6666666 Feb 9, 2026
b880190
support gguf fp8 1
david6666666 Feb 10, 2026
2a208d8
support gguf fp8 2
david6666666 Feb 10, 2026
d81c2a9
support gguf fp8 3
david6666666 Feb 10, 2026
cdd0dfb
support gguf fp8 4
david6666666 Feb 10, 2026
887d1c0
support gguf fp8 5
david6666666 Feb 10, 2026
4d77a92
support gguf fp8 add design doc
david6666666 Feb 11, 2026
2b35f3b
support gguf fp8 add design doc 2
david6666666 Feb 11, 2026
f5c6900
patch
Isotr0py Feb 11, 2026
6fc0f8c
support gguf fp8 6
david6666666 Feb 11, 2026
769a5fb
support gguf fp8 7
david6666666 Feb 11, 2026
ce19e1b
support gguf fp8 8
david6666666 Feb 11, 2026
f535ca1
support gguf fp8 9
david6666666 Feb 11, 2026
11ca22f
support gguf fp8 10
david6666666 Feb 11, 2026
b2916b1
support gguf fp8 11
david6666666 Feb 11, 2026
929f1f7
support gguf fp8 12
david6666666 Feb 11, 2026
d958047
Merge remote-tracking branch 'origin/gguf' into gguf_fp8
Isotr0py Feb 11, 2026
a4cefac
support gguf fp8 add design doc 3
david6666666 Feb 11, 2026
c1af5f1
Merge branch 'main' into gguf_fp8
david6666666 Feb 12, 2026
f39760b
support gguf fp8 add design doc 4
david6666666 Feb 12, 2026
f63d509
support gguf fp8 add design doc 5
david6666666 Feb 12, 2026
ceb8c11
support gguf fp8 add qwen-image
david6666666 Feb 12, 2026
c5b1e75
support gguf fp8 add z-image
david6666666 Feb 12, 2026
599bafb
support gguf only
david6666666 Feb 12, 2026
9f84387
support gguf 1
david6666666 Feb 12, 2026
68e5345
support gguf fp8 add qwen-image
david6666666 Feb 12, 2026
3795dc5
support gguf fp8 add qwen-image 2
david6666666 Feb 12, 2026
2d7a409
support gguf fp8 add note
david6666666 Feb 12, 2026
580a18e
support gguf fp8 fix 1
david6666666 Feb 12, 2026
bcc91f6
support gguf fp8 fix z-image
david6666666 Feb 12, 2026
5dcf32d
support gguf fp8 fix z-image 2
david6666666 Feb 12, 2026
172dcf2
support gguf fp8 fix z-image 3
david6666666 Feb 12, 2026
ae08249
fix pre-commit
david6666666 Feb 12, 2026
03217e0
simple doc
david6666666 Feb 12, 2026
d3ab484
fix pre-commit
david6666666 Feb 12, 2026
43dc33a
fix pre-commit
david6666666 Feb 12, 2026
fb43b4f
fix comment 1
david6666666 Feb 13, 2026
7c7bf79
Merge branch 'main' into gguf_fp8
david6666666 Feb 13, 2026
45fac53
fix pre-commit
david6666666 Feb 13, 2026
e5a70d4
fix comment2
david6666666 Feb 13, 2026
5906200
fix bug
david6666666 Feb 13, 2026
0e53b23
Merge branch 'gguf_fp8' of https://github.com/david6666666/vllm-omni …
david6666666 Feb 13, 2026
20307f6
fix comment 2
david6666666 Feb 13, 2026
b35870e
fix pre-commit
david6666666 Feb 13, 2026
c9fceb9
add doc
david6666666 Feb 13, 2026
d1550cf
draft
Isotr0py Feb 18, 2026
9ab7810
Merge branch 'vllm-project:main' into gguf-fp8-draft1
Isotr0py Feb 20, 2026
02f9c1b
update
Isotr0py Feb 20, 2026
b1091fb
update
Isotr0py Feb 20, 2026
a4e0ffe
clean
Isotr0py Feb 20, 2026
1d9f36f
Merge remote-tracking branch 'upstream/main' into gguf_fp8
Isotr0py Feb 20, 2026
a4c6336
Merge branch 'gguf-fp8-flux2-cleanup' into gguf_fp8
Isotr0py Feb 20, 2026
bd2cce6
draft
Isotr0py Feb 20, 2026
8f7edd6
fix
Isotr0py Feb 21, 2026
16b2dd8
fix pre-commit
david6666666 Feb 25, 2026
de8c3a4
Merge branch 'main' into gguf_fp8
david6666666 Feb 25, 2026
b0808f8
fix pre-commit
david6666666 Feb 25, 2026
f684cd6
fix pre-commit
david6666666 Feb 25, 2026
42a2905
fix comment 1
david6666666 Feb 25, 2026
b4c7f6d
remove qwen-image
david6666666 Feb 25, 2026
9c0a8f0
remove qwen-image
david6666666 Feb 25, 2026
3f08d84
remove qwen-image
david6666666 Feb 25, 2026
5205599
fix flux2
david6666666 Feb 26, 2026
740aaab
fix ci
david6666666 Feb 26, 2026
11de53e
fix comment 1
david6666666 Feb 26, 2026
3b281e8
fix comment 2
david6666666 Feb 26, 2026
69a010d
Merge branch 'main' into gguf_fp8
david6666666 Feb 26, 2026
ea69176
fix ci
david6666666 Feb 26, 2026
e996f82
fix ci 2
david6666666 Feb 26, 2026
0b6338f
Merge branch 'main' into gguf_fp8
hsliuustc0106 Feb 26, 2026
fe9457c
fix pre-commit
david6666666 Feb 27, 2026
ad543fe
fix wrong merge
david6666666 Feb 27, 2026
ecb442c
Merge branch 'main' into gguf_fp8
david6666666 Feb 27, 2026
1 change: 1 addition & 0 deletions docs/.nav.yml
@@ -46,6 +46,7 @@ nav:
- Quantization:
- Overview: user_guide/diffusion/quantization/overview.md
- FP8: user_guide/diffusion/quantization/fp8.md
- GGUF: user_guide/diffusion/quantization/gguf.md
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- LoRA: user_guide/diffusion/lora.md
1 change: 1 addition & 0 deletions docs/mkdocs/hooks/generate_argparse.py
@@ -124,6 +124,7 @@ def add_parser(self, name, **kwargs):
"logger": logger,
"DummySubparsers": DummySubparsers,
"argparse": __import__("argparse"),
"json": __import__("json"),
"DESCRIPTION": DESCRIPTION,
}
exec(code, exec_globals, local_vars)
185 changes: 185 additions & 0 deletions docs/user_guide/diffusion/quantization/gguf.md
@@ -0,0 +1,185 @@
# GGUF Quantization

## Goals
1. Reuse vLLM quantization configs and weight loaders as much as possible.
2. Add native GGUF support to diffusion transformers without changing model definitions.
3. Keep user-facing knobs minimal and consistent across offline and online flows.

## Scope
1. Models: Z-Image and Flux2-Klein.
2. Components: diffusion transformer weights, loader paths, and quantization configs.
3. Modes: native GGUF (transformer-only weights).

## Architecture Overview
1. `OmniDiffusionConfig` accepts `quantization` or `quantization_config` (a configuration sketch follows this list).
2. The diffusion quantization wrapper (`DiffusionGgufConfig`) produces vLLM `QuantizationConfig` objects for linear layers.
3. `DiffusersPipelineLoader` branches on quantization method and loads either HF weights or GGUF weights for the transformer.
4. GGUF transformer loading is routed through model-specific adapters (e.g., Flux2Klein).
5. vLLM GGUF path uses `GGUFConfig` and `GGUFLinearMethod` for matmul.
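
As a rough illustration of how these pieces are wired together from the user side, the snippet below builds the same `quantization_config` dictionary that the offline example script forwards into `OmniDiffusionConfig`. The `Omni(...)` call at the end is only an assumption about the entrypoint signature; see `examples/offline_inference/text_to_image/text_to_image.py` for the real invocation.

```python
# Minimal sketch: GGUF quantization settings expressed as a plain dict before
# they reach OmniDiffusionConfig. The Omni(...) call is illustrative only.
from typing import Any

quant_kwargs: dict[str, Any] = {
    "quantization_config": {
        "method": "gguf",
        # Accepts a local file, "repo/file.gguf", or "repo:quant_type".
        "gguf_model": "/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf",
    }
}

# Hypothetical usage (argument names are assumptions):
# omni = Omni(model="black-forest-labs/FLUX.2-klein-4B", **quant_kwargs)
```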

## Call Chain (Offline)
```
CLI (examples/offline_inference/text_to_image/text_to_image.py)
|
v
Omni (vllm_omni/entrypoints/omni.py)
|
v
OmniStage (diffusion)
|
v
DiffusionWorker
|
v
DiffusionModelRunner
|
v
DiffusersPipelineLoader
|
v
Pipeline.forward (Flux2/Qwen/Z-Image)
|
v
DiffusionEngine
|
v
OmniRequestOutput
|
v
Client (saved PNG)
```

## Call Chain (Online)
```
Client
|
| POST /v1/images/generations
v
APIServer (vllm_omni/entrypoints/openai/api_server.py)
|
v
_generate_with_async_omni
|
v
AsyncOmni
|
v
DiffusionEngine
|
v
OmniRequestOutput
|
v
encode_image_base64
|
v
ImageGenerationResponse
|
v
Client
```

## Call Chain (GGUF Operator Path)
```
Pipeline.forward (Flux2/Qwen/Z-Image)
|
v
Transformer blocks
|
v
QKVParallelLinear / ColumnParallelLinear / RowParallelLinear
|
v
LinearBase.forward
|
v
QuantMethod.apply (GGUFLinearMethod.apply)
|
v
fused_mul_mat_gguf
|
v
_fused_mul_mat_gguf (custom op)
|
v
ops.ggml_dequantize
|
v
x @ weight.T
```
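
A minimal sketch of what the tail of this path computes on the simple fallback (dequantize, then matmul): `dequantize` stands in for `ops.ggml_dequantize`, and fused kernels such as `fused_mul_mat_gguf` produce the same result without materializing the full-precision weight.

```python
# Simplified sketch of the GGUF linear fallback path: dequantize the
# GGML-packed weight, then compute x @ W^T.
import torch


def gguf_linear_apply(x: torch.Tensor, packed_weight: torch.Tensor, dequantize) -> torch.Tensor:
    # dequantize: callable mapping packed GGML blocks -> [out_features, in_features] tensor
    weight = dequantize(packed_weight).to(dtype=x.dtype)
    return x @ weight.T
```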

## GGUF Weight Loading Path (Transformer-Only)
1. `DiffusersPipelineLoader.load_model` detects `quantization_config.method == "gguf"`.
2. `gguf_model` is resolved as one of: a local file, `repo/file.gguf`, or `repo:quant_type` (a resolution sketch follows this list).
3. GGUF weights are routed through adapters in `vllm_omni/diffusion/model_loader/gguf_adapters/`.
4. Name mapping is applied per architecture (Z-Image, Flux2-Klein).
5. GGUF weights are loaded into the transformer modules; the remaining non-transformer weights come from the HF checkpoint.
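
The reference forms in step 2 can be told apart with simple string handling. The helper below is only a sketch of that classification; the real loader additionally downloads the file from the Hub for the repo-based forms.

```python
# Sketch of how a gguf_model reference could be classified. The actual loader
# also resolves Hub references to a local .gguf file before loading.
from pathlib import Path


def classify_gguf_reference(ref: str) -> tuple[str, str]:
    if Path(ref).is_file():
        return ("local_file", ref)    # e.g. /models/flux-2-klein-4b-Q8_0.gguf
    if ref.endswith(".gguf"):
        return ("repo_file", ref)     # e.g. unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf
    if ":" in ref:
        repo, quant_type = ref.rsplit(":", 1)
        return ("repo_quant_type", f"{repo} @ {quant_type}")  # e.g. unsloth/FLUX.2-klein-4B-GGUF:Q8_0
    raise ValueError(f"Unrecognized gguf_model reference: {ref!r}")
```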

## GGUF Adapter Design
1. `GGUFAdapter` is the abstract base class for model-specific adapters (a condensed sketch of the contract follows the path list below).
2. `Flux2KleinGGUFAdapter` implements Flux2-Klein name remapping, the qkv split, and the adaLN swap.
3. `ZImageGGUFAdapter` implements Z-Image qkv/ffn shard handling and linear qweight routing.
4. `get_gguf_adapter(...)` selects strictly by model class/config; unsupported models raise an error (no fallback adapter).

Adapter paths:
- Base: `vllm_omni/diffusion/model_loader/gguf_adapters/base.py`
- Z-Image: `vllm_omni/diffusion/model_loader/gguf_adapters/z_image.py`
- Flux2-Klein: `vllm_omni/diffusion/model_loader/gguf_adapters/flux2_klein.py`
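
Taken together, the contract is: translate GGUF tensor names into transformer parameter names, split fused tensors where the target modules expect separate shards, and select adapters strictly per model. The class below is a condensed sketch with assumed method names, not the actual ABC in `base.py`.

```python
# Condensed sketch of the adapter contract used for GGUF weight loading.
# Method names and signatures are assumptions; see
# vllm_omni/diffusion/model_loader/gguf_adapters/base.py for the real ABC.
from abc import ABC, abstractmethod
from collections.abc import Iterable

import torch


class GGUFAdapterSketch(ABC):
    @abstractmethod
    def map_name(self, gguf_name: str) -> str | None:
        """Translate a GGUF tensor name into a transformer parameter name
        (return None for tensors this architecture does not consume)."""

    @abstractmethod
    def split_fused(
        self, name: str, tensor: torch.Tensor
    ) -> Iterable[tuple[str, torch.Tensor]]:
        """Split fused tensors (e.g. a packed qkv weight) into the per-shard
        weights that the parallel linear layers expect to load."""


# Strict selection with no fallback, mirroring get_gguf_adapter's behavior.
_REGISTRY: dict[str, type[GGUFAdapterSketch]] = {}  # populated per supported model


def select_adapter(model_class_name: str) -> GGUFAdapterSketch:
    if model_class_name not in _REGISTRY:
        raise ValueError(f"No GGUF adapter registered for {model_class_name!r}")
    return _REGISTRY[model_class_name]()
```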

## User Usage (Offline)

### Baseline BF16
```bash
python examples/offline_inference/text_to_image/text_to_image.py \
--model /workspace/models/black-forest-labs/FLUX.2-klein-4B \
--prompt "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture" \
--height 768 \
--width 1360 \
--seed 42 \
--cfg_scale 4.0 \
--num_images_per_prompt 1 \
--num_inference_steps 4 \
--output outputs/flux2_klein_4b.png
```

### Native GGUF (Transformer Only)
```bash
python examples/offline_inference/text_to_image/text_to_image.py \
--model /workspace/models/black-forest-labs/FLUX.2-klein-4B \
--gguf-model "/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf" \
--quantization gguf \
--prompt "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture" \
--height 768 \
--width 1360 \
--seed 42 \
--cfg_scale 4.0 \
--num_images_per_prompt 1 \
--num_inference_steps 4 \
--output outputs/flux2_klein_4b_gguf.png
```

Notes for GGUF:
1. Many GGUF repos do not ship `model_index.json` or the component configs. Use the base repo for `--model` and pass only the GGUF file via `--gguf-model`.
2. `--gguf-model` accepts a local path, `repo/file.gguf`, or `repo:quant_type`.

## User Usage (Online)

### Start Server (Native GGUF via CLI)
```bash
vllm serve /workspace/models/black-forest-labs/FLUX.2-klein-4B \
--omni \
--port 8000 \
--quantization-config '{"method":"gguf","gguf_model":"/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf"}'
```

### Online Request (Images API)
```bash
curl -X POST http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "a dragon laying over the spine of the Green Mountains of Vermont",
"size": "1024x1024",
"seed": 42,
"num_inference_steps": 4
}'
```
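
The same request can be issued from a scripted client. The response layout assumed below (`data[0]["b64_json"]`) follows the OpenAI Images API convention; check the server's actual `ImageGenerationResponse` schema before relying on it.

```python
# Minimal online client sketch for the images endpoint. The response field
# names are an assumption based on the OpenAI Images API convention.
import base64
import json
from pathlib import Path
from urllib.request import Request, urlopen

payload = {
    "prompt": "a dragon laying over the spine of the Green Mountains of Vermont",
    "size": "1024x1024",
    "seed": 42,
    "num_inference_steps": 4,
}
req = Request(
    "http://localhost:8000/v1/images/generations",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urlopen(req) as resp:
    body = json.load(resp)

Path("outputs").mkdir(exist_ok=True)
Path("outputs/flux2_klein_4b_online.png").write_bytes(
    base64.b64decode(body["data"][0]["b64_json"])
)
```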
1 change: 1 addition & 0 deletions docs/user_guide/diffusion/quantization/overview.md
@@ -7,6 +7,7 @@ vLLM-Omni supports quantization of DiT linear layers to reduce memory usage and
| Method | Guide |
|--------|-------|
| FP8 | [FP8](fp8.md) |
| GGUF | [GGUF](gguf.md) |

## Device Compatibility

25 changes: 20 additions & 5 deletions examples/offline_inference/text_to_image/text_to_image.py
@@ -132,10 +132,18 @@ def parse_args() -> argparse.Namespace:
"--quantization",
type=str,
default=None,
choices=["fp8"],
help="Quantization method for the transformer. "
"Options: 'fp8' (FP8 W8A8 on Ada/Hopper, weight-only on older GPUs). "
"Default: None (no quantization, uses BF16).",
choices=["fp8", "gguf"],
help=(
"Quantization method for the transformer. "
"Options: 'fp8' (FP8 W8A8), 'gguf' (GGUF quantized weights). "
"Default: None (no quantization, uses BF16)."
),
)
parser.add_argument(
"--gguf-model",
type=str,
default=None,
help=("GGUF file path or HF reference for transformer weights. Required when --quantization gguf is set."),
)
parser.add_argument(
"--ignored-layers",
@@ -265,7 +273,14 @@ def main():
# ignored_layers is specified so the list flows through OmniDiffusionConfig
quant_kwargs: dict[str, Any] = {}
ignored_layers = [s.strip() for s in args.ignored_layers.split(",") if s.strip()] if args.ignored_layers else None
if args.quantization and ignored_layers:
if args.quantization == "gguf":
if not args.gguf_model:
raise ValueError("--gguf-model is required when --quantization gguf is set.")
quant_kwargs["quantization_config"] = {
"method": "gguf",
"gguf_model": args.gguf_model,
}
elif args.quantization and ignored_layers:
quant_kwargs["quantization_config"] = {
"method": args.quantization,
"ignored_layers": ignored_layers,
73 changes: 73 additions & 0 deletions tests/diffusion/test_diffusers_loader.py
@@ -0,0 +1,73 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import pytest
import torch
import torch.nn as nn

from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu]


class _DummyPipelineModel(nn.Module):
def __init__(self, *, source_prefix: str):
super().__init__()
self.transformer = nn.Linear(2, 2, bias=False)
self.vae = nn.Linear(2, 2, bias=False)
self.weights_sources = [
DiffusersPipelineLoader.ComponentSource(
model_or_path="dummy",
subfolder="transformer",
revision=None,
prefix=source_prefix,
fall_back_to_pt=True,
)
]

def load_weights(self, weights):
params = dict(self.named_parameters())
loaded: set[str] = set()
for name, tensor in weights:
if name not in params:
continue
params[name].data.copy_(tensor.to(dtype=params[name].dtype))
loaded.add(name)
return loaded


def _make_loader_with_weights(weight_names: list[str]) -> DiffusersPipelineLoader:
loader = object.__new__(DiffusersPipelineLoader)
loader.counter_before_loading_weights = 0.0
loader.counter_after_loading_weights = 0.0

def _iter_weights(_model):
for name in weight_names:
yield name, torch.zeros((2, 2))

loader.get_all_weights = _iter_weights # type: ignore[assignment]
return loader


def test_strict_check_only_validates_source_prefix_parameters():
model = _DummyPipelineModel(source_prefix="transformer.")
loader = _make_loader_with_weights(["transformer.weight"])

# Should not require VAE parameters because they are outside weights_sources.
loader.load_weights(model)


def test_strict_check_raises_when_source_parameters_are_missing():
model = _DummyPipelineModel(source_prefix="transformer.")
loader = _make_loader_with_weights([])

with pytest.raises(ValueError, match="transformer.weight"):
loader.load_weights(model)


def test_empty_source_prefix_keeps_full_model_strict_check():
model = _DummyPipelineModel(source_prefix="")
loader = _make_loader_with_weights(["transformer.weight"])

with pytest.raises(ValueError, match="vae.weight"):
loader.load_weights(model)
14 changes: 12 additions & 2 deletions vllm_omni/diffusion/data.py
@@ -6,7 +6,7 @@
import random
from collections.abc import Callable, Mapping
from dataclasses import dataclass, field, fields
from typing import Any
from typing import TYPE_CHECKING, Any

import torch
from pydantic import model_validator
@@ -20,6 +20,12 @@
)
from vllm_omni.diffusion.utils.network_utils import is_port_available

if TYPE_CHECKING:
from vllm_omni.diffusion.quantization import DiffusionQuantizationConfig

# Import after TYPE_CHECKING to avoid circular imports at runtime
# The actual import is deferred to __post_init__ to avoid import order issues

logger = init_logger(__name__)


@@ -527,8 +533,12 @@ def __post_init__(self):
# If it's neither dict nor DiffusionCacheConfig, convert to empty config
self.cache_config = DiffusionCacheConfig()

# Convert quantization config
# Convert quantization config (deferred import to avoid circular imports)
if self.quantization is not None or self.quantization_config is not None:
from vllm_omni.diffusion.quantization import (
DiffusionQuantizationConfig,
)

# Handle dict or DictConfig (from OmegaConf) - use Mapping for broader compatibility
if isinstance(self.quantization_config, Mapping):
# Convert DictConfig to dict if needed (OmegaConf compatibility)