[Feature]: Native GGUF Quantization Support for DiT #1285
Merged
73 commits
df1a0bc
cherry pick 1034
david6666666 b880190
support gguf fp8 1
david6666666 2a208d8
support gguf fp8 2
david6666666 d81c2a9
support gguf fp8 3
david6666666 cdd0dfb
support gguf fp8 4
david6666666 887d1c0
support gguf fp8 5
david6666666 4d77a92
support gguf fp8 add design doc
david6666666 2b35f3b
support gguf fp8 add design doc 2
david6666666 f5c6900
patch
Isotr0py 6fc0f8c
support gguf fp8 6
david6666666 769a5fb
support gguf fp8 7
david6666666 ce19e1b
support gguf fp8 8
david6666666 f535ca1
support gguf fp8 9
david6666666 11ca22f
support gguf fp8 10
david6666666 b2916b1
support gguf fp8 11
david6666666 929f1f7
support gguf fp8 12
david6666666 d958047
Merge remote-tracking branch 'origin/gguf' into gguf_fp8
Isotr0py a4cefac
support gguf fp8 add design doc 3
david6666666 c1af5f1
Merge branch 'main' into gguf_fp8
david6666666 f39760b
support gguf fp8 add design doc 4
david6666666 f63d509
support gguf fp8 add design doc 5
david6666666 ceb8c11
support gguf fp8 add qwen-image
david6666666 c5b1e75
support gguf fp8 add z-image
david6666666 599bafb
support gguf only
david6666666 9f84387
support gguf 1
david6666666 68e5345
support gguf fp8 add qwen-image
david6666666 3795dc5
support gguf fp8 add qwen-image 2
david6666666 2d7a409
support gguf fp8 add note
david6666666 580a18e
support gguf fp8 fix 1
david6666666 bcc91f6
support gguf fp8 fix z-image
david6666666 5dcf32d
support gguf fp8 fix z-image 2
david6666666 172dcf2
support gguf fp8 fix z-image 3
david6666666 ae08249
fix pre-commit
david6666666 03217e0
simple doc
david6666666 d3ab484
fix pre-commit
david6666666 43dc33a
fix pre-commit
david6666666 fb43b4f
fix comment 1
david6666666 7c7bf79
Merge branch 'main' into gguf_fp8
david6666666 45fac53
fix pre-commit
david6666666 e5a70d4
fix comment2
david6666666 5906200
fix bug
david6666666 0e53b23
Merge branch 'gguf_fp8' of https://github.com/david6666666/vllm-omni …
david6666666 20307f6
fix comment 2
david6666666 b35870e
fix pre-commit
david6666666 c9fceb9
add doc
david6666666 d1550cf
draft
Isotr0py 9ab7810
Merge branch 'vllm-project:main' into gguf-fp8-draft1
Isotr0py 02f9c1b
update
Isotr0py b1091fb
update
Isotr0py a4e0ffe
clean
Isotr0py 1d9f36f
Merge remote-tracking branch 'upstream/main' into gguf_fp8
Isotr0py a4c6336
Merge branch 'gguf-fp8-flux2-cleanup' into gguf_fp8
Isotr0py bd2cce6
draft
Isotr0py 8f7edd6
fix
Isotr0py 16b2dd8
fix pre-commit
david6666666 de8c3a4
Merge branch 'main' into gguf_fp8
david6666666 b0808f8
fix pre-commit
david6666666 f684cd6
fix pre-commit
david6666666 42a2905
fix comment 1
david6666666 b4c7f6d
remove qwen-image
david6666666 9c0a8f0
remove qwen-image
david6666666 3f08d84
remove qwen-image
david6666666 5205599
fix flux2
david6666666 740aaab
fix ci
david6666666 11de53e
fix comment 1
david6666666 3b281e8
fix comment 2
david6666666 69a010d
Merge branch 'main' into gguf_fp8
david6666666 ea69176
fix ci
david6666666 e996f82
fix ci 2
david6666666 0b6338f
Merge branch 'main' into gguf_fp8
hsliuustc0106 fe9457c
fix pre-commit
david6666666 ad543fe
fix wrong merge
david6666666 ecb442c
Merge branch 'main' into gguf_fp8
david6666666
# GGUF Quantization

## Goals
1. Reuse vLLM quantization configs and weight loaders as much as possible.
2. Add native GGUF support to diffusion transformers without changing model definitions.
3. Keep user-facing knobs minimal and consistent across offline and online flows.
## Scope
1. Models: Z-Image and Flux2-klein.
2. Components: diffusion transformer weights, loader paths, and quantization configs.
3. Modes: native GGUF (transformer-only weights).
## Architecture Overview
1. `OmniDiffusionConfig` accepts `quantization` or `quantization_config`.
2. The diffusion quantization wrapper (`DiffusionGgufConfig`) produces vLLM `QuantizationConfig` objects for linear layers.
3. `DiffusersPipelineLoader` branches on the quantization method and loads either HF weights or GGUF weights for the transformer.
4. GGUF transformer loading is routed through model-specific adapters (e.g., Flux2Klein).
5. The vLLM GGUF path uses `GGUFConfig` and `GGUFLinearMethod` for matmul.
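For illustration, the two spellings in item 1 can carry the same payload. A minimal sketch, assuming the dict mirrors the `--quantization-config` JSON accepted by the serve command; only `method` and `gguf_model` are taken from this PR, and no other fields are implied:

```python
import json

# The structured form passed programmatically via quantization_config.
quantization_config = {
    "method": "gguf",
    "gguf_model": "/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf",
}

# The CLI form is the same dict serialized as a JSON string.
cli_arg = json.dumps(quantization_config)
```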
## Call Chain (Offline)
```
CLI (examples/offline_inference/text_to_image/text_to_image.py)
  |
  v
Omni (vllm_omni/entrypoints/omni.py)
  |
  v
OmniStage (diffusion)
  |
  v
DiffusionWorker
  |
  v
DiffusionModelRunner
  |
  v
DiffusersPipelineLoader
  |
  v
Pipeline.forward (Flux2/Qwen/Z-Image)
  |
  v
DiffusionEngine
  |
  v
OmniRequestOutput
  |
  v
Client (saved PNG)
```
## Call Chain (Online)
```
Client
  |
  | POST /v1/images/generations
  v
APIServer (vllm_omni/entrypoints/openai/api_server.py)
  |
  v
_generate_with_async_omni
  |
  v
AsyncOmni
  |
  v
DiffusionEngine
  |
  v
OmniRequestOutput
  |
  v
encode_image_base64
  |
  v
ImageGenerationResponse
  |
  v
Client
```
## Call Chain (GGUF Operator Path)
```
Pipeline.forward (Flux2/Qwen/Z-Image)
  |
  v
Transformer blocks
  |
  v
QKVParallelLinear / ColumnParallelLinear / RowParallelLinear
  |
  v
LinearBase.forward
  |
  v
QuantMethod.apply (GGUFLinearMethod.apply)
  |
  v
fused_mul_mat_gguf
  |
  v
_fused_mul_mat_gguf (custom op)
  |
  v
ops.ggml_dequantize
  |
  v
x @ weight.T
```
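The last two steps amount to dequantize-then-matmul. `ops.ggml_dequantize` is a compiled kernel; as a hedged illustration only (not the vllm-omni code), here is a pure-Python sketch of dequantizing Q8_0, the quant type of the example GGUF file. The layout follows the ggml Q8_0 format: each block stores one fp16 scale `d` followed by 32 int8 quants, and each value is recovered as `x_i = d * q_i`.

```python
import struct

Q8_0_BLOCK = 32  # int8 values per Q8_0 block

def dequantize_q8_0(raw: bytes) -> list[float]:
    """Dequantize a stream of Q8_0 blocks: 2-byte fp16 scale + 32 int8 quants."""
    out: list[float] = []
    step = 2 + Q8_0_BLOCK  # bytes per block
    for off in range(0, len(raw), step):
        (d,) = struct.unpack_from("<e", raw, off)          # fp16 scale
        quants = struct.unpack_from("<32b", raw, off + 2)  # 32 signed int8 values
        out.extend(d * q for q in quants)
    return out

# One block with scale 0.5 and quants 0..31.
block = struct.pack("<e", 0.5) + bytes(range(32))
values = dequantize_q8_0(block)
```

After dequantization, the dense weight feeds the plain `x @ weight.T` matmul at the bottom of the chain.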
## GGUF Weight Loading Path (Transformer-Only)
1. `DiffusersPipelineLoader.load_model` detects `quantization_config.method == "gguf"`.
2. `gguf_model` is resolved as one of: a local file, `repo/file.gguf`, or `repo:quant_type`.
3. GGUF weights are routed through adapters in `vllm_omni/diffusion/model_loader/gguf_adapters/`.
4. Name mapping is applied per architecture (Z-Image, Flux2Klein).
5. GGUF weights are loaded into the transformer modules; the remaining non-transformer weights come from the HF checkpoint.
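A hedged sketch of the resolution in step 2 — `classify_gguf_spec` is a hypothetical helper, not the loader's actual API; it only shows a plausible precedence among the three accepted spellings:

```python
import os

def classify_gguf_spec(spec: str) -> str:
    """Classify a gguf_model spec as 'local', 'repo_file', or 'repo_quant'."""
    if os.path.exists(spec):
        return "local"       # a file on disk wins
    if spec.endswith(".gguf"):
        return "repo_file"   # e.g. "unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf"
    if ":" in spec:
        return "repo_quant"  # e.g. "unsloth/FLUX.2-klein-4B-GGUF:Q8_0"
    raise ValueError(f"Unrecognized gguf_model spec: {spec!r}")
```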
## GGUF Adapter Design
1. `GGUFAdapter` is an abstract base class for model-specific adapters.
2. `Flux2KleinGGUFAdapter` implements Flux2-Klein name remapping, qkv splitting, and the adaLN swap.
3. `ZImageGGUFAdapter` implements Z-Image qkv and ffn shard handling and linear qweight routing.
4. `get_gguf_adapter(...)` selects strictly by model class/config; unsupported models raise an error (no fallback adapter).

Adapter paths:
- Base: `vllm_omni/diffusion/model_loader/gguf_adapters/base.py`
- Z-Image: `vllm_omni/diffusion/model_loader/gguf_adapters/z_image.py`
- Flux2-Klein: `vllm_omni/diffusion/model_loader/gguf_adapters/flux2_klein.py`
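The adapter contract above can be sketched as follows. All class names, the rename table, and the split helper are illustrative stand-ins, not the actual vllm_omni classes; the sketch only demonstrates the three ideas the design names: per-model remapping, qkv splitting, and strict selection with no fallback.

```python
from abc import ABC, abstractmethod

class GGUFAdapterSketch(ABC):
    """Stand-in for the GGUFAdapter base class."""

    @abstractmethod
    def remap(self, gguf_name: str) -> str:
        """Map a GGUF tensor name onto the transformer's parameter name."""

class Flux2KleinAdapterSketch(GGUFAdapterSketch):
    # Illustrative rename table; the real Flux2-Klein mapping differs.
    _RENAMES = {"attn_qkv": "attn.qkv_proj"}

    def remap(self, gguf_name: str) -> str:
        for old, new in self._RENAMES.items():
            gguf_name = gguf_name.replace(old, new)
        return gguf_name

def split_qkv(fused):
    # Split a fused qkv weight into equal q, k, v shards along dim 0.
    n = len(fused) // 3
    return fused[:n], fused[n:2 * n], fused[2 * n:]

_ADAPTERS = {"Flux2Klein": Flux2KleinAdapterSketch}

def get_adapter_sketch(model_cls_name: str) -> GGUFAdapterSketch:
    # Strict selection: unsupported models raise instead of falling back.
    try:
        return _ADAPTERS[model_cls_name]()
    except KeyError:
        raise ValueError(f"No GGUF adapter for {model_cls_name!r}") from None
```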
## User Usage (Offline)

### Baseline BF16
```bash
python examples/offline_inference/text_to_image/text_to_image.py \
  --model /workspace/models/black-forest-labs/FLUX.2-klein-4B \
  --prompt "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture" \
  --height 768 \
  --width 1360 \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 4 \
  --output outputs/flux2_klein_4b.png
```
### Native GGUF (Transformer Only)
```bash
python examples/offline_inference/text_to_image/text_to_image.py \
  --model /workspace/models/black-forest-labs/FLUX.2-klein-4B \
  --gguf-model "/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf" \
  --quantization gguf \
  --prompt "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture" \
  --height 768 \
  --width 1360 \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 4 \
  --output outputs/flux2_klein_4b_gguf.png
```

Notes for GGUF:
1. Many GGUF repos do not ship `model_index.json` and the other configs. Use the base repo for `--model` and only pass the GGUF file via `--gguf-model`.
2. `gguf_model` supports a local path, `repo/file.gguf`, or `repo:quant_type`.
## User Usage (Online)

### Start Server (Native GGUF via CLI)
```bash
vllm serve /workspace/models/black-forest-labs/FLUX.2-klein-4B \
  --omni \
  --port 8000 \
  --quantization-config '{"method":"gguf","gguf_model":"/workspace/models/unsloth/FLUX.2-klein-4B-GGUF/flux-2-klein-4b-Q8_0.gguf"}'
```
### Online Request (Images API)
```bash
curl -X POST http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "a dragon laying over the spine of the Green Mountains of Vermont",
    "size": "1024x1024",
    "seed": 42,
    "num_inference_steps": 4
  }'
```
```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

import pytest
import torch
import torch.nn as nn

from vllm_omni.diffusion.model_loader.diffusers_loader import DiffusersPipelineLoader

pytestmark = [pytest.mark.core_model, pytest.mark.diffusion, pytest.mark.cpu]


class _DummyPipelineModel(nn.Module):
    def __init__(self, *, source_prefix: str):
        super().__init__()
        self.transformer = nn.Linear(2, 2, bias=False)
        self.vae = nn.Linear(2, 2, bias=False)
        self.weights_sources = [
            DiffusersPipelineLoader.ComponentSource(
                model_or_path="dummy",
                subfolder="transformer",
                revision=None,
                prefix=source_prefix,
                fall_back_to_pt=True,
            )
        ]

    def load_weights(self, weights):
        params = dict(self.named_parameters())
        loaded: set[str] = set()
        for name, tensor in weights:
            if name not in params:
                continue
            params[name].data.copy_(tensor.to(dtype=params[name].dtype))
            loaded.add(name)
        return loaded


def _make_loader_with_weights(weight_names: list[str]) -> DiffusersPipelineLoader:
    loader = object.__new__(DiffusersPipelineLoader)
    loader.counter_before_loading_weights = 0.0
    loader.counter_after_loading_weights = 0.0

    def _iter_weights(_model):
        for name in weight_names:
            yield name, torch.zeros((2, 2))

    loader.get_all_weights = _iter_weights  # type: ignore[assignment]
    return loader


def test_strict_check_only_validates_source_prefix_parameters():
    model = _DummyPipelineModel(source_prefix="transformer.")
    loader = _make_loader_with_weights(["transformer.weight"])

    # Should not require VAE parameters because they are outside weights_sources.
    loader.load_weights(model)


def test_strict_check_raises_when_source_parameters_are_missing():
    model = _DummyPipelineModel(source_prefix="transformer.")
    loader = _make_loader_with_weights([])

    with pytest.raises(ValueError, match="transformer.weight"):
        loader.load_weights(model)


def test_empty_source_prefix_keeps_full_model_strict_check():
    model = _DummyPipelineModel(source_prefix="")
    loader = _make_loader_with_weights(["transformer.weight"])

    with pytest.raises(ValueError, match="vae.weight"):
        loader.load_weights(model)
```