[Feature]: FP8 Quantization Support for DiT #1034
@@ -0,0 +1,77 @@
# FP8 Quantization
## Overview

FP8 quantization converts BF16/FP16 weights to FP8 at model load time. No calibration or pre-quantized checkpoint is needed.

Depending on the model, either all layers can be quantized or some sensitive layers should stay in BF16. See the [per-model table](#supported-models) for which case applies.

Common sensitive layers in DiT-based diffusion models include **image-stream MLPs** (`img_mlp`). These are particularly vulnerable to FP8 precision loss because they process denoising latents whose dynamic range shifts significantly across timesteps, and, unlike attention projections (which benefit from QK-Norm stabilization), MLPs have no built-in normalization to absorb quantization error. In deep architectures (e.g., 60+ residual blocks), small per-layer errors compound and degrade output quality. Other layers such as **attention projections** (`to_qkv`, `to_out`) and **text-stream MLPs** (`txt_mlp`) are generally more robust thanks to normalization or more stable input statistics.
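For intuition about where that precision loss comes from, here is a minimal sketch of dynamic per-tensor FP8 (E4M3) weight quantization using plain PyTorch. It is illustrative only and assumes a PyTorch build with float8 dtypes; it is not the kernel path vLLM-Omni actually dispatches to, and the scale choice is a simplification.

```python
import torch

# Illustrative sketch only -- not vLLM-Omni's actual quantization kernels.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def quantize_weight_fp8(weight: torch.Tensor):
    """Quantize a BF16/FP16 weight to FP8 with a single per-tensor scale."""
    scale = weight.abs().max().clamp(min=1e-12) / FP8_MAX
    w_fp8 = (weight / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale  # scale stays in high precision for dequant / fused GEMM

w = torch.randn(4096, 4096, dtype=torch.bfloat16)
w_fp8, s = quantize_weight_fp8(w)
# The residual error below is what "sensitive" layers like img_mlp cannot absorb:
err = (w_fp8.to(torch.bfloat16) * s - w).abs().max()
```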
## Configuration

1. **Python API**: set `quantization="fp8"`. To skip sensitive layers, use `quantization_config` with `ignored_layers`.

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# All layers quantized
omni = Omni(model="<your-model>", quantization="fp8")

# Skip sensitive layers
omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "fp8",
        "ignored_layers": ["<layer-name>"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

**Member** (on `ignored_layers`): It's great that we provide flexibility here. But it means we have to maintain detailed examples for models.

**Author:** Agreed: the per-model table in the doc serves as the single source of truth. As we add models, we'll update that table with the recommended `ignored_layers`.
2. **CLI**: pass `--quantization fp8` and optionally `--ignored-layers`.

```bash
# All layers
python text_to_image.py --model <your-model> --quantization fp8

# Skip sensitive layers
python text_to_image.py --model <your-model> --quantization fp8 --ignored-layers "img_mlp"

# Online serving
vllm serve <your-model> --omni --quantization fp8
```
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `method` | str | — | Quantization method (`"fp8"`) |
| `ignored_layers` | list[str] | `[]` | Layer name patterns to keep in BF16 |
| `activation_scheme` | str | `"dynamic"` | `"dynamic"` (no calibration) or `"static"` |
| `weight_block_size` | list[int] \| None | `None` | Block size for block-wise weight quantization |

The available `ignored_layers` names depend on the model architecture (e.g., `to_qkv`, `to_out`, `img_mlp`, `txt_mlp`). Consult the transformer source for your target model.
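Putting the table together, a config that exercises every field might look like the sketch below. The specific layer names and block size are illustrative placeholders, not tuned recommendations for any particular model.

```python
from vllm_omni import Omni

# Illustrative only: layer names and block size are placeholders,
# not recommendations for a specific model.
omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "fp8",
        "ignored_layers": ["img_mlp"],     # keep these layers in BF16
        "activation_scheme": "dynamic",    # no calibration pass
        "weight_block_size": [128, 128],   # block-wise weight scales
    },
)
```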
## Supported Models

| Model | HF Models | Recommendation | `ignored_layers` |
|-------|-----------|----------------|------------------|
| Z-Image | `Tongyi-MAI/Z-Image-Turbo` | All layers | None |
| Qwen-Image | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | Skip sensitive layers | `img_mlp` |
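For example, the Qwen-Image row above translates to the following call (a sketch that reuses the Python API shown earlier):

```python
from vllm_omni import Omni

# Apply the Qwen-Image recommendation: keep img_mlp in BF16, quantize the rest.
omni = Omni(
    model="Qwen/Qwen-Image",
    quantization_config={"method": "fp8", "ignored_layers": ["img_mlp"]},
)
```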
## Combining with Other Features

**Member:** We'd better add a dev doc for quantization support. It can be done in a follow-up PR.

**Author:** Agreed, will add a developer guide for adding quantization support to new models in a follow-up PR.

FP8 quantization can be combined with cache acceleration:
```python
omni = Omni(
    model="<your-model>",
    quantization="fp8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
```
@@ -0,0 +1,17 @@
# Quantization for Diffusion Transformers

vLLM-Omni supports quantization of DiT linear layers to reduce memory usage and accelerate inference.
## Supported Methods

| Method | Guide |
|--------|-------|
| FP8 | [FP8](fp8.md) |
## Device Compatibility

| GPU Generation | Example GPUs | FP8 Mode |
|---------------------|----------------------|----------|
| Ada/Hopper (SM 89+) | RTX 4090, H100, H200 | Full W8A8 with native hardware support |

Kernel selection is automatic.
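To verify that a GPU falls in the supported row before enabling FP8, a quick capability check along these lines works. The SM 89 threshold mirrors the table above; the snippet itself is plain PyTorch, not a vLLM-Omni API.

```python
import torch

# SM 89+ (Ada/Hopper) maps to native FP8 tensor-core support per the table above.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        print("Native FP8 W8A8 kernels available")
    else:
        print("This GPU lacks native FP8 tensor cores")
```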
@@ -0,0 +1,2 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
**Reviewer:** It's better to further explain what the common sensitive layers are. Like Norm?

**Author:** Agreed. Sensitive-layer examples have been added.