Int8 Quantization Support for DiT (Z-Image & Qwen-Image) #1470
Merged
33 commits
61dee65 [Feature] Add Int8 quantization support for Z-Image and Qwen-Image (yjb767868009)
aa3598e fix process_weights_after_loading (yjb767868009)
e394819 fix DiffusionInt8Config init (yjb767868009)
f4282df fix Int8Config's function from_config undefine Int8Config (yjb767868009)
b1c29a4 fix format (yjb767868009)
2a95e10 fix format (yjb767868009)
a3fcc33 fix format (yjb767868009)
17aab52 add quant_config_cls and fix import torch_npu (yjb767868009)
9ecc62f fix format (yjb767868009)
b8cb90d Merge branch 'main' into int8-quant (yjb767868009)
6173d12 fix invalid character (yjb767868009)
a117002 Merge branch 'main' into int8-quant (yjb767868009)
6ce717a add int8 for GPU (yjb767868009)
d5ac438 [CI] Add scripts for bechmark collection and email distribution. (#1307) (congw729)
b6b5842 Merge branch 'int8-quant' of https://github.com/yjb767868009/vllm-omn… (yjb767868009)
806285b fix import (yjb767868009)
23b4252 raise error in int8 unsupported platfrom (yjb767868009)
03a7e47 fix npu int8 process_weights_after_loading unclear & complete test_in… (yjb767868009)
d6eec8c fix format (yjb767868009)
a17f9bd fix format (yjb767868009)
302f0b6 add smoke test & lazy weight loading (yjb767868009)
a380398 fix import torch_npu (yjb767868009)
aed5647 fix pytest.mark.skipif (yjb767868009)
624460f fix format (yjb767868009)
19dc858 fix format (yjb767868009)
ee91bf6 Merge branch 'main' into int8-quant (yjb767868009)
141f715 fix problem from path updates in the vllm operator (yjb767868009)
871952b fix format (yjb767868009)
516ded1 Merge branch 'main' into int8-quant (david6666666)
d571770 Merge branch 'main' into int8-quant (david6666666)
084db20 Fix the issue of quantization parameter passing, and add z_image as t… (yjb767868009)
9975e1e Merge branch 'main' into int8-quant (yjb767868009)
e1cfe8c fix format (yjb767868009)
@@ -0,0 +1,75 @@

# Int8 Quantization

## Overview

Int8 quantization converts BF16/FP16 weights to Int8 at model load time. No calibration or pre-quantized checkpoint needed.

Depending on the model, either all layers can be quantized, or some sensitive layers should stay in BF16/FP16. See the [per-model table](#supported-models) for which case applies.

## Configuration

1. **Python API**: set `quantization="int8"`. To skip sensitive layers, use `quantization_config` with `ignored_layers`.

```python
from vllm_omni import Omni
from vllm_omni.inputs.data import OmniDiffusionSamplingParams

# All layers quantized
omni = Omni(model="<your-model>", quantization="int8")

# Skip sensitive layers
omni = Omni(
    model="<your-model>",
    quantization_config={
        "method": "int8",
        "ignored_layers": ["<layer-name>"],
    },
)

outputs = omni.generate(
    "A cat sitting on a windowsill",
    OmniDiffusionSamplingParams(num_inference_steps=50),
)
```

2. **CLI**: pass `--quantization int8` and optionally `--ignored-layers`.

```bash
# All layers
python text_to_image.py --model <your-model> --quantization int8

# Skip sensitive layers
python text_to_image.py --model <your-model> --quantization int8 --ignored-layers "img_mlp"

# Online serving
vllm serve <your-model> --omni --quantization int8
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `method` | str | — | Quantization method (`"int8"`) |
| `ignored_layers` | list[str] | `[]` | Layer name patterns to keep in BF16/FP16 |
| `activation_scheme` | str | `"dynamic"` | `"dynamic"` (no calibration) |

The available `ignored_layers` names depend on the model architecture (e.g., `to_qkv`, `to_out`, `img_mlp`, `txt_mlp`). Consult the transformer source for your target model.

## Supported Models

| Model | HF Models | Recommendation | `ignored_layers` |
|-------|-----------|---------------|------------------|
| Z-Image | `Tongyi-MAI/Z-Image-Turbo` | All layers | None |
| Qwen-Image | `Qwen/Qwen-Image`, `Qwen/Qwen-Image-2512` | All layers | None |

## Combining with Other Features

Int8 quantization can be combined with cache acceleration:

```python
omni = Omni(
    model="<your-model>",
    quantization="int8",
    cache_backend="tea_cache",
    cache_config={"rel_l1_thresh": 0.2},
)
```
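
For readers unfamiliar with load-time quantization, the weight conversion and the `ignored_layers` skip described in the added documentation can be illustrated with a small standalone sketch. This is not the PR's implementation: the helper names (`quantize_rows_int8`, `dequantize_rows`, `is_ignored`, `quantize_model`), the per-row symmetric scaling, the substring-based pattern match, and the layer names are all assumptions chosen for illustration, and plain Python lists stand in for tensors. Dynamic activation quantization is not shown.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def quantize_rows_int8(weight: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Symmetric int8 quantization with one scale per row (output channel)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weight:
        max_abs = max((abs(v) for v in row), default=0.0)
        # An all-zero row gets scale 1.0 so the division below stays defined.
        scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    """Recover approximate float weights as q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def is_ignored(layer_name: str, ignored_layers: List[str]) -> bool:
    # Assumption: a pattern matches when it is a substring of the layer name.
    return any(pattern in layer_name for pattern in ignored_layers)


def quantize_model(weights: Dict[str, Matrix], ignored_layers: List[str]) -> Dict[str, dict]:
    """Quantize every layer except those matching an ignored pattern."""
    result: Dict[str, dict] = {}
    for name, weight in weights.items():
        if is_ignored(name, ignored_layers):
            # Sensitive layers keep their original floating-point weights.
            result[name] = {"dtype": "float", "weight": weight}
        else:
            q_rows, scales = quantize_rows_int8(weight)
            result[name] = {"dtype": "int8", "weight": q_rows, "scales": scales}
    return result


# Hypothetical layer names, loosely modelled on the examples in the text.
weights = {
    "blocks.0.attn.to_qkv": [[1.0, -1.0, 0.25]],
    "blocks.0.img_mlp": [[2.0, -0.5]],
}
quantized = quantize_model(weights, ignored_layers=["img_mlp"])
for name, entry in quantized.items():
    print(name, entry["dtype"])
```

With per-row symmetric scaling, the round-trip error of each weight is bounded by half of that row's scale, which is one reason a row with a single large outlier loses precision on its smaller entries and may be a candidate for `ignored_layers`.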
    @@ -131,12 +131,10 @@ def parse_args() -> argparse.Namespace:
             "--quantization",
             type=str,
             default=None,
    -        choices=["fp8", "gguf"],
    -        help=(
    -            "Quantization method for the transformer. "
    -            "Options: 'fp8' (FP8 W8A8), 'gguf' (GGUF quantized weights). "
    -            "Default: None (no quantization, uses BF16)."
    -        ),
    +        choices=["fp8", "int8", "gguf"],
    +        help="Quantization method for the transformer. "
    +        "Options: 'fp8' (FP8 W8A8 on Ada/Hopper, weight-only on older GPUs), 'int8' (Int8 W8A8), 'gguf' (GGUF quantized weights). "
    +        "Default: None (no quantization, uses BF16).",

Collaborator
Missing space between

         )
         parser.add_argument(
             "--gguf-model",
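
The `--quantization` change above can be tried in isolation with a minimal sketch. It is not the example script itself: `build_parser` and `QUANT_HELP` are hypothetical names, and the help text is shortened. The review comment is cut off in this capture, so what it refers to is not known. The sketch only shows a general Python property relevant to multi-fragment help strings: adjacent string literals are concatenated with no separator, so each fragment must carry its own trailing space.

```python
import argparse

# Adjacent string literals are joined with no separator, so each fragment
# ends with its own trailing space.
QUANT_HELP = (
    "Quantization method for the transformer. "
    "Options: 'fp8', 'int8', 'gguf'. "
    "Default: None (no quantization, uses BF16)."
)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the example script's parser.
    parser = argparse.ArgumentParser(description="Int8 flag sketch")
    parser.add_argument(
        "--quantization",
        type=str,
        default=None,
        choices=["fp8", "int8", "gguf"],
        help=QUANT_HELP,
    )
    return parser


parser = build_parser()
args = parser.parse_args(["--quantization", "int8"])
print(args.quantization)
```

Because `choices` is set, argparse rejects any value outside the list and exits with a usage error before the script reaches model loading.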
Now that int8 is added, should there be a matching 'Device Compatibility for Int8' section?