[AutoRound] Add offline quantized W4A16 model support #1777
lishunyang12 merged 18 commits into vllm-project:main
Conversation
Since |
Good catch, updated!
PR Review: [WIP][AutoRound] Add offline quantized `W4A16` model support
Gate Status: BLOCKED
| Check | Status |
|---|---|
| DCO | ACTION_REQUIRED |
| pre-commit | SUCCESS |
| build (3.11) | SUCCESS |
| build (3.12) | SUCCESS |
| docs | SUCCESS |
The DCO check is failing. Please sign off commits with `git commit -s` and force push.
Critical Findings
1. Missing Tests [HIGH PRIORITY]
The PR body has unchecked test checkboxes and provides no test scripts or results. Per the test coverage requirements, quantization changes need:
- Memory savings measurement (quantized vs FP16)
- Quality impact measurement
- Regression tests for the new code path
Action: Add test scripts showing memory comparison and quality verification.
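A back-of-the-envelope sketch of the kind of memory comparison being requested — the layer shapes and helper name here are illustrative, not measured from the PR:

```python
def param_bytes(shapes, bits_per_weight):
    """Total weight storage in bytes for the given 2-D tensor shapes."""
    n_weights = sum(rows * cols for rows, cols in shapes)
    return n_weights * bits_per_weight // 8

# Hypothetical linear-layer shapes for a single transformer block.
shapes = [(3072, 3072), (3072, 12288), (12288, 3072)]
fp16_bytes = param_bytes(shapes, 16)
w4_bytes = param_bytes(shapes, 4)
print(f"fp16: {fp16_bytes} B, w4: {w4_bytes} B, "
      f"savings: {1 - w4_bytes / fp16_bytes:.0%}")
```

A real test script would measure actual allocator peaks on load, but even this arithmetic check pins down the expected ~75% weight-memory reduction for W4 vs FP16.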
2. Docstring Error in `inc.py`
```python
"""GPTQ quantization config for diffusion transformers."""
```
This should say "INC/AutoRound quantization config" not "GPTQ".
3. Incomplete Code in `resolve_quantization()`
```python
# self.quantization_config._vllm_config.maybe_update_config(f"{self.model}/transformer/")
```
This commented-out line suggests incomplete work. Either implement or remove.
4. Potential Silent Failure in Weight Loading
```python
if od_config.quantization is None:
raise ValueError(f"Following weights were not initialized from checkpoint: {weights_not_loaded}")
else:
logger.warning(...) # Only logs, doesn't fail
```
This could mask real weight loading issues in quantized models. Consider whether missing weights should always be an error, or validate against expected quantized weight names.
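A hedged sketch of the suggested validation — the suffix set is GPTQ-style and illustrative, not the project's actual list:

```python
# GPTQ-style packed-weight suffixes; illustrative only.
QUANTIZED_SUFFIXES = (".qweight", ".qzeros", ".scales", ".g_idx")

def check_unloaded_weights(weights_not_loaded, quantization):
    """Raise for weights missing for reasons unrelated to quantization;
    tolerate names that only exist in quantized checkpoints."""
    unexpected = [
        name for name in weights_not_loaded
        if quantization is None or not name.endswith(QUANTIZED_SUFFIXES)
    ]
    if unexpected:
        raise ValueError(
            f"Following weights were not initialized from checkpoint: {unexpected}")

# A quantized-only suffix is tolerated when quantization is enabled:
check_unloaded_weights(["blocks.0.attn.to_q.qweight"], quantization="w4a16")
```

This keeps the hard failure for genuinely missing weights while only downgrading to a pass for names the quantized format is expected to own.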
Summary
| Validated | Lacks Evidence | Must Change |
|---|---|---|
| FLUX transformer prefix propagation | Memory savings vs FP16 | DCO sign-off |
| AdaLayerNorm quantization support | Quality comparison output | Test scripts & results |
| Quantization registry integration | Benchmark data | Docstring fix in `inc.py` |
🤖 Generated with Claude Code
lishunyang12 left a comment
left one comment on the quant resolution approach
lishunyang12 left a comment
A few issues with the quant resolution and the weight loading change.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be6f954ee2
Register INC and AutoRound as quantization backends in the unified quantization factory, enable auto-detection of the quantization config from the transformer `config.json`, and validate unloaded weights during model loading.

Key changes:
- Add `_build_inc()` to the quantization factory with checkpoint kwarg normalization (`bits` → `weight_bits`) and `INCConfig` param filtering
- Extend `TransformerConfig.from_dict` to parse an embedded `quantization_config` and build a `QuantizationConfig` automatically
- Add `OmniDiffusionConfig.set_tf_model_config()` for late-stage quantization propagation when the config is loaded after construction
- Add weight validation in `diffusers_loader` to distinguish expected quantized weight suffixes from truly missing weights

Signed-off-by: yiliu30 <yi4.liu@intel.com>
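The checkpoint-kwarg normalization described in that commit might look roughly like this — `INCConfig` is a minimal stand-in and the alias map is an assumption, not copied from the PR:

```python
import inspect

# Assumed alias map: AutoRound checkpoint keys -> INCConfig parameter names.
KEY_ALIASES = {"bits": "weight_bits"}

class INCConfig:
    """Stand-in for the real INCConfig class."""
    def __init__(self, weight_bits: int = 4, group_size: int = 128):
        self.weight_bits = weight_bits
        self.group_size = group_size

def build_inc(checkpoint_cfg: dict) -> INCConfig:
    # 1. Rename checkpoint keys to the config's expected names.
    renamed = {KEY_ALIASES.get(k, k): v for k, v in checkpoint_cfg.items()}
    # 2. Drop any keys the config constructor does not accept.
    accepted = set(inspect.signature(INCConfig.__init__).parameters) - {"self"}
    filtered = {k: v for k, v in renamed.items() if k in accepted}
    return INCConfig(**filtered)

cfg = build_inc({"bits": 4, "group_size": 64, "quant_method": "autoround"})
```

Filtering through `inspect.signature` keeps unknown checkpoint keys (like `quant_method`) from crashing the constructor while still forwarding everything it understands.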
Replace the diffusers AdaLayerNorm with quantization-aware variants that accept `quant_config` and `prefix` parameters, enabling GPTQ-Marlin kernels on AdaLayerNorm linear layers. Propagate prefix strings through the FLUX transformer blocks for correct weight name resolution.

Key changes:
- Add AdaLayerNormZero, AdaLayerNormZeroSingle, and AdaLayerNormContinuous using `ReplicatedLinear`
- Pass `quant_config` and `prefix` to all AdaLayerNorm constructors in FluxTransformerBlock and FluxSingleTransformerBlock
- Pass `prefix` to `ReplicatedLinear` in FluxSingleTransformerBlock

Signed-off-by: yiliu30 <yi4.liu@intel.com>
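The prefix propagation can be illustrated with a minimal, vLLM-free sketch — the class and attribute names are simplified stand-ins for the real modules:

```python
# Each module appends its own sub-module name to the prefix it receives,
# so the innermost linear layer ends up with a fully qualified weight name
# that matches the checkpoint's quantized tensor names.
class QuantLinear:
    def __init__(self, prefix: str):
        self.weight_name = f"{prefix}.qweight"

class AdaLayerNormZero:
    def __init__(self, prefix: str):
        self.linear = QuantLinear(prefix=f"{prefix}.linear")

class FluxTransformerBlock:
    def __init__(self, prefix: str):
        self.norm1 = AdaLayerNormZero(prefix=f"{prefix}.norm1")

block = FluxTransformerBlock(prefix="transformer_blocks.0")
print(block.norm1.linear.weight_name)
# transformer_blocks.0.norm1.linear.qweight
```

If a constructor in the chain drops or mangles the prefix, the resolved name no longer matches the checkpoint and the weight silently fails to load — which is exactly what the prefix-propagation tests guard against.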
Add unit and E2E tests covering the new INC/AutoRound quantization support, quantizable AdaLayerNorm layers, and FLUX transformer prefix propagation.

Key changes:
- `test_inc_config`: INC factory, checkpoint kwarg normalization, `bits` → `weight_bits` mapping, invalid param filtering
- `test_adalayernorm`: shape, gradient, and quantization-config acceptance for AdaLayerNormZero/Single/Continuous
- `test_flux_prefix_propagation`: verify prefix strings propagate correctly through FLUX transformer blocks
- `test_flux_autoround_w4a16`: E2E offline inference with an AutoRound W4A16 quantized FLUX model

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Add user-facing documentation for AutoRound diffusion quantization, covering W4A16 as the first supported scheme, with a roadmap for additional schemes. Update the quantization overview table.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Hi @david6666666, could you take a look and help trigger the CI, thanks!
@yiliu30 please fix CI, otherwise LGTM
Fix AttributeError in `_check_unloaded_weights` when `od_config` is absent, and fix a ROCm GEMM dispatch crash in the CPU-tensor AdaLayerNorm tests.

Refs: vllm-project#1777
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Hi @david6666666 the |
Not related to this PR, just wait for buildkite/vllm-omni to pass.
@hsliuustc0106 ptal thx |
…#1777) Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
…#1777) Signed-off-by: yiliu30 <yi4.liu@intel.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
…ript + standard loader Signed-off-by: lishunyang <lishunyang12@163.com>
Purpose
Load offline quantized models produced by AutoRound. Part of #1325.
Needs a rebase after #1764 lands -- DONE
Doc: https://github.com/yiliu30/vllm-omni-fork/blob/feats/ar-w4a16/docs/user_guide/diffusion/quantization/autoround.md
CMD and Model
https://huggingface.co/Yi30/FLUX.1-dev-AutoRound-w4a16
```shell
python examples/offline_inference/text_to_image/text_to_image.py \
  --model Yi30/FLUX.1-dev-AutoRound-w4a16 \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --guidance-scale 3.5 \
  --num-images-per-prompt 1 \
  --num-inference-steps 50 \
  --height 1024 \
  --width 1024 \
  --enforce-eager \
  --output outputs/skier_autoround_w4a16.png
```
Prompts:
- a cup of coffee on the table
- a cat sitting on a windowsill at sunset
- a mountain landscape with a lake reflecti…

Test Plan
Test Result
cc @hshen14 @thuang6 @lvliang-intel @mengniwang95 @xuechendi