[AutoRound] Add offline quantized W4A16 model support#1777

Merged
lishunyang12 merged 18 commits into vllm-project:main from yiliu30:feats/ar-w4a16
Apr 2, 2026

Conversation

Contributor

@yiliu30 yiliu30 commented Mar 10, 2026


Purpose

Load offline AutoRound-quantized W4A16 models. Part of #1325.
Needs a rebase after #1764 lands -- DONE

```bash
python examples/offline_inference/text_to_image/text_to_image.py \
    --model Yi30/FLUX.1-dev-AutoRound-w4a16 \
    --prompt "a cup of coffee on the table" \
    --seed 42 \
    --guidance-scale 3.5 \
    --num-images-per-prompt 1 \
    --num-inference-steps 50 \
    --height 1024 \
    --width 1024 \
    --enforce-eager \
    --output outputs/skier_autoround_w4a16.png
```
Prompts used for evaluation:
  • a cup of coffee on the table
  • a cat sitting on a windowsill at sunset
  • a mountain landscape with a lake reflection
| Metric | BF16 Baseline | W4A16 |
| --- | --- | --- |
| Model Mem (GiB) | 35.66 | 20.69 |
| Mem Reduction | -- | 42% |
| Mean LPIPS (ref) | -- | 0.1513 |
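
For reproducibility, a minimal sketch of how the mean LPIPS above can be computed, assuming the `lpips` PyPI package and pre-generated BF16/W4A16 image pairs (the file names below are hypothetical):

```python
# Mean LPIPS over BF16-vs-W4A16 image pairs (assumes `pip install lpips`).
import lpips
import torch
from torchvision.io import read_image

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, the package default

def to_lpips_input(path: str) -> torch.Tensor:
    img = read_image(path).float() / 255.0  # (3, H, W) in [0, 1]
    return (img * 2.0 - 1.0).unsqueeze(0)   # LPIPS expects NCHW in [-1, 1]

# Hypothetical output files for the three prompts above.
pairs = [
    ("bf16_coffee.png", "w4a16_coffee.png"),
    ("bf16_cat.png", "w4a16_cat.png"),
    ("bf16_mountain.png", "w4a16_mountain.png"),
]
with torch.no_grad():
    scores = [loss_fn(to_lpips_input(a), to_lpips_input(b)).item() for a, b in pairs]
print(f"Mean LPIPS: {sum(scores) / len(scores):.4f}")
```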

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


cc @hshen14 @thuang6 @lvliang-intel @mengniwang95 @xuechendi

@xuechendi
Contributor

Since AdaLayerNormContinuous, AdaLayerNormZero, and AdaLayerNormZeroSingle seem to be general changes to support quant_config, how about moving the current impl to vllm_omni/model_executor/layers/layer_norm.py so it can be used in other modeling code as well?

@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 18, 2026
@yiliu30
Contributor Author

yiliu30 commented Mar 18, 2026

> Since AdaLayerNormContinuous, AdaLayerNormZero, and AdaLayerNormZeroSingle seem to be general changes to support quant_config, how about moving the current impl to vllm_omni/model_executor/layers/layer_norm.py so it can be used in other modeling code as well?

Good catch, updated!

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


PR Review: [WIP][AutoRound] Add offline quantized `W4A16` model support

Gate Status: BLOCKED

| Check | Status |
| --- | --- |
| DCO | ACTION_REQUIRED |
| pre-commit | SUCCESS |
| build (3.11) | SUCCESS |
| build (3.12) | SUCCESS |
| docs | SUCCESS |

The DCO check is failing. Please sign off commits with `git commit -s` and force push.
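
One way to do that retroactively (a sketch; adjust the base ref to this branch's actual fork point):

```bash
# Re-apply a Signed-off-by trailer to every commit on the branch,
# then force-push the rewritten history.
git rebase --signoff origin/main
git push --force-with-lease
```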


Critical Findings

1. Missing Tests [HIGH PRIORITY]

The PR body has unchecked test checkboxes and provides no test scripts or results. Per the test coverage requirements, quantization changes need:

  • Memory savings measurement (quantized vs FP16)
  • Quality impact measurement
  • Regression tests for the new code path

Action: Add test scripts showing memory comparison and quality verification.

2. Docstring Error in `inc.py`

```python
"""GPTQ quantization config for diffusion transformers."""
```

This should say "INC/AutoRound quantization config" not "GPTQ".

3. Incomplete Code in `resolve_quantization()`

```python
# self.quantization_config._vllm_config.maybe_update_config(f"{self.model}/transformer/")
```

This commented-out line suggests incomplete work. Either implement or remove.

4. Potential Silent Failure in Weight Loading

```python
if od_config.quantization is None:
    raise ValueError(
        f"Following weights were not initialized from checkpoint: {weights_not_loaded}"
    )
else:
    logger.warning(...)  # Only logs, doesn't fail
```

This could mask real weight loading issues in quantized models. Consider whether missing weights should always be an error, or validate against expected quantized weight names.
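
One possible shape for that validation (a hypothetical sketch; the suffix list is an assumption based on common GPTQ-style checkpoints, not the PR's actual code):

```python
# Only weights whose names end in known quantized-tensor suffixes may
# legitimately be absent under quantization; anything else should hard-fail.
QUANTIZED_SUFFIXES = (".qweight", ".qzeros", ".scales", ".g_idx")

def check_unloaded_weights(weights_not_loaded: set[str], quantization: str | None) -> None:
    if quantization is None:
        unexpected = weights_not_loaded
    else:
        unexpected = {w for w in weights_not_loaded if not w.endswith(QUANTIZED_SUFFIXES)}
    if unexpected:
        raise ValueError(
            f"Following weights were not initialized from checkpoint: {sorted(unexpected)}"
        )
```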


Summary

| Validated | Lacks Evidence | Must Change |
| --- | --- | --- |
| FLUX transformer prefix propagation | Memory savings vs FP16 | DCO sign-off |
| AdaLayerNorm quantization support | Quality comparison output | Test scripts & results |
| Quantization registry integration | Benchmark data | Docstring fix in `inc.py` |

🤖 Generated with Claude Code

Collaborator

@lishunyang12 lishunyang12 left a comment


left one comment on the quant resolution approach

Collaborator

@lishunyang12 lishunyang12 left a comment


A few issues with the quant resolution and the weight loading change.

@yiliu30 yiliu30 changed the title from [WIP][AutoRound] Add offline quantized W4A16 model support to [AutoRound] Add offline quantized W4A16 model support Mar 20, 2026
@yiliu30 yiliu30 marked this pull request as ready for review March 20, 2026 06:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be6f954ee2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@hsliuustc0106 hsliuustc0106 self-requested a review March 24, 2026 14:09
@yiliu30 yiliu30 force-pushed the feats/ar-w4a16 branch 5 times, most recently from 3dade7e to b102fc4 March 25, 2026 03:09
yiliu30 added 4 commits March 25, 2026 03:13
Register INC and AutoRound as quantization backends in the unified
quantization factory, enable auto-detection of quantization config
from transformer config.json, and validate unloaded weights during
model loading.

Key changes:
- Add _build_inc() to quantization factory with checkpoint kwarg
  normalization (bits→weight_bits) and INCConfig param filtering
- Extend TransformerConfig.from_dict to parse embedded
  quantization_config and build QuantizationConfig automatically
- Add OmniDiffusionConfig.set_tf_model_config() for late-stage
  quantization propagation when config is loaded after construction
- Add weight validation in diffusers_loader to distinguish expected
  quantized weight suffixes from truly missing weights

Signed-off-by: yiliu30 <yi4.liu@intel.com>
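
As a rough illustration of the checkpoint-kwarg normalization described above (a sketch with assumed names, not the PR's exact code):

```python
# Map checkpoint keys (e.g. "bits") onto the config's parameter names
# (e.g. "weight_bits") and drop anything the constructor does not accept.
import inspect

KEY_ALIASES = {"bits": "weight_bits"}

def normalize_quant_kwargs(raw_cfg: dict, config_cls: type) -> dict:
    renamed = {KEY_ALIASES.get(k, k): v for k, v in raw_cfg.items()}
    accepted = set(inspect.signature(config_cls.__init__).parameters) - {"self"}
    return {k: v for k, v in renamed.items() if k in accepted}
```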
Replace diffusers AdaLayerNorm with quantization-aware variants that
accept quant_config and prefix parameters, enabling GPTQ-Marlin
kernels on AdaLayerNorm linear layers. Propagate prefix strings
through FLUX transformer blocks for correct weight name resolution.

Key changes:
- Add AdaLayerNormZero, AdaLayerNormZeroSingle, and
  AdaLayerNormContinuous using ReplicatedLinear
- Pass quant_config and prefix to all AdaLayerNorm constructors
  in FluxTransformerBlock and FluxSingleTransformerBlock
- Pass prefix to ReplicatedLinear in FluxSingleTransformerBlock

Signed-off-by: yiliu30 <yi4.liu@intel.com>
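
For context, a quantization-aware AdaLayerNormContinuous can look roughly like this (a sketch, not the PR's exact code; it assumes vLLM's ReplicatedLinear, whose forward returns an (output, bias) tuple):

```python
import torch
import torch.nn as nn
from vllm.model_executor.layers.linear import ReplicatedLinear

class AdaLayerNormContinuous(nn.Module):
    """Diffusers-style AdaLayerNormContinuous with a quantizable projection."""

    def __init__(self, embedding_dim: int, conditioning_embedding_dim: int,
                 quant_config=None, prefix: str = ""):
        super().__init__()
        self.silu = nn.SiLU()
        # ReplicatedLinear threads quant_config and prefix down, so a GPTQ-Marlin
        # (or other) kernel can replace the matmul and the weight loader can
        # resolve quantized checkpoint names under f"{prefix}.linear".
        self.linear = ReplicatedLinear(conditioning_embedding_dim, embedding_dim * 2,
                                       bias=True, quant_config=quant_config,
                                       prefix=f"{prefix}.linear")
        self.norm = nn.LayerNorm(embedding_dim, elementwise_affine=False, eps=1e-6)

    def forward(self, x: torch.Tensor, conditioning: torch.Tensor) -> torch.Tensor:
        emb, _ = self.linear(self.silu(conditioning))
        scale, shift = emb.chunk(2, dim=1)
        return self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
```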
Add unit and E2E tests covering the new INC/AutoRound quantization
support, quantizable AdaLayerNorm layers, and FLUX transformer
prefix propagation.

Key changes:
- test_inc_config: INC factory, checkpoint kwarg normalization,
  bits→weight_bits mapping, invalid param filtering
- test_adalayernorm: shape, gradient, quantization-config
  acceptance for AdaLayerNormZero/Single/Continuous
- test_flux_prefix_propagation: verify prefix strings propagate
  correctly through FLUX transformer blocks
- test_flux_autoround_w4a16: E2E offline inference with
  AutoRound W4A16 quantized FLUX model

Signed-off-by: yiliu30 <yi4.liu@intel.com>
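
The prefix-propagation check can be sketched generically as below (hedged: it assumes quantizable submodules record a `prefix` attribute matching their state-dict name, as vLLM linear layers do):

```python
import torch.nn as nn

def assert_prefixes_propagated(model: nn.Module) -> None:
    # Every submodule that carries a `prefix` should agree with its fully
    # qualified module name, so quantized checkpoint weights resolve correctly.
    for name, module in model.named_modules():
        prefix = getattr(module, "prefix", None)
        if isinstance(prefix, str) and prefix:
            assert name == prefix, f"prefix mismatch: {name!r} != {prefix!r}"
```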
Add user-facing documentation for AutoRound diffusion quantization
covering W4A16 as the first supported scheme, with a roadmap for
additional schemes. Update the quantization overview table.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Contributor Author

yiliu30 commented Mar 30, 2026

Hi @david6666666, could you take a look and help trigger the CI, thanks!

@david6666666 david6666666 added the ready label (label to trigger buildkite CI) Mar 30, 2026
@david6666666
Collaborator

@yiliu30 please fix CI, otherwise LGTM

yiliu30 added 4 commits March 30, 2026 05:26
Fix AttributeError in _check_unloaded_weights
when od_config is absent, and fix ROCm GEMM
dispatch crash for CPU-tensor adalayernorm tests.

Refs: vllm-project#1777
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Contributor Author

yiliu30 commented Mar 30, 2026

Hi @david6666666, the mi325_2: Omni Model Test Qwen3-Omni failed. Could you help check whether it's related to this PR or just a CI issue?

@david6666666
Collaborator

> Hi @david6666666, the mi325_2: Omni Model Test Qwen3-Omni failed. Could you help check whether it's related to this PR or just a CI issue?

Not related to this PR; just wait for buildkite/vllm-omni to pass.

@david6666666
Collaborator

david6666666 commented Mar 31, 2026

@hsliuustc0106 ptal thx

@lishunyang12 lishunyang12 enabled auto-merge (squash) April 2, 2026 05:57
@lishunyang12 lishunyang12 merged commit c1d2dcc into vllm-project:main Apr 2, 2026
8 checks passed
linyueqian pushed a commit to JuanPZuluaga/vllm-omni that referenced this pull request Apr 3, 2026
…#1777)

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…#1777)

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 11, 2026
…ript + standard loader

Signed-off-by: lishunyang <lishunyang12@163.com>

Labels

ready (label to trigger buildkite CI)

8 participants