[AutoRound] Add offline quantized W4A16 model support#1777

Merged
lishunyang12 merged 18 commits into vllm-project:main from yiliu30:feats/ar-w4a16
Apr 2, 2026

Conversation

Contributor

@yiliu30 yiliu30 commented Mar 10, 2026


Purpose

Load offline AutoRound-quantized W4A16 models. Part of #1325.
Needs a rebase after #1764 lands -- DONE

```bash
python examples/offline_inference/text_to_image/text_to_image.py \
    --model Yi30/FLUX.1-dev-AutoRound-w4a16 \
    --prompt "a cup of coffee on the table" \
    --seed 42 \
    --guidance-scale 3.5 \
    --num-images-per-prompt 1 \
    --num-inference-steps 50 \
    --height 1024 \
    --width 1024 \
    --enforce-eager \
    --output outputs/skier_autoround_w4a16.png
```
Prompts used for evaluation:
  • a cup of coffee on the table
  • a cat sitting on a windowsill at sunset
  • a mountain landscape with a lake reflection
| Metric | BF16 Baseline | W4A16 |
| --- | --- | --- |
| Model Mem (GiB) | 35.66 | 20.69 |
| Mem Reduction | -- | 42% |
| Mean LPIPS (ref) | -- | 0.1513 |
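
For reproducibility, a minimal sketch of how the mean LPIPS above can be computed, assuming the `lpips` PyPI package and pre-generated BF16/W4A16 image pairs (the file names below are hypothetical):

```python
# Mean LPIPS over BF16-vs-W4A16 image pairs (assumes `pip install lpips`).
import lpips
import torch
from torchvision.io import read_image

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, the package default

def to_lpips_input(path: str) -> torch.Tensor:
    img = read_image(path).float() / 255.0  # (3, H, W) in [0, 1]
    return (img * 2.0 - 1.0).unsqueeze(0)   # LPIPS expects NCHW in [-1, 1]

# Hypothetical output files for the three prompts above.
pairs = [
    ("bf16_coffee.png", "w4a16_coffee.png"),
    ("bf16_cat.png", "w4a16_cat.png"),
    ("bf16_mountain.png", "w4a16_mountain.png"),
]
with torch.no_grad():
    scores = [loss_fn(to_lpips_input(a), to_lpips_input(b)).item() for a, b in pairs]
print(f"Mean LPIPS: {sum(scores) / len(scores):.4f}")
```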

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


cc @hshen14 @thuang6 @lvliang-intel @mengniwang95 @xuechendi

@xuechendi
Contributor

Since AdaLayerNormContinuous, AdaLayerNormZero, and AdaLayerNormZeroSingle seem to be general changes to support quant_config, how about moving the current impl to vllm_omni/model_executor/layers/layer_norm.py so it can be used in other modeling code as well?

@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 18, 2026
@yiliu30
Contributor Author

yiliu30 commented Mar 18, 2026

> Since AdaLayerNormContinuous, AdaLayerNormZero, and AdaLayerNormZeroSingle seem to be general changes to support quant_config, how about moving the current impl to vllm_omni/model_executor/layers/layer_norm.py so it can be used in other modeling code as well?

Good catch, updated!

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


PR Review: [WIP][AutoRound] Add offline quantized `W4A16` model support

Gate Status: BLOCKED

| Check | Status |
| --- | --- |
| DCO | ACTION_REQUIRED |
| pre-commit | SUCCESS |
| build (3.11) | SUCCESS |
| build (3.12) | SUCCESS |
| docs | SUCCESS |

The DCO check is failing. Please sign off commits with `git commit -s` and force push.
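
One way to do that retroactively (a sketch; adjust the base ref to this branch's actual fork point):

```bash
# Re-apply a Signed-off-by trailer to every commit on the branch,
# then force-push the rewritten history.
git rebase --signoff origin/main
git push --force-with-lease
```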


Critical Findings

1. Missing Tests [HIGH PRIORITY]

The PR body has unchecked test checkboxes and provides no test scripts or results. Per the test coverage requirements, quantization changes need:

  • Memory savings measurement (quantized vs FP16)
  • Quality impact measurement
  • Regression tests for the new code path

Action: Add test scripts showing memory comparison and quality verification.

2. Docstring Error in `inc.py`

```python
"""GPTQ quantization config for diffusion transformers."""
```

This should say "INC/AutoRound quantization config" not "GPTQ".

3. Incomplete Code in `resolve_quantization()`

```python
# self.quantization_config._vllm_config.maybe_update_config(f"{self.model}/transformer/")
```

This commented-out line suggests incomplete work. Either implement or remove.

4. Potential Silent Failure in Weight Loading

```python
if od_config.quantization is None:
    raise ValueError(
        f"Following weights were not initialized from checkpoint: {weights_not_loaded}"
    )
else:
    logger.warning(...)  # Only logs, doesn't fail
```

This could mask real weight loading issues in quantized models. Consider whether missing weights should always be an error, or validate against expected quantized weight names.
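
One possible shape for that validation (a hypothetical sketch; the suffix list is an assumption based on common GPTQ-style checkpoints, not the PR's actual code):

```python
# Only weights whose names end in known quantized-tensor suffixes may
# legitimately be absent under quantization; anything else should hard-fail.
QUANTIZED_SUFFIXES = (".qweight", ".qzeros", ".scales", ".g_idx")

def check_unloaded_weights(weights_not_loaded: set[str], quantization: str | None) -> None:
    if quantization is None:
        unexpected = weights_not_loaded
    else:
        unexpected = {w for w in weights_not_loaded if not w.endswith(QUANTIZED_SUFFIXES)}
    if unexpected:
        raise ValueError(
            f"Following weights were not initialized from checkpoint: {sorted(unexpected)}"
        )
```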


Summary

| Validated | Lacks Evidence | Must Change |
| --- | --- | --- |
| FLUX transformer prefix propagation | Memory savings vs FP16 | DCO sign-off |
| AdaLayerNorm quantization support | Quality comparison output | Test scripts & results |
| Quantization registry integration | Benchmark data | Docstring fix in `inc.py` |

🤖 Generated with Claude Code

Collaborator

@lishunyang12 lishunyang12 left a comment


left one comment on the quant resolution approach

Collaborator

@lishunyang12 lishunyang12 left a comment


A few issues with the quant resolution and the weight loading change.

@yiliu30 yiliu30 changed the title from [WIP][AutoRound] Add offline quantized W4A16 model support to [AutoRound] Add offline quantized W4A16 model support Mar 20, 2026
@yiliu30 yiliu30 marked this pull request as ready for review March 20, 2026 06:55

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: be6f954ee2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@hsliuustc0106 hsliuustc0106 self-requested a review March 24, 2026 14:09
@yiliu30 yiliu30 force-pushed the feats/ar-w4a16 branch 5 times, most recently from 3dade7e to b102fc4 March 25, 2026 03:09
yiliu30 added 4 commits March 25, 2026 03:13
Register INC and AutoRound as quantization backends in the unified
quantization factory, enable auto-detection of quantization config
from transformer config.json, and validate unloaded weights during
model loading.

Key changes:
- Add _build_inc() to quantization factory with checkpoint kwarg
  normalization (bits→weight_bits) and INCConfig param filtering
- Extend TransformerConfig.from_dict to parse embedded
  quantization_config and build QuantizationConfig automatically
- Add OmniDiffusionConfig.set_tf_model_config() for late-stage
  quantization propagation when config is loaded after construction
- Add weight validation in diffusers_loader to distinguish expected
  quantized weight suffixes from truly missing weights

Signed-off-by: yiliu30 <yi4.liu@intel.com>
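
As a rough illustration of the checkpoint-kwarg normalization described above (a sketch with assumed names, not the PR's exact code):

```python
# Map checkpoint keys (e.g. "bits") onto the config's parameter names
# (e.g. "weight_bits") and drop anything the constructor does not accept.
import inspect

KEY_ALIASES = {"bits": "weight_bits"}

def normalize_quant_kwargs(raw_cfg: dict, config_cls: type) -> dict:
    renamed = {KEY_ALIASES.get(k, k): v for k, v in raw_cfg.items()}
    accepted = set(inspect.signature(config_cls.__init__).parameters) - {"self"}
    return {k: v for k, v in renamed.items() if k in accepted}
```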
Replace diffusers AdaLayerNorm with quantization-aware variants that
accept quant_config and prefix parameters, enabling GPTQ-Marlin
kernels on AdaLayerNorm linear layers. Propagate prefix strings
through FLUX transformer blocks for correct weight name resolution.

Key changes:
- Add AdaLayerNormZero, AdaLayerNormZeroSingle, and
  AdaLayerNormContinuous using ReplicatedLinear
- Pass quant_config and prefix to all AdaLayerNorm constructors
  in FluxTransformerBlock and FluxSingleTransformerBlock
- Pass prefix to ReplicatedLinear in FluxSingleTransformerBlock

Signed-off-by: yiliu30 <yi4.liu@intel.com>
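
For context, a quantization-aware AdaLayerNormContinuous can look roughly like this (a sketch, not the PR's exact code; it assumes vLLM's ReplicatedLinear, whose forward returns an (output, bias) tuple):

```python
import torch
import torch.nn as nn
from vllm.model_executor.layers.linear import ReplicatedLinear

class AdaLayerNormContinuous(nn.Module):
    """Diffusers-style AdaLayerNormContinuous with a quantizable projection."""

    def __init__(self, embedding_dim: int, conditioning_embedding_dim: int,
                 quant_config=None, prefix: str = ""):
        super().__init__()
        self.silu = nn.SiLU()
        # ReplicatedLinear threads quant_config and prefix down, so a GPTQ-Marlin
        # (or other) kernel can replace the matmul and the weight loader can
        # resolve quantized checkpoint names under f"{prefix}.linear".
        self.linear = ReplicatedLinear(conditioning_embedding_dim, embedding_dim * 2,
                                       bias=True, quant_config=quant_config,
                                       prefix=f"{prefix}.linear")
        self.norm = nn.LayerNorm(embedding_dim, elementwise_affine=False, eps=1e-6)

    def forward(self, x: torch.Tensor, conditioning: torch.Tensor) -> torch.Tensor:
        emb, _ = self.linear(self.silu(conditioning))
        scale, shift = emb.chunk(2, dim=1)
        return self.norm(x) * (1 + scale)[:, None, :] + shift[:, None, :]
```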
Add unit and E2E tests covering the new INC/AutoRound quantization
support, quantizable AdaLayerNorm layers, and FLUX transformer
prefix propagation.

Key changes:
- test_inc_config: INC factory, checkpoint kwarg normalization,
  bits→weight_bits mapping, invalid param filtering
- test_adalayernorm: shape, gradient, quantization-config
  acceptance for AdaLayerNormZero/Single/Continuous
- test_flux_prefix_propagation: verify prefix strings propagate
  correctly through FLUX transformer blocks
- test_flux_autoround_w4a16: E2E offline inference with
  AutoRound W4A16 quantized FLUX model

Signed-off-by: yiliu30 <yi4.liu@intel.com>
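
The prefix-propagation check can be sketched generically as below (hedged: it assumes quantizable submodules record a `prefix` attribute matching their state-dict name, as vLLM linear layers do):

```python
import torch.nn as nn

def assert_prefixes_propagated(model: nn.Module) -> None:
    # Every submodule that carries a `prefix` should agree with its fully
    # qualified module name, so quantized checkpoint weights resolve correctly.
    for name, module in model.named_modules():
        prefix = getattr(module, "prefix", None)
        if isinstance(prefix, str) and prefix:
            assert name == prefix, f"prefix mismatch: {name!r} != {prefix!r}"
```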
Add user-facing documentation for AutoRound diffusion quantization
covering W4A16 as the first supported scheme, with a roadmap for
additional schemes. Update the quantization overview table.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Contributor Author

yiliu30 commented Mar 30, 2026

Hi @david6666666, could you take a look and help trigger the CI, thanks!

@david6666666 david6666666 added the ready label (label to trigger buildkite CI) Mar 30, 2026
@david6666666
Collaborator

@yiliu30 please fix CI, otherwise LGTM

yiliu30 added 4 commits March 30, 2026 05:26
Fix AttributeError in _check_unloaded_weights
when od_config is absent, and fix ROCm GEMM
dispatch crash for CPU-tensor adalayernorm tests.

Refs: vllm-project#1777
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30
Contributor Author

yiliu30 commented Mar 30, 2026

Hi @david6666666, the mi325_2: Omni Model Test Qwen3-Omni failed. Could you help check whether it's related to this PR or just a CI issue?

@david6666666
Collaborator

> Hi @david6666666, the mi325_2: Omni Model Test Qwen3-Omni failed. Could you help check whether it's related to this PR or just a CI issue?

Not related to this PR; just wait for buildkite/vllm-omni to pass.

@david6666666
Collaborator

david6666666 commented Mar 31, 2026

@hsliuustc0106 ptal thx

@lishunyang12 lishunyang12 enabled auto-merge (squash) April 2, 2026 05:57
@lishunyang12 lishunyang12 merged commit c1d2dcc into vllm-project:main Apr 2, 2026
8 checks passed
linyueqian pushed a commit to JuanPZuluaga/vllm-omni that referenced this pull request Apr 3, 2026
…#1777)

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…#1777)

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
lishunyang12 added a commit to lishunyang12/vllm-omni that referenced this pull request Apr 11, 2026
…ript + standard loader

Signed-off-by: lishunyang <lishunyang12@163.com>

Labels

ready (label to trigger buildkite CI)

8 participants