[AutoRound] Support GLM-Image W4A16 quantization model by lvliang-intel · Pull Request #3059 · vllm-project/vllm-omni

lvliang-intel · 2026-04-23T06:32:54Z

Purpose

Support GLM-Image W4A16 AutoRound quantization in vLLM-Omni, extending the existing AutoRound W4A16 infrastructure (originally built for FLUX and Qwen3-Omni) to the GLM-Image diffusion model. This shrinks the checkpoint from ~34 GB to ~13 GB and cuts peak GPU memory by about 41% on A100, with only a small accuracy drop on TIIF-Bench.

https://huggingface.co/Intel/GLM-Image-int4-AutoRound

Related: #1325, #1777, #2670

Key changes:
- Replace all nn.Linear / ColumnParallelLinear / RowParallelLinear projection layers in the GLM-Image DiT with their quantization-aware vLLM counterparts (ReplicatedLinear, ColumnParallelLinear, and RowParallelLinear, each constructed with quant_config).
- Add .contiguous() calls on the inputs to RowParallelLinear (required by the FP8/W4A16 kernels) and unpack the tuple returned by ReplicatedLinear.
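The two adjustments in the second item can be illustrated with a minimal PyTorch sketch. It does not use the real vLLM layers: `TupleLinear` and `ToyAttentionOut` are hypothetical stand-ins, and the only behavior assumed from vLLM is that its linear layers return an `(output, bias)` tuple rather than a bare tensor.

```python
import torch
import torch.nn as nn


class TupleLinear(nn.Module):
    """Hypothetical stand-in for a vLLM-style linear layer.

    It only mimics the (output, output_bias) return convention; it is not
    the actual ReplicatedLinear / RowParallelLinear implementation.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor):
        return self.linear(x), None


class ToyAttentionOut(nn.Module):
    """Toy output projection applied after attention (illustrative only)."""

    def __init__(self, heads: int, head_dim: int):
        super().__init__()
        self.to_out = TupleLinear(heads * head_dim, heads * head_dim)

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn has shape (batch, heads, seq, head_dim).
        b, h, s, d = attn.shape
        hidden = attn.transpose(1, 2)  # a non-contiguous view
        # Make the memory layout dense before the projection, then merge heads.
        hidden = hidden.contiguous().view(b, s, h * d)
        # Unpack the tuple; the second element (bias) is unused here.
        out, _ = self.to_out(hidden)
        return out


x = torch.randn(2, 4, 3, 8)
y = ToyAttentionOut(heads=4, head_dim=8)(x)
print(y.shape)
```

In this sketch the `.contiguous()` call is also what makes the `.view(...)` legal, since a transposed tensor cannot be viewed with merged dimensions.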

Test Plan

E2E offline inference tests added.
TIIF-Bench accuracy evaluation test.
DPG-Bench accuracy evaluation test.

Test Result

TIIF-Bench Accuracy (9 Sub-Attributes Average)

| Model | overall-short | overall-long |
| --- | --- | --- |
| glm-image-ar-w4a16 | 0.8175 | 0.8645 |
| glm-image (BF16 baseline) | 0.8277 | 0.8903 |

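Summary:

Relative to the BF16 baseline, the W4A16 quantized model retains:
- 98.8% of the score (overall-short)
- 97.1% of the score (overall-long)

Average relative accuracy drop: ~2.1% (1.2% short, 2.9% long). The degradation is small and within an acceptable range for 4-bit quantization.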
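The retention figures can be recomputed from the TIIF-Bench table. The snippet below is a small sanity check, not part of the PR; the dictionary layout and the helper name `retention_pct` are illustrative choices.

```python
# TIIF-Bench scores copied from the table above.
scores = {
    "overall-short": {"w4a16": 0.8175, "bf16": 0.8277},
    "overall-long": {"w4a16": 0.8645, "bf16": 0.8903},
}


def retention_pct(quantized: float, baseline: float) -> float:
    """Share of the baseline score kept by the quantized model, in percent."""
    return 100.0 * quantized / baseline


retention = {
    name: retention_pct(s["w4a16"], s["bf16"]) for name, s in scores.items()
}

for name, pct in retention.items():
    print(f"{name}: retains {pct:.1f}%, relative drop {100.0 - pct:.1f}%")
```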

Model Size Reduction

| Component | BF16 Baseline | W4A16 AutoRound | Reduction |
| --- | --- | --- | --- |
| Total | ~34 GB | ~13 GB | ~62% |

Based on the totals above, the overall checkpoint is ~2.6× smaller.
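As a quick check on the Total row, the percentage reduction and the size ratio can both be derived from the two approximate checkpoint sizes. The helper name `size_reduction` is illustrative, and the inputs are the rounded figures from the table.

```python
def size_reduction(baseline_gb: float, quantized_gb: float) -> tuple[float, float]:
    """Return (percent reduction, baseline/quantized size ratio)."""
    reduction_pct = 100.0 * (1.0 - quantized_gb / baseline_gb)
    ratio = baseline_gb / quantized_gb
    return reduction_pct, ratio


# Approximate totals from the table: ~34 GB (BF16) and ~13 GB (W4A16).
reduction_pct, ratio = size_reduction(34.0, 13.0)
print(f"reduction: {reduction_pct:.0f}%, ratio: {ratio:.1f}x")
```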

E2E Generation Smoke Test

✅ Text-to-Image: Functional
✅ Image-to-Image: Functional
✅ Output Quality: Valid, non-blank images
✅ Resolution: 256 × 256

The quantized W4A16 model maintains full pipeline functionality with no critical degradation in generation behavior.
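One plausible way to express the "valid, non-blank" check is sketched below. This is not the test added in the PR: the function `is_valid_image`, its parameters, and the raw-bytes input format are all assumptions made for illustration.

```python
import random


def is_valid_image(
    pixels: bytes,
    width: int = 256,
    height: int = 256,
    channels: int = 3,
    min_range: int = 1,
) -> bool:
    """Hypothetical smoke check: correct size and not a single flat color."""
    if len(pixels) != width * height * channels:
        return False
    # A blank image has identical values everywhere, so its value range is 0.
    return max(pixels) - min(pixels) >= min_range


n = 256 * 256 * 3
blank = bytes(n)  # all-zero (black) image
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(n))

print(is_valid_image(blank), is_valid_image(noisy))
```

A value-range test only catches fully flat outputs; a real test would likely also compare against a reference image or check a quality metric.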

Performance Test Result on A100

| Metric | BF16 (Original) | W4A16 (AutoRound) | Δ |
| --- | --- | --- | --- |
| Latency Mean | 60.21 s | 57.98 s | -3.7% |
| Latency Median | 60.21 s | 57.96 s | -3.7% |
| Latency P50 | 60.21 s | 57.96 s | -3.7% |
| Latency P95 | 60.38 s | 58.20 s | -3.6% |
| Latency P99 | 60.43 s | 58.26 s | -3.6% |
| Throughput | 0.0166 qps | 0.0172 qps | +3.8% |
| Peak Memory | 23294 MB (22.8 GiB) | 13680 MB (13.4 GiB) | -41.3% |
| Requests | 64 | 64 | n/a |
| Duration | 3853 s | 3711 s | -3.7% |
| Failed Requests | 0 | 0 | n/a |
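The Δ column and the throughput values can be reproduced from the raw numbers in the table. The snippet below is a sketch for verification only; the variable names are illustrative, and throughput is assumed to be requests divided by total duration.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return 100.0 * (new - baseline) / baseline


requests = 64
bf16 = {"latency_mean_s": 60.21, "duration_s": 3853, "peak_mem_mb": 23294}
w4a16 = {"latency_mean_s": 57.98, "duration_s": 3711, "peak_mem_mb": 13680}

# Throughput in queries per second, assuming qps = requests / duration.
bf16_qps = requests / bf16["duration_s"]
w4a16_qps = requests / w4a16["duration_s"]

latency_delta = pct_change(bf16["latency_mean_s"], w4a16["latency_mean_s"])
throughput_delta = pct_change(bf16_qps, w4a16_qps)
memory_delta = pct_change(bf16["peak_mem_mb"], w4a16["peak_mem_mb"])

print(f"latency: {latency_delta:+.1f}%")
print(f"throughput: {bf16_qps:.4f} -> {w4a16_qps:.4f} qps ({throughput_delta:+.1f}%)")
print(f"peak memory: {memory_delta:+.1f}%")
```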

hsliuustc0106

BLOCKING:

Documentation: the AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:

| GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |

hsliuustc0106 · 2026-04-23T09:46:03Z

please add the latency test results as well

lvliang-intel · 2026-04-23T14:04:42Z

> please add the latency test results as well

Will update the performance test result soon.

lvliang-intel · 2026-04-23T14:07:56Z

> BLOCKING:
>
> Documentation: the AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:
>
> | GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |

Thanks for the reminder. Doc updated.

zhumingjue138 · 2026-04-24T06:56:48Z

Please add the necessary unit test cases.

lvliang-intel · 2026-04-25T10:43:09Z

> Please add the necessary unit test cases.

Unit tests added.

lishunyang12 · 2026-05-02T04:17:51Z

Can you try with a longer sequence?

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lvliang-intel · 2026-05-04T12:41:27Z

> Can you try with a longer sequence?

Sure, I will run the performance test with a longer sequence.

david6666666 · 2026-05-18T06:13:16Z

Merge conflicts need fixing before review.

lvliang-intel requested a review from hsliuustc0106 as a code owner April 23, 2026 06:32

lvliang-intel changed the title ~~Feats/ar w4a16 glm image~~ [AutoRound] Support GLM-Image W4A16 quantization model Apr 23, 2026

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from f4723c4 to e194970 Compare April 23, 2026 07:14

lvliang-intel mentioned this pull request Apr 23, 2026

[vllm-omni]: Omni Quant Support intel/auto-round#1507

Open

hsliuustc0106 reviewed Apr 23, 2026

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from 5f0f515 to ef8ec3f Compare April 23, 2026 14:07

lishunyang12 mentioned this pull request Apr 24, 2026

[RFC]: Continuous Quantization Support #1854

Open

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from ef8ec3f to d6af213 Compare April 25, 2026 10:41

lishunyang12 reviewed May 2, 2026

Comment thread vllm_omni/diffusion/models/glm_image/glm_image_transformer.py Outdated

lishunyang12 reviewed May 2, 2026

Comment thread vllm_omni/diffusion/models/glm_image/glm_image_transformer.py

lishunyang12 reviewed May 2, 2026

Comment thread tests/e2e/offline_inference/test_glm_image_autoround_w4a16.py Outdated

lvliang-intel added 4 commits May 4, 2026 12:32

support glm-image w4a16 with autoround

dbaae4c

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

add e2e test

ec729a9

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

fix lint

07409dd

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

fix pre commit

4bac2d5

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from 91920fc to 0bf6bbb Compare May 4, 2026 12:38

lvliang-intel added 4 commits May 4, 2026 12:40

update doc

c06e740

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

add ut

acbced7

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

fix pre commit

fc832db

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

fix comments

0bf6bbb

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

yiliu30 mentioned this pull request May 7, 2026

[RFC]: Intel Auto-Round x vLLM-Omni Quantization Support (2026 H1) #1325

Open

3 tasks

david6666666 mentioned this pull request May 8, 2026

[RFC] [0.22.0]: Quantization Support JiusiServe/vllm-omni#182

Open

8 tasks

Gaohan123 added this to the v0.22.0 milestone May 11, 2026

lvliang-intel mentioned this pull request May 13, 2026

[Feature]: Load/Evaluate W4A16 zai-org/GLM-Image on vllm-omni intel/auto-round#1510

Closed

2 tasks

lvliang-intel wants to merge 8 commits into vllm-project:main from lvliang-intel:feats/ar-w4a16-glm-image

6 participants
