[AutoRound] Support GLM-Image W4A16 quantization model#3059
Open
lvliang-intel wants to merge 8 commits into
f4723c4 to
e194970
Compare
Collaborator
hsliuustc0106
left a comment
BLOCKING:
- Documentation: the AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:
| GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |
Collaborator
Please add the latency test results as well.
Contributor
Author
I will update the performance test results soon.
5f0f515 to
ef8ec3f
Compare
Contributor
Author
Thanks for the reminder. The doc has been updated.
Contributor
Please add the necessary unit test cases.
ef8ec3f to
d6af213
Compare
Contributor
Author
Unit tests added.
Collaborator
Can you try with a longer sequence?
lishunyang12
reviewed
May 2, 2026
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
91920fc to
0bf6bbb
Compare
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Contributor
Author
Sure, I will run the performance test with a longer sequence.
Collaborator
Merge conflicts need fixing before review.
Purpose
Support GLM-Image W4A16 AutoRound quantization in vLLM-Omni, extending the existing AutoRound W4A16 infrastructure (originally built for FLUX and Qwen3-Omni) to the GLM-Image diffusion model. This shrinks the model by ~4x and lowers the GPU memory footprint while preserving generation quality.
https://huggingface.co/Intel/GLM-Image-int4-AutoRound
Related: #1325, #1777, #2670
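For readers unfamiliar with the W4A16 format, the following is a minimal sketch of group-wise 4-bit weight quantization with 16-bit activations left untouched. It uses plain round-to-nearest, not AutoRound's tuned rounding, and the function names, the asymmetric zero-point scheme, and the group size of 128 are illustrative assumptions rather than details taken from this PR.

```python
import math


def quantize_group(weights, bits=4):
    """Round-to-nearest asymmetric quantization of one group (illustrative only)."""
    qmax = (1 << bits) - 1  # 15 for 4-bit codes
    # Extend the range to include zero so the zero point fits in [0, qmax].
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / qmax
    if scale == 0.0:
        scale = 1.0  # constant all-zero group; any scale reproduces it
    zero_point = round(-lo / scale)
    codes = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Recover approximate floating-point weights from the stored codes."""
    return [(c - zero_point) * scale for c in codes]


def quantize(weights, group_size=128, bits=4):
    """Quantize a flat weight list, one (codes, scale, zero_point) per group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


# Hypothetical weights: two groups of 128 values each.
weights = [math.sin(i) for i in range(256)]
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), max_error)
```

Each group stores 4-bit codes plus one scale and one zero point, which is where the size reduction comes from; the reconstruction error per weight is bounded by roughly half the group's scale.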
Key changes:
Replace all nn.Linear / ColumnParallelLinear / RowParallelLinear projection layers in the GLM-Image DiT with their vLLM quantization-aware counterparts (ReplicatedLinear, ColumnParallelLinear, and RowParallelLinear, each constructed with quant_config).
Add contiguous calls before RowParallelLinear (required for FP8/W4A16 kernels) and tuple unpacking for ReplicatedLinear outputs.
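The two call-site adjustments above can be sketched in plain PyTorch. This is not the PR's code: `TupleLinear` is a hypothetical stand-in that only mimics the `(output, output_bias)` return convention of vLLM's linear layers, and `AttentionOutput` is a toy module invented for illustration.

```python
import torch
import torch.nn as nn


class TupleLinear(nn.Module):
    """Hypothetical stand-in for a vLLM-style linear layer (not a vLLM class).

    It returns (output, output_bias) instead of a bare tensor.
    """

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        self.inner = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor):
        return self.inner(x), None


class AttentionOutput(nn.Module):
    """Toy output projection applied after merging attention heads."""

    def __init__(self, num_heads: int, head_dim: int) -> None:
        super().__init__()
        hidden = num_heads * head_dim
        self.to_out = TupleLinear(hidden, hidden)

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (batch, heads, seq, head_dim) -> (batch, seq, heads, head_dim)
        batch, heads, seq, head_dim = attn.shape
        merged = attn.transpose(1, 2)
        # transpose() yields a non-contiguous view; densify it before the
        # projection, since quantized kernels may require contiguous input.
        merged = merged.contiguous().view(batch, seq, heads * head_dim)
        # Unpack the tuple and keep only the output tensor.
        out, _ = self.to_out(merged)
        return out


attn = torch.randn(2, 4, 16, 8)
is_contiguous_before = attn.transpose(1, 2).is_contiguous()
out = AttentionOutput(num_heads=4, head_dim=8)(attn)
print(is_contiguous_before, tuple(out.shape))
```

Swapping a plain `nn.Linear` for a layer with this return convention breaks any call site that treats the result as a tensor, which is why the unpacking has to be added wherever the layer type changes.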
Test Plan
E2E offline inference tests added.
TIIF-Bench accuracy evaluation test.
DPG-Bench accuracy evaluation test.
Test Result
TIIF-Bench Accuracy (9 Sub-Attributes Average)
Summary:
Average accuracy drop: ~1.3%. Accuracy degradation is minimal and within an acceptable range for 4-bit quantization.
Model Size Reduction
Overall checkpoint is ~3.8× smaller
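As a rough sanity check on that figure, the per-layer compression of a group-quantized weight can be estimated with simple arithmetic. The sketch below assumes a 16-bit baseline, a group size of 128 (taken to be the meaning of the `128` column in the reviewer's table row), and one 16-bit scale plus one 4-bit zero point per group; none of these storage details are confirmed by the PR, and the function names are illustrative.

```python
def bits_per_weight(weight_bits: int, group_size: int,
                    scale_bits: int = 16, zero_point_bits: int = 4) -> float:
    """Average storage per weight, including per-group metadata (assumed layout)."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def compression_ratio(baseline_bits: int, weight_bits: int, group_size: int) -> float:
    """Size ratio of a baseline-precision layer to its quantized form."""
    return baseline_bits / bits_per_weight(weight_bits, group_size)


effective = bits_per_weight(weight_bits=4, group_size=128)
ratio = compression_ratio(baseline_bits=16, weight_bits=4, group_size=128)
print(f"{effective:.3f} bits per weight, {ratio:.2f}x smaller")
```

Under these assumptions a quantized layer is about 3.85x smaller than its 16-bit original. The whole-checkpoint ratio also depends on which components stay unquantized, so this estimate is only a ballpark for the reported number.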
E2E Generation Smoke Test
The quantized W4A16 model maintains full pipeline functionality with no critical degradation in generation behavior.
Performance Test Result on A100