[AutoRound] Support GLM-Image W4A16 quantization model #3059

Open
lvliang-intel wants to merge 8 commits into vllm-project:main from lvliang-intel:feats/ar-w4a16-glm-image

@lvliang-intel
Contributor

@lvliang-intel lvliang-intel commented Apr 23, 2026

Purpose

Support GLM-Image W4A16 AutoRound quantization in vLLM-Omni, extending the existing AutoRound W4A16 infrastructure (originally built for FLUX and Qwen3-Omni) to the GLM-Image diffusion model. In the measurements below, the checkpoint shrinks from ~34 GB to ~13 GB and peak GPU memory drops by ~41%, while generation quality is largely preserved.

https://huggingface.co/Intel/GLM-Image-int4-AutoRound

Related: #1325, #1777, #2670
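
For readers unfamiliar with the W4A16 format, here is a minimal sketch of group-wise 4-bit weight quantization. It is not the AutoRound algorithm: AutoRound tunes how weights are rounded, whereas this sketch swaps in plain round-to-nearest. The group size of 128 is the value listed for this checkpoint in the AutoRound docs table row; the 16-bit scale and 4-bit zero point per group, and all function names, are assumptions made for illustration.

```python
from typing import List, Tuple

GROUP_SIZE = 128  # group size listed for this checkpoint in the docs table row
QMAX = 15         # unsigned 4-bit codes span 0..15


def quantize_group(weights: List[float]) -> Tuple[List[int], float, float]:
    """Asymmetric round-to-nearest quantization of one weight group to 4 bits."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / QMAX or 1.0  # fall back to 1.0 when all weights are equal
    codes = [min(QMAX, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Recover approximate weights; activations stay 16-bit (the "A16" part)."""
    return [c * scale + lo for c in codes]


def effective_bits_per_weight(group_size: int = GROUP_SIZE,
                              scale_bits: int = 16,   # assumption: one 16-bit scale per group
                              zero_bits: int = 4) -> float:  # assumption: one 4-bit zero point per group
    """Storage cost per weight once per-group metadata is amortized."""
    return 4 + (scale_bits + zero_bits) / group_size


if __name__ == "__main__":
    group = [i / (GROUP_SIZE - 1) - 0.5 for i in range(GROUP_SIZE)]
    codes, scale, lo = quantize_group(group)
    restored = dequantize_group(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"max abs error: {worst:.5f} (bound: {scale / 2:.5f})")
    print(f"effective bits per weight: {effective_bits_per_weight():.3f}")
```

Under these assumptions a quantized layer costs about 4.16 bits per weight, roughly 3.8x less than BF16. Any components left in 16-bit would keep the whole-checkpoint ratio lower than that.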

Key changes:
  • Replace all nn.Linear / ColumnParallelLinear / RowParallelLinear projection layers in the GLM-Image DiT with their vLLM quantization-aware counterparts (ReplicatedLinear, ColumnParallelLinear, and RowParallelLinear, each constructed with quant_config).
  • Add contiguous calls before RowParallelLinear (required by the FP8/W4A16 kernels) and tuple unpacking for the ReplicatedLinear output.
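
The tuple-unpacking change can be illustrated with a dependency-free sketch. vLLM linear layers return an (output, output_bias) pair from their forward call, unlike torch.nn.Linear, which returns the output alone, so call sites written for nn.Linear need to unpack. The classes below are hypothetical stand-ins written for this illustration, not vLLM classes and not the code in this PR.

```python
from typing import List, Optional, Tuple

Vector = List[float]


class TupleReturningLinear:
    """Hypothetical stand-in for a vLLM-style linear layer.

    Mimics only the calling convention: the call returns (output, output_bias).
    """

    def __init__(self, weight: List[Vector], quant_config: Optional[dict] = None):
        self.weight = weight              # one row per output feature
        self.quant_config = quant_config  # in vLLM this would select the quantized kernel

    def __call__(self, x: Vector) -> Tuple[Vector, Optional[Vector]]:
        out = [sum(w * xi for w, xi in zip(row, x)) for row in self.weight]
        return out, None  # this sketch has no separate bias to hand back


class ProjectionBlock:
    """Caller originally written for nn.Linear that now unpacks the tuple."""

    def __init__(self, quant_config: Optional[dict] = None):
        self.proj = TupleReturningLinear([[1.0, 0.0], [0.0, 2.0]], quant_config)

    def forward(self, x: Vector) -> Vector:
        out, _ = self.proj(x)  # without unpacking, later ops would receive a tuple
        return out


if __name__ == "__main__":
    block = ProjectionBlock(quant_config={"bits": 4, "group_size": 128})
    print(block.forward([3.0, 4.0]))
```

The contiguous calls are not shown here because they only make sense on real tensors; in PyTorch that would be a `.contiguous()` call on the tensor passed to the row-parallel projection.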

Test Plan

E2E offline inference tests added.
TIIF-Bench accuracy evaluation test.
DPG-Bench accuracy evaluation test.

Test Result

TIIF-Bench Accuracy (9 Sub-Attributes Average)

| Model | overall-short | overall-long |
|---|---|---|
| glm-image-ar-w4a16 | 0.8175 | 0.8645 |
| glm-image (BF16 baseline) | 0.8277 | 0.8903 |

Summary:

  • The W4A16 quantized model retains, relative to the BF16 baseline:
    • 98.8% (short)
    • 97.1% (long)
  • Average relative accuracy drop: ~2.1% (1.2% short, 2.9% long). The degradation is small and within an acceptable range for 4-bit quantization.
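
The retention figures can be reproduced from the TIIF-Bench scores reported in this description. The snippet below is only a convenience for checking the arithmetic; the dictionary layout and the helper name are made up for this example.

```python
# Scores copied from the TIIF-Bench results reported in this PR description.
scores = {
    "overall-short": {"bf16": 0.8277, "w4a16": 0.8175},
    "overall-long": {"bf16": 0.8903, "w4a16": 0.8645},
}


def retention_pct(bf16: float, w4a16: float) -> float:
    """Quantized score as a percentage of the BF16 baseline score."""
    return 100.0 * w4a16 / bf16


if __name__ == "__main__":
    for name, s in scores.items():
        kept = retention_pct(s["bf16"], s["w4a16"])
        print(f"{name}: {kept:.1f}% retained, {100.0 - kept:.1f}% relative drop")
```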

Model Size Reduction

| Component | BF16 Baseline | W4A16 AutoRound | Reduction |
|---|---|---|---|
| Total | ~34 GB | ~13 GB | ~62% |

Overall, the checkpoint is ~2.6× smaller (~34 GB → ~13 GB).


E2E Generation Smoke Test

  • ✅ Text-to-Image: Functional
  • ✅ Image-to-Image: Functional
  • ✅ Output Quality: Valid, non-blank images
  • ✅ Resolution: 256 × 256

The quantized W4A16 model maintains full pipeline functionality with no critical degradation in generation behavior.


Performance Test Result on A100

| Metric | BF16 (Original) | W4A16 (AutoRound) | Δ |
|---|---|---|---|
| Latency Mean | 60.21 s | 57.98 s | -3.7% |
| Latency Median | 60.21 s | 57.96 s | -3.7% |
| Latency P50 | 60.21 s | 57.96 s | -3.7% |
| Latency P95 | 60.38 s | 58.20 s | -3.6% |
| Latency P99 | 60.43 s | 58.26 s | -3.6% |
| Throughput | 0.0166 qps | 0.0172 qps | +3.8% |
| Peak Memory | 23294 MB (22.8 GiB) | 13680 MB (13.4 GiB) | -41.3% |
| Requests | 64 | 64 | |
| Duration | 3853 s | 3711 s | -3.7% |
| Failed Requests | 0 | 0 | |
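
The Δ column and the throughput row follow from the raw numbers above, assuming throughput is simply completed requests divided by total duration (the reported values are consistent with that). A short sketch for checking them, with a helper name chosen for this example:

```python
def pct_change(baseline: float, quantized: float) -> float:
    """Relative change of the W4A16 value against the BF16 baseline, in percent."""
    return 100.0 * (quantized - baseline) / baseline


REQUESTS = 64
BF16_DURATION_S, W4A16_DURATION_S = 3853, 3711

# Assumption: throughput = completed requests / wall-clock duration.
bf16_qps = REQUESTS / BF16_DURATION_S
w4a16_qps = REQUESTS / W4A16_DURATION_S

if __name__ == "__main__":
    print(f"throughput: {bf16_qps:.4f} -> {w4a16_qps:.4f} qps "
          f"({pct_change(bf16_qps, w4a16_qps):+.1f}%)")
    print(f"mean latency: {pct_change(60.21, 57.98):+.1f}%")
    print(f"peak memory: {pct_change(23294, 13680):+.1f}%")
```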


@lvliang-intel lvliang-intel changed the title from "Feats/ar w4a16 glm image" to "[AutoRound] Support GLM-Image W4A16 quantization model" Apr 23, 2026
@lvliang-intel lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from f4723c4 to e194970 April 23, 2026 07:14
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

BLOCKING:

  • Documentation — AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:

    | GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |

@hsliuustc0106
Collaborator

please add the latency test results as well

@lvliang-intel
Contributor Author

> please add the latency test results as well

Will update the performance test result soon.

@lvliang-intel lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from 5f0f515 to ef8ec3f April 23, 2026 14:07
@lvliang-intel
Contributor Author

> BLOCKING:
>
>   • Documentation — AutoRound documentation table should be updated. Please add GLM-Image to the supported models table in docs/user_guide/diffusion/quantization/autoround.md:
>     | GLM-Image | Intel/GLM-Image-int4-AutoRound | W4A16 | 128 | GPTQ-Marlin |

Thanks for the reminder. Doc updated.

@zhumingjue138
Contributor

Please add the necessary unit test cases.

@lvliang-intel
Contributor Author

> Please add the necessary unit test cases.

Unit tests added.

@lishunyang12
Collaborator

Can you try with a longer sequence?

Comment thread vllm_omni/diffusion/models/glm_image/glm_image_transformer.py Outdated
Comment thread vllm_omni/diffusion/models/glm_image/glm_image_transformer.py
Comment thread tests/e2e/offline_inference/test_glm_image_autoround_w4a16.py Outdated
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
@lvliang-intel lvliang-intel force-pushed the feats/ar-w4a16-glm-image branch from 91920fc to 0bf6bbb May 4, 2026 12:38
@lvliang-intel
Contributor Author

> Can you try with a longer sequence?

Sure, I will run the performance test with a longer sequence.

Collaborator

Merge conflicts need fixing before review.
