
[AWQ] Support for a module used in an AWQ mapping to be unquantized && other bug fixes#2158

Closed
ZewenShen-Cohere wants to merge 6 commits into vllm-project:main from ZewenShen-Cohere:awq_bugfix

Conversation

@ZewenShen-Cohere
Contributor

This PR addresses the following issues:

  1. In some cases, we want to keep specific nn.Modules used in AWQ mappings unquantized to preserve accuracy, such as the MoE gate. This PR introduces a new force_balance field to support this behavior (see the recipe sketch after this list). It also fixes issue #2151 and corrects the default MoE AWQ mapping for the post-attention layer norm, which previously did not account for balancing the MoE gate.

  2. _accumulate_mean produces incorrect output on its first run.

  3. cache_smooth_activations_hook previously performed the averaging computation on the CPU. When both the hidden dimension and sequence length are large, this makes AWQ calibration CPU-bound. The slowdown is especially severe when multiple AWQ quantization jobs run concurrently.

  4. Added more informative error logging to the AWQ calibration grid search.
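
To illustrate item 1, here is a minimal recipe sketch of how the proposed force_balance field might be used. The regex pattern, import path, and the other AWQModifier arguments are illustrative assumptions, not the exact configuration in this PR:

from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    scheme="W4A16",
    targets=["Linear"],
    # Keep the MoE router gate out of quantization entirely...
    ignore=["re:.*mlp.gate$"],
    # ...but still let it participate in smoothing/balancing so the
    # post-attention layer-norm mapping stays consistent at inference time.
    force_balance=["re:.*mlp.gate$"],
)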

Test Plan

Compare the accuracy of AWQ results produced by the new version against the previous version on the Qwen-3a30t model.
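
A hedged sketch of one way such an accuracy comparison could be spot-checked; the model identifier and evaluation text are placeholders, not the exact setup used in this test plan:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    # Score the same text with the baseline and the AWQ checkpoint, then compare.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# A small perplexity gap between the two checkpoints suggests accuracy is preserved.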

@gemini-code-assist
Contributor

Summary of Changes

Hello @ZewenShen-Cohere, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AWQ quantization process by introducing fine-grained control over which modules are quantized versus merely smoothed, which is particularly beneficial for maintaining accuracy in MoE architectures. It also addresses critical performance bottlenecks in the calibration phase by leveraging GPU computation and resolves a bug in mean accumulation. Furthermore, the enhanced logging for the grid search provides clearer insights into the calibration process.

Highlights

  • Selective Unquantization with force_balance: Introduced a new force_balance parameter in the AWQModifier to allow specific nn.Modules (e.g., MoE gate layers) to be smoothed but not quantized. This preserves accuracy for critical components and addresses issue [Bug]: qwen3-vl-30b-a3b v0.9.0 awq no matches found for input_layernorm #2151, along with correcting the default MoE AWQ mapping.
  • GPU-accelerated AWQ Calibration: Optimized the cache_smooth_activations_hook to perform activation mean accumulation on the GPU instead of the CPU. This significantly improves calibration speed, especially for large models and when running multiple AWQ quantization jobs concurrently.
  • _accumulate_mean Bug Fix: Corrected a bug in the _accumulate_mean utility function that caused it to produce incorrect output during its initial run.
  • Enhanced Grid Search Logging: Added a progress bar (tqdm) and more informative error logging to the AWQ calibration grid search. This now includes initial error, best error, and the error reduction rate, providing better observability into the calibration process.
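
For the logging highlight above, a hedged sketch of what such a grid-search summary might look like; the message format, field names, and wrapper function are assumptions, not the exact logging added in this PR:

import logging
from tqdm import tqdm

logger = logging.getLogger(__name__)

def run_grid_search(ratios, compute_error, mapping_name):
    # Try each candidate ratio, track the first and best errors seen,
    # and emit a per-mapping summary with the error reduction rate.
    init_error = best_error = best_ratio = None
    for ratio in tqdm(ratios, desc=f"AWQ grid search: {mapping_name}"):
        err = compute_error(ratio)
        if init_error is None:
            init_error = err
        if best_error is None or err < best_error:
            best_error, best_ratio = err, ratio
    reduction = (init_error - best_error) / init_error if init_error else 0.0
    logger.info(
        "%s: init_error=%.6f best_error=%.6f best_ratio=%.2f reduction=%.1f%%",
        mapping_name, init_error, best_error, best_ratio, 100 * reduction,
    )
    return best_ratio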



@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for keeping certain modules unquantized during AWQ while still including them in the smoothing process, using a new force_balance parameter. It also includes several important bug fixes and improvements, such as correcting the _accumulate_mean function, optimizing activation caching to be less CPU-bound, and enhancing logging for the grid search. My review identifies a critical issue in the grid search logic that could lead to suboptimal scaling factors, and a minor performance issue. The other changes, including bug fixes and usability improvements, are well-implemented.


# Q(W * s)
- for balance_layer in balance_layers_to_patch:
+ for balance_layer in mapping.balance_layers:
Contributor


critical

Iterating over mapping.balance_layers here appears to be incorrect. This loop is part of a grid search to find the best scaling factor by minimizing quantization error. By including non-quantized layers (which can be in balance_layers due to force_balance), their weights are modified, and the resulting output distortion is included in the loss calculation. This loss should ideally only reflect quantization error from the quantized layers. Using balance_layers_to_patch, which is defined before this block and contains only the layers to be quantized, would be the correct approach. The influence of force_balance layers is already correctly handled in the computation of w_mean, which contributes to the scales.

Contributor Author


Gemini’s answer seems incorrect. We also need to account for nn.Modules that should not be quantized, so that the model produced by AWQ remains functionally equivalent to the original network.
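
A toy sketch of the equivalence at stake; the shapes, names, and values are made up and this is not the library's code. AWQ multiplies every balance layer's input channels by s and divides the preceding smooth layer by s, so an unquantized gate that shares the same smooth layer must still be rescaled or its output changes:

import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)
smooth_w = torch.randn(d)       # e.g. a layer-norm weight that gets divided by s
gate_w = torch.randn(d, d)      # unquantized MoE gate (a force_balance layer)
expert_w = torch.randn(d, d)    # a balance layer that will be quantized
s = torch.rand(d) + 0.5         # per-input-channel smoothing scales

def forward(sw, gw, ew):
    h = x * sw                  # smooth-layer output feeding both branches
    return h @ gw.T, h @ ew.T

ref_gate, ref_expert = forward(smooth_w, gate_w, expert_w)
new_gate, new_expert = forward(smooth_w / s, gate_w * s, expert_w * s)

# If gate_w were not rescaled by s, new_gate would no longer match ref_gate,
# so the "smoothed but unquantized" gate must still be included in balancing.
assert torch.allclose(ref_gate, new_gate, atol=1e-5)
assert torch.allclose(ref_expert, new_expert, atol=1e-5)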

Comment on lines +308 to +311
awq_ignore = [
    ign for ign in (self.ignore or [])
    if ign not in self.force_balance
]
Contributor


medium

The list comprehension for creating awq_ignore involves checking for membership in self.force_balance, which is a list. This results in a time complexity of O(N*M), where N is the length of self.ignore and M is the length of self.force_balance. You have already computed force_balance_set on line 298, which allows for O(1) average time complexity for membership checking. Using this set would make the operation more efficient, with a total complexity of O(N+M).

Suggested change
awq_ignore = [
    ign for ign in (self.ignore or [])
    if ign not in self.force_balance
]
awq_ignore = [
    ign for ign in (self.ignore or [])
    if ign not in force_balance_set
]

)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")
Contributor Author


I'll revert these changes later

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@ZewenShen-Cohere ZewenShen-Cohere changed the title Support for a module used in an AWQ mapping to be unquantized && other bug fixes for AWQ [AWQ] Support for a module used in an AWQ mapping to be unquantized && other bug fixes Dec 19, 2025
@HDCharles
Collaborator

HDCharles commented Dec 19, 2025

Hey, I fixed the accumulate means bug already in #2114

The intended behavior is that mappings handle AWQ equalization (scale/rescale) while ignore/targets control quantization; you used to be able to put a module in mappings and also in ignore so that it would be equalized but not quantized.

I can see how it's valuable to be able to ignore things from the mappings, especially for VL models, but I think the logic doesn't need a new field. Instead, we should first do a full mapping match, then compare the matched set against the ignore list, and if every layer in the set is in the ignore list we skip the rescale for it.
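
A hedged sketch of that flow; the function name and data shapes are illustrative, not the real llm-compressor API:

def resolve_mappings(matched_mappings, ignore_set):
    # matched_mappings: (smooth_name, [balance_layer_names]) pairs produced by
    # a full mapping match, before the ignore list is consulted.
    resolved = []
    for smooth_name, balance_names in matched_mappings:
        if balance_names and set(balance_names) <= ignore_set:
            # Every matched balance layer is ignored: skip the rescale for it.
            continue
        # Otherwise keep the mapping; ignored layers can still be equalized
        # while being excluded from quantization downstream.
        resolved.append((smooth_name, balance_names))
    return resolved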

@brian-dellabetta
Collaborator

brian-dellabetta commented Dec 19, 2025

Hi @ZewenShen-Cohere, thanks for the information and suggested changes. I agree with @HDCharles; the force_balance field is likely to be confusing, and we are looking into a way to resolve this without it.

The changes you listed for items 3 and 4, though, look good to me; thanks for catching the CPU bottleneck issue. Do you mind modifying this PR (or creating a separate PR) to include just those changes so we can review them separately?

@ZewenShen-Cohere
Contributor Author

Hey, I fixed the accumulate means bug already in #2114

I think the bug I fixed is different from yours. The bug I fixed is in the calculation of the average activation (see this).
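
For reference, a minimal sketch of the kind of streaming average the fix targets; the real _accumulate_mean has a different signature, so this is only illustrative. The first call has no prior state and must return the batch mean itself, and keeping the update on the activations' own device also avoids the CPU bottleneck described in point 3 of the PR:

from typing import Optional, Tuple
import torch

def accumulate_mean(
    new_vals: torch.Tensor,
    prev_mean: Optional[torch.Tensor],
    prev_count: int,
) -> Tuple[torch.Tensor, int]:
    batch_mean = new_vals.mean(dim=0)
    batch_count = new_vals.shape[0]
    if prev_mean is None or prev_count == 0:
        # First run: no prior state to fold in, just return the batch mean.
        return batch_mean, batch_count
    total = prev_count + batch_count
    # Weighted update performed on the tensor's own device (e.g. the GPU),
    # so calibration is not bottlenecked by host-side transfers.
    updated = prev_mean + (batch_mean - prev_mean) * (batch_count / total)
    return updated, total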

I can see how it's valuable to be able to ignore things from the mappings, especially for VL models, but I think the logic doesn't need a new field. Instead, we should first do a full mapping match, then compare the matched set against the ignore list, and if every layer in the set is in the ignore list we skip the rescale for it.

Thank you for the suggestions. The plan sounds good to me.

@ZewenShen-Cohere
Contributor Author

The changes you listed for items 3 and 4, though, look good to me; thanks for catching the CPU bottleneck issue. Do you mind modifying this PR (or creating a separate PR) to include just those changes so we can review them separately?

Sure, I'll do that.

dsikka added a commit that referenced this pull request Jan 8, 2026
…, and improve logging (#2161)

This PR addresses the following issues:

1. _accumulate_mean produces incorrect output on its first run.

2. cache_smooth_activations_hook previously performed the averaging
computation on the CPU. When both the hidden dimension and sequence
length are large, this makes AWQ calibration CPU-bound. The slowdown is
especially severe when multiple AWQ quantization jobs run concurrently.

3. Added more informative logging to the AWQ calibration grid search,
including per-mapping JSON logs.

This PR is a subset of
#2158

---------

Signed-off-by: ZewenShen-Cohere <zewen.shen@cohere.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
jwpark33 pushed a commit to jwpark33/llm-compressor that referenced this pull request Jan 8, 2026
…, and improve logging (vllm-project#2161)