
[AWQ] Support for a module used in an AWQ mapping to be unquantized && other bug fixes#2158

Closed
ZewenShen-Cohere wants to merge 6 commits into vllm-project:main from ZewenShen-Cohere:awq_bugfix

Conversation

@ZewenShen-Cohere
Contributor

This PR addresses the following issues:

  1. In some cases, we want to keep specific nn.Modules used in AWQ mappings unquantized to preserve accuracy, such as the MoE gate. This PR introduces a new force_balance field to support this behavior (see the recipe sketch after this list). It also fixes issue #2151 and corrects the default MoE AWQ mapping for the post-attention layer norm, which previously did not account for balancing the MoE gate.

  2. _accumulate_mean produces incorrect output on its first run.

  3. cache_smooth_activations_hook previously performed the averaging computation on the CPU. When both the hidden dimension and sequence length are large, this makes AWQ calibration CPU-bound. The slowdown is especially severe when multiple AWQ quantization jobs run concurrently.

  4. Added more informative error logging to the AWQ calibration grid search.
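
To illustrate item 1, here is a minimal recipe sketch of how the proposed force_balance field might be used. The regex pattern, import path, and the other AWQModifier arguments are illustrative assumptions, not the exact configuration in this PR:

from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    scheme="W4A16",
    targets=["Linear"],
    # Keep the MoE router gate out of quantization entirely...
    ignore=["re:.*mlp.gate$"],
    # ...but still let it participate in smoothing/balancing so the
    # post-attention layer-norm mapping stays consistent at inference time.
    force_balance=["re:.*mlp.gate$"],
)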

Test Plan

Compare the accuracy of AWQ results produced by the new version against the previous version on the Qwen-3a30t model.
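
A hedged sketch of one way such an accuracy comparison could be spot-checked; the model identifier and evaluation text are placeholders, not the exact setup used in this test plan:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    # Score the same text with the baseline and the AWQ checkpoint, then compare.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# A small perplexity gap between the two checkpoints suggests accuracy is preserved.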

@gemini-code-assist
Contributor

Summary of Changes

Hello @ZewenShen-Cohere, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AWQ quantization process by introducing fine-grained control over which modules are quantized versus merely smoothed, which is particularly beneficial for maintaining accuracy in MoE architectures. It also addresses critical performance bottlenecks in the calibration phase by leveraging GPU computation and resolves a bug in mean accumulation. Furthermore, the enhanced logging for the grid search provides clearer insights into the calibration process.

Highlights

  • Selective Unquantization with force_balance: Introduced a new force_balance parameter in the AWQModifier to allow specific nn.Modules (e.g., MoE gate layers) to be smoothed but not quantized. This preserves accuracy for critical components and addresses issue [Bug]: qwen3-vl-30b-a3b v0.9.0 awq no matches found for input_layernorm #2151, along with correcting the default MoE AWQ mapping.
  • GPU-accelerated AWQ Calibration: Optimized the cache_smooth_activations_hook to perform activation mean accumulation on the GPU instead of the CPU. This significantly improves calibration speed, especially for large models and when running multiple AWQ quantization jobs concurrently.
  • _accumulate_mean Bug Fix: Corrected a bug in the _accumulate_mean utility function that caused it to produce incorrect output during its initial run.
  • Enhanced Grid Search Logging: Added a progress bar (tqdm) and more informative error logging to the AWQ calibration grid search. This now includes initial error, best error, and the error reduction rate, providing better observability into the calibration process.
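
For the logging highlight above, a hedged sketch of what such a grid-search summary might look like; the message format, field names, and wrapper function are assumptions, not the exact logging added in this PR:

import logging
from tqdm import tqdm

logger = logging.getLogger(__name__)

def run_grid_search(ratios, compute_error, mapping_name):
    # Try each candidate ratio, track the first and best errors seen,
    # and emit a per-mapping summary with the error reduction rate.
    init_error = best_error = best_ratio = None
    for ratio in tqdm(ratios, desc=f"AWQ grid search: {mapping_name}"):
        err = compute_error(ratio)
        if init_error is None:
            init_error = err
        if best_error is None or err < best_error:
            best_error, best_ratio = err, ratio
    reduction = (init_error - best_error) / init_error if init_error else 0.0
    logger.info(
        "%s: init_error=%.6f best_error=%.6f best_ratio=%.2f reduction=%.1f%%",
        mapping_name, init_error, best_error, best_ratio, 100 * reduction,
    )
    return best_ratio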



@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for keeping certain modules unquantized during AWQ while still including them in the smoothing process, using a new force_balance parameter. It also includes several important bug fixes and improvements, such as correcting the _accumulate_mean function, optimizing activation caching to be less CPU-bound, and enhancing logging for the grid search. My review identifies a critical issue in the grid search logic that could lead to suboptimal scaling factors, and a minor performance issue. The other changes, including bug fixes and usability improvements, are well-implemented.


# Q(W * s)
- for balance_layer in balance_layers_to_patch:
+ for balance_layer in mapping.balance_layers:
Contributor


critical

Iterating over mapping.balance_layers here appears to be incorrect. This loop is part of a grid search to find the best scaling factor by minimizing quantization error. By including non-quantized layers (which can be in balance_layers due to force_balance), their weights are modified, and the resulting output distortion is included in the loss calculation. This loss should ideally only reflect quantization error from the quantized layers. Using balance_layers_to_patch, which is defined before this block and contains only the layers to be quantized, would be the correct approach. The influence of force_balance layers is already correctly handled in the computation of w_mean, which contributes to the scales.

Contributor Author


Gemini’s answer seems incorrect. We also need to account for nn.Modules that should not be quantized, so that the model produced by AWQ remains functionally equivalent to the original network.
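
A toy sketch of the equivalence at stake; the shapes, names, and values are made up and this is not the library's code. AWQ multiplies every balance layer's input channels by s and divides the preceding smooth layer by s, so an unquantized gate that shares the same smooth layer must still be rescaled or its output changes:

import torch

torch.manual_seed(0)
d = 8
x = torch.randn(4, d)
smooth_w = torch.randn(d)       # e.g. a layer-norm weight that gets divided by s
gate_w = torch.randn(d, d)      # unquantized MoE gate (a force_balance layer)
expert_w = torch.randn(d, d)    # a balance layer that will be quantized
s = torch.rand(d) + 0.5         # per-input-channel smoothing scales

def forward(sw, gw, ew):
    h = x * sw                  # smooth-layer output feeding both branches
    return h @ gw.T, h @ ew.T

ref_gate, ref_expert = forward(smooth_w, gate_w, expert_w)
new_gate, new_expert = forward(smooth_w / s, gate_w * s, expert_w * s)

# If gate_w were not rescaled by s, new_gate would no longer match ref_gate,
# so the "smoothed but unquantized" gate must still be included in balancing.
assert torch.allclose(ref_gate, new_gate, atol=1e-5)
assert torch.allclose(ref_expert, new_expert, atol=1e-5)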

Comment on lines +308 to +311
awq_ignore = [
    ign for ign in (self.ignore or [])
    if ign not in self.force_balance
]
Contributor


medium

The list comprehension for creating awq_ignore involves checking for membership in self.force_balance, which is a list. This results in a time complexity of O(N*M), where N is the length of self.ignore and M is the length of self.force_balance. You have already computed force_balance_set on line 298, which allows for O(1) average time complexity for membership checking. Using this set would make the operation more efficient, with a total complexity of O(N+M).

Suggested change
awq_ignore = [
    ign for ign in (self.ignore or [])
    if ign not in self.force_balance
]
awq_ignore = [
    ign for ign in (self.ignore or [])
    if ign not in force_balance_set
]

)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")
Contributor Author


I'll revert these changes later

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@ZewenShen-Cohere ZewenShen-Cohere changed the title Support for a module used in an AWQ mapping to be unquantized && other bug fixes for AWQ [AWQ] Support for a module used in an AWQ mapping to be unquantized && other bug fixes Dec 19, 2025
@HDCharles
Collaborator

HDCharles commented Dec 19, 2025

Hey, I fixed the accumulate means bug already in #2114

The intended behavior is that mappings handle AWQ equalization (scale/rescale) while ignore/targets control quantization; you used to be able to put a module in mappings and also in ignore so that it would be equalized but not quantized.

I can see how it's valuable to be able to ignore things from the mappings, especially for VL models, but I think the logic doesn't need a new field. Instead, we should first do a full mapping match, then compare the matched set against the ignore list, and if every layer in the set is in the ignore list we skip the rescale for it.
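
A hedged sketch of that flow; the function name and data shapes are illustrative, not the real llm-compressor API:

def resolve_mappings(matched_mappings, ignore_set):
    # matched_mappings: (smooth_name, [balance_layer_names]) pairs produced by
    # a full mapping match, before the ignore list is consulted.
    resolved = []
    for smooth_name, balance_names in matched_mappings:
        if balance_names and set(balance_names) <= ignore_set:
            # Every matched balance layer is ignored: skip the rescale for it.
            continue
        # Otherwise keep the mapping; ignored layers can still be equalized
        # while being excluded from quantization downstream.
        resolved.append((smooth_name, balance_names))
    return resolved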

@brian-dellabetta
Collaborator

brian-dellabetta commented Dec 19, 2025

Hi @ZewenShen-Cohere, thanks for the information and suggested changes. I agree with @HDCharles; the force_balance field is likely to be confusing, and we are looking into a way to resolve this without it.

The changes you listed for items 3 and 4, though, look good to me; thanks for catching the CPU bottleneck issue. Do you mind modifying this PR (or creating a separate PR) to include just those changes so we can review them separately?

@ZewenShen-Cohere
Contributor Author

Hey, I fixed the accumulate means bug already in #2114

I think the bug I fixed is different from yours. The bug I fixed is in the calculation of the average activation (see this).
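
For reference, a minimal sketch of the kind of streaming average the fix targets; the real _accumulate_mean has a different signature, so this is only illustrative. The first call has no prior state and must return the batch mean itself, and keeping the update on the activations' own device also avoids the CPU bottleneck described in point 3 of the PR:

from typing import Optional, Tuple
import torch

def accumulate_mean(
    new_vals: torch.Tensor,
    prev_mean: Optional[torch.Tensor],
    prev_count: int,
) -> Tuple[torch.Tensor, int]:
    batch_mean = new_vals.mean(dim=0)
    batch_count = new_vals.shape[0]
    if prev_mean is None or prev_count == 0:
        # First run: no prior state to fold in, just return the batch mean.
        return batch_mean, batch_count
    total = prev_count + batch_count
    # Weighted update performed on the tensor's own device (e.g. the GPU),
    # so calibration is not bottlenecked by host-side transfers.
    updated = prev_mean + (batch_mean - prev_mean) * (batch_count / total)
    return updated, total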

I can see how it's valuable to be able to ignore things from the mappings, especially for VL models, but I think the logic doesn't need a new field. Instead, we should first do a full mapping match, then compare the matched set against the ignore list, and if every layer in the set is in the ignore list we skip the rescale for it.

Thank you for the suggestions. The plan sounds good to me.

@ZewenShen-Cohere
Contributor Author

The changes you listed for items 3 and 4, though, look good to me; thanks for catching the CPU bottleneck issue. Do you mind modifying this PR (or creating a separate PR) to include just those changes so we can review them separately?

Sure, I'll do that.

dsikka added a commit that referenced this pull request Jan 8, 2026
…, and improve logging (#2161)

This PR addresses the following issues:

1. _accumulate_mean produces incorrect output on its first run.

2. cache_smooth_activations_hook previously performed the averaging
computation on the CPU. When both the hidden dimension and sequence
length are large, this makes AWQ calibration CPU-bound. The slowdown is
especially severe when multiple AWQ quantization jobs run concurrently.

3. Added more informative logging to the AWQ calibration grid search,
including per-mapping JSON logs.

This PR is a subset of
#2158

---------

Signed-off-by: ZewenShen-Cohere <zewen.shen@cohere.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
jwpark33 pushed a commit to jwpark33/llm-compressor that referenced this pull request Jan 8, 2026
…, and improve logging (vllm-project#2161)