[AWQ] Support for a module used in an AWQ mapping to be unquantized && other bug fixes #2158
ZewenShen-Cohere wants to merge 6 commits into vllm-project:main from …
Conversation
Summary of Changes

Hello @ZewenShen-Cohere, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the AWQ quantization process by introducing fine-grained control over which modules are quantized versus merely smoothed, which is particularly beneficial for maintaining accuracy in MoE architectures. It also addresses performance bottlenecks in the calibration phase by leveraging GPU computation and resolves a bug in mean accumulation. Furthermore, the enhanced logging for the grid search provides clearer insight into the calibration process.
Code Review
This pull request introduces support for keeping certain modules unquantized during AWQ while still including them in the smoothing process, using a new force_balance parameter. It also includes several important bug fixes and improvements, such as correcting the _accumulate_mean function, optimizing activation caching to be less CPU-bound, and enhancing logging for the grid search. My review identifies a critical issue in the grid search logic that could lead to suboptimal scaling factors, and a minor performance issue. The other changes, including bug fixes and usability improvements, are well-implemented.
```diff
  # Q(W * s)
- for balance_layer in balance_layers_to_patch:
+ for balance_layer in mapping.balance_layers:
```
Iterating over mapping.balance_layers here appears to be incorrect. This loop is part of a grid search to find the best scaling factor by minimizing quantization error. By including non-quantized layers (which can be in balance_layers due to force_balance), their weights are modified, and the resulting output distortion is included in the loss calculation. This loss should ideally only reflect quantization error from the quantized layers. Using balance_layers_to_patch, which is defined before this block and contains only the layers to be quantized, would be the correct approach. The influence of force_balance layers is already correctly handled in the computation of w_mean, which contributes to the scales.
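For context, here is a minimal sketch of the grid-search loss being discussed (helper names like `fake_quant` are hypothetical, not the actual llm-compressor API): each candidate layer's weight is scaled and fake-quantized, and the output error against the unmodified layer drives the choice of scales. Whether `layers` should be `balance_layers_to_patch` or `mapping.balance_layers` is exactly the point of contention.

```python
import torch
import torch.nn.functional as F

def grid_search_loss(x, layers, scales, fake_quant):
    """Sum of per-layer MSE between the original output and the output after
    the weight is scaled and fake-quantized (Q(W * s)), with the input divided
    by the same per-channel scales."""
    loss = 0.0
    with torch.no_grad():
        for layer in layers:  # balance_layers_to_patch vs. mapping.balance_layers
            fp_out = layer(x)                            # reference output
            w_q = fake_quant(layer.weight * scales)      # Q(W * s)
            q_out = F.linear(x / scales, w_q, layer.bias)
            loss += (fp_out - q_out).pow(2).mean().item()
    return loss
```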
Gemini’s answer seems incorrect. We also need to account for nn.Modules that should not be quantized, so that the model produced by AWQ remains functionally equivalent to the original network.
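To illustrate the functional-equivalence argument: AWQ smoothing divides the smooth layer's output by s and multiplies every downstream balance layer's weight by s, so the composition only stays unchanged if every balance layer is rescaled, including ones that are never quantized. A minimal numeric check (shapes and names are illustrative only):

```python
import torch

x = torch.randn(4, 8)        # activations coming out of the smooth layer
w = torch.randn(16, 8)       # one balance layer's weight (out_features x in_features)
s = torch.rand(8) + 0.5      # per-channel smoothing scales

original = x @ w.T
smoothed = (x / s) @ (w * s).T   # scale folded out of the input, into the weight
assert torch.allclose(original, smoothed, atol=1e-4)
```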
```python
awq_ignore = [
    ign for ign in (self.ignore or [])
    if ign not in self.force_balance
]
```
The list comprehension for creating awq_ignore involves checking for membership in self.force_balance, which is a list. This results in a time complexity of O(N*M), where N is the length of self.ignore and M is the length of self.force_balance. You have already computed force_balance_set on line 298, which allows for O(1) average time complexity for membership checking. Using this set would make the operation more efficient, with a total complexity of O(N+M).
```diff
 awq_ignore = [
     ign for ign in (self.ignore or [])
-    if ign not in self.force_balance
+    if ign not in force_balance_set
 ]
```
```python
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")
```
I'll revert these changes later.
Hey, I fixed the accumulate-means bug already in #2114. The intended behavior is that mappings handle AWQ equalization (scale/rescale), while ignore/targets control quantization; you used to be able to put something in mappings and also in ignore to equalize it without quantizing it. I can see how it's valuable to be able to exclude things from the mappings, especially for VL models, but I don't think the logic needs a new field. What should happen instead is: first do a full mapping match, then compare against the ignore list, and if every layer in the set is in the ignore list, skip the rescale for that set.
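A rough sketch of the logic proposed above (names are hypothetical, not the actual llm-compressor API): resolve each mapping fully, then skip the equalization/rescale step only when every resolved balance layer also appears in the ignore list.

```python
def should_rescale(resolved_balance_layers: list[str], ignore: list[str]) -> bool:
    """Skip the rescale only when *every* matched balance layer is ignored;
    a partially ignored set is still equalized (but its ignored members are
    not quantized)."""
    ignore_set = set(ignore)
    return not all(name in ignore_set for name in resolved_balance_layers)
```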
Hi @ZewenShen-Cohere, thanks for the information and suggested changes. I agree with @HDCharles here. Your changes listed for numbers 3 and 4, though, look good to me; thanks for catching the CPU bottleneck issue. Do you mind modifying this PR (or creating a separate PR) to include just those changes, so we can review them separately?
I think the bug that I fixed is different from yours. The bug I fixed is in the calculation of the average activation (see this).
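As an illustration of the intended behavior (the signature below is a simplified stand-in, not the actual _accumulate_mean), an incremental mean has to treat the very first batch as the mean itself rather than folding it into an uninitialized accumulator:

```python
import torch

def accumulate_mean(new_vals: torch.Tensor, prev_mean, prev_count: int):
    """Combine the mean over `prev_count` earlier samples with a new batch."""
    batch_count = new_vals.shape[0]
    batch_sum = new_vals.sum(dim=0)
    if prev_count == 0:
        # First call: the running mean is just this batch's mean.
        return batch_sum / batch_count, batch_count
    total = prev_count + batch_count
    return (prev_mean * prev_count + batch_sum) / total, total
```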
Thank you for the suggestions. The plan sounds good to me.
Sure, I'll do that.
…, and improve logging (#2161)

This PR addresses the following issues:

1. _accumulate_mean produces incorrect output on its first run.
2. cache_smooth_activations_hook previously performed the averaging computation on the CPU. When both the hidden dimension and sequence length are large, this makes AWQ calibration CPU-bound. The slowdown is especially severe when multiple AWQ quantization jobs run concurrently.
3. Added more informative logging to the AWQ calibration grid search, including per-mapping JSON logs.

This PR is a subset of #2158

---------

Signed-off-by: ZewenShen-Cohere <zewen.shen@cohere.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
This PR addresses the following issues:
1. In some cases, we want to keep specific nn.Modules used in AWQ mappings unquantized to preserve accuracy, such as the MoE gate. This PR introduces a new force_balance field to support this behavior (a usage sketch follows this list). It also fixes issue #2151 and corrects the default MoE AWQ mapping for the post-attention layer norm, which previously did not account for balancing the MoE gate.
2. _accumulate_mean produces incorrect output on its first run.
3. cache_smooth_activations_hook previously performed the averaging computation on the CPU. When both the hidden dimension and sequence length are large, this makes AWQ calibration CPU-bound. The slowdown is especially severe when multiple AWQ quantization jobs run concurrently. (A sketch of the GPU-side fix also follows this list.)
4. Added more informative error logging to the AWQ calibration grid search.
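A usage sketch for the proposed force_balance field (item 1 above). Only force_balance is new in this PR; the other AWQModifier arguments follow llm-compressor's existing interface, and the module regexes are illustrative rather than taken from a specific model config:

```python
from llmcompressor.modifiers.awq import AWQModifier

# Keep the MoE gate unquantized (ignore) while still letting it be rescaled
# as part of the AWQ smoothing mappings (force_balance).
recipe = AWQModifier(
    scheme="W4A16",
    targets=["Linear"],
    ignore=["lm_head", "re:.*mlp.gate$"],
    force_balance=["re:.*mlp.gate$"],
)
```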
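And a sketch of the hook fix in item 3 (the helper name and shapes are assumptions): perform the reduction on the device that produced the activation and move only the small per-channel result off-GPU, instead of transferring the full (batch, seq_len, hidden_dim) tensor and averaging on the CPU.

```python
import torch

def summarize_smooth_activation(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, seq_len, hidden_dim) calibration activation on GPU."""
    per_channel_mean = hidden.abs().flatten(0, -2).mean(dim=0)  # reduce on-device
    return per_channel_mean.cpu()  # transfer only (hidden_dim,) values
```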
Test Plan
Compare the accuracy of AWQ results produced by the new version against the previous version on the Qwen-3a30t model.