[Fix] SmoothQuant MoE Support: Smooth All Experts, Not Just expert.0 (#2084)
Merged
Conversation
Add comprehensive unit tests to verify that SmoothQuant correctly handles Mixture of Experts (MoE) models by smoothing all experts, not just the first one.

Tests added:
- test_moe_all_experts_smoothed: verifies all 8 experts in a single MoE layer are included in balance_layers
- test_moe_multiple_layers_all_experts_smoothed: verifies correct behavior across multiple transformer layers with 4 experts each

These tests currently fail with the existing implementation, which only matches the first expert due to get_matching_layer() returning a single match instead of iterating over all matches.

Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
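The scenario these tests exercise can be sketched with a toy MoE module. This is not the actual test fixture; the class, expert names, and dimensions below are illustrative, and plain regex matching over named_modules stands in for the modifier's resolution logic:

```python
import re
import torch.nn as nn

# Toy MoE block: 8 experts, each with w1/w2 projections (names are illustrative).
class ToyMoE(nn.Module):
    def __init__(self, num_experts=8, dim=16):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.ModuleDict({"w1": nn.Linear(dim, dim), "w2": nn.Linear(dim, dim)})
            for _ in range(num_experts)
        )

model = ToyMoE()
# A SmoothQuant-style regex target: every expert's w1 should match, not just expert 0.
pattern = re.compile(r".*experts.*w1")
matches = [name for name, _ in model.named_modules() if pattern.fullmatch(name)]
print(len(matches))  # 8
```

The tests assert this property on the resolved balance_layers: the bug manifests as a list of length 1 instead of 8.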
Force-pushed d7efc95 to 5c4d2d6
HDCharles reviewed on Dec 2, 2025
Force-pushed 72f9c1a to 5028ebf
Replace get_matching_layer() with match_named_modules() to iterate over ALL matched layers instead of returning only the first match. This fixes a critical bug where only expert.0 was smoothed in MoE models, leaving all other experts unsmoothed and causing severe accuracy degradation.

Changes:
- Use match_named_modules from compressed_tensors.utils to iterate over all matching modules
- Search for balance layers within the parent module scope for better locality
- Follow the same pattern already proven to work in AWQModifier

This fix ensures all experts in MoE models (Mixtral, Qwen3, Phi, DeepSeek) are properly smoothed during quantization.

Signed-off-by: Rahul-Tuli <rtuli@redhat.com>
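A minimal sketch of the two behaviors this commit swaps, plus the parent-scope search it mentions. These are simplified re-implementations over named_modules, not the real compressed_tensors.utils API, and the model structure is made up:

```python
import re
import torch.nn as nn

# Toy model: two decoder layers, each with a norm and two experts (illustrative).
model = nn.ModuleDict({"layers": nn.ModuleList(
    nn.ModuleDict({
        "norm": nn.LayerNorm(4),
        "experts": nn.ModuleList(nn.Linear(4, 4) for _ in range(2)),
    }) for _ in range(2)
)})
pattern = re.compile(r".*experts\.\d+")

def first_match(model, pattern):
    # Old behavior (get_matching_layer-style): stop at the first hit.
    for name, _ in model.named_modules():
        if pattern.fullmatch(name):
            return name

def all_matches(model, pattern, scope=""):
    # New behavior (match_named_modules-style): collect every hit, optionally
    # searching only within a parent module scope for locality.
    root = model.get_submodule(scope) if scope else model
    return [f"{scope}.{n}" if scope else n
            for n, _ in root.named_modules() if pattern.fullmatch(n)]

print(first_match(model, pattern))  # layers.0.experts.0 -- only expert 0 is found
print(all_matches(model, pattern))  # all four experts across both layers
print(all_matches(model, re.compile(r"experts\.\d+"), scope="layers.1"))
# only layer 1's experts: ['layers.1.experts.0', 'layers.1.experts.1']
```

Scoping the search to the smooth layer's parent keeps layer 0's norm paired with layer 0's experts rather than with matches from other decoder layers.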
rahul-tuli (Author)
@fynnsu Great suggestions! I've applied both; the changes are now committed. Thanks for the thoughtful review!
Force-pushed 5028ebf to 93ef5ca
HDCharles approved these changes on Dec 3, 2025
fynnsu approved these changes on Dec 3, 2025
fynnsu added a commit that referenced this pull request on Dec 10, 2025
Depends on vllm-project/compressed-tensors#524

Summary:
- Modified AWQ _set_resolved_mappings:
  - Get smoothing and balance layers at the same time using match_modules_set
  - (Bugfix) Correct the logic so that if any balance layers are incompatible, that matching is skipped
  - Added warnings
  - Removed tqdm and skip counting @kylesayrs
  - Added a helper for module_to_name
  - Removed hardcoded handling for a single balance layer by updating get_lowest_common_module to handle that case
- Modified SmoothQuant _resolve_mappings:
  - Brought it into alignment with AWQ. This is largely a horizontal move, though it adds handling for situations that would previously have been missed, such as multiple smooth-layer matches in a single set and parent contexts further than one layer away.
  - Updated mapping definitions to always be tuple(list[str], str), which was always the case in practice but, unlike in AWQ, was not required
- Removed get_lowest_common_parent: we can now use CT's get_lowest_common_ancestor_name, so we only need to check for module_list (it has many bugfixes compared to the get_lowest_common_parent implementation in LLMC)
- Updated test_base for AWQ and SmoothQuant:
  - Added a test case for _set_resolved_mappings to check that partially skipped matches are handled correctly
  - Added tests for MoE matching being handled correctly
  - Added test cases for get_lowest_non_module_list_ancestor
  - Imported Linear and used that instead of torch.nn.Linear
- Reverted test_pytorch.py for logarithmic_equalizations and SmoothQuant:
  - The test was updated in #2084 by @rahul-tuli to ignore some modules, but with the new logic you need to ignore the whole set. If you ignore only one element, the matching logic would somehow need to determine whether there is a full set, which it does not do. In the previous logic this was possible because the whole set was assumed to be siblings of the smooth_layer; the new util relaxes this assumption to be more flexible, which prevents the same approach from working. If this is a common need, perhaps we can add a util that checks for a parent context of size N.

TEST PLAN:
pytest /home/HDCharles/repos/llm-compressor/tests/llmcompressor/modifiers/awq/test_base.py
pytest /home/HDCharles/repos/llm-compressor/tests/llmcompressor/modifiers/smoothquant/test_base.py

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Fynn Schmitt-Ulms <fynnsu@outlook.com>
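The ancestor-resolution step this follow-up commit describes can be sketched as follows. This is an illustrative stand-in for CT's get_lowest_common_ancestor_name plus the ModuleList check; the helper names mirror the commit text, but the implementation and model structure are made up:

```python
import torch.nn as nn

def lowest_common_ancestor_name(names):
    """Longest shared dotted-name prefix of the given module names (illustrative)."""
    parts = [n.split(".") for n in names]
    prefix = []
    for segments in zip(*parts):
        if len(set(segments)) != 1:
            break
        prefix.append(segments[0])
    return ".".join(prefix)

def get_lowest_non_module_list_ancestor(model, names):
    # Walk up from the common ancestor while it is a ModuleList, since a
    # ModuleList (e.g. `experts`) is a container, not a usable parent scope.
    name = lowest_common_ancestor_name(names)
    while name and isinstance(model.get_submodule(name), nn.ModuleList):
        name = name.rsplit(".", 1)[0] if "." in name else ""
    return name

moe = nn.ModuleDict({"block": nn.ModuleDict({
    "experts": nn.ModuleList(nn.Linear(4, 4) for _ in range(2)),
})})
print(lowest_common_ancestor_name(["block.experts.0", "block.experts.1"]))  # block.experts
print(get_lowest_non_module_list_ancestor(moe, ["block.experts.0", "block.experts.1"]))  # block
```

The second helper is why only the module-list case needs special handling: the raw common ancestor of two experts is the experts ModuleList itself, so the search steps up one level to the enclosing block.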
Problem

SmoothQuant only smoothed the first expert (expert.0) in Mixture of Experts (MoE) models, leaving all other experts unsmoothed. This caused severe accuracy degradation for MoE models.

Root Cause
The _resolve_mappings() method in SmoothQuantModifier used get_matching_layer(), which returns only the first regex match instead of iterating over all matches. For MoE models with regex patterns like "re:.*experts.*w1", this meant only expert.0.w1 was smoothed while experts 1-N were ignored.

Solution
Replace get_matching_layer() with match_named_modules() from compressed_tensors.utils to iterate over ALL matched layers. This follows the same proven pattern used in AWQModifier.

Key Changes
- Import match_named_modules from compressed_tensors.utils
- _resolve_mappings(): iterate over all matched layers instead of just the first

Tests Added
Added unit tests that cover the issue and verify MoE support; these tests fail on main but pass with the current diff:

1. test_moe_all_experts_smoothed: verifies all 8 experts in a single MoE layer are included in balance_layers
2. test_moe_multiple_layers_all_experts_smoothed: verifies correct behavior across multiple transformer layers
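What the two tests assert can be sketched self-containedly. This is not the actual test code: the resolver below is a stand-in for SmoothQuant's mapping resolution, and the model builder is an invented minimal fixture:

```python
import re
import torch.nn as nn

def resolve_balance_layers(model, target_regex):
    # Stand-in for mapping resolution: collect every module whose dotted
    # name fully matches the regex target.
    pattern = re.compile(target_regex)
    return [name for name, _ in model.named_modules() if pattern.fullmatch(name)]

def make_model(num_layers, num_experts, dim=8):
    # Minimal transformer-like nesting: layers -> experts -> w1 projection.
    return nn.ModuleDict({"layers": nn.ModuleList(
        nn.ModuleDict({"experts": nn.ModuleList(
            nn.ModuleDict({"w1": nn.Linear(dim, dim)}) for _ in range(num_experts)
        )}) for _ in range(num_layers)
    )})

# Single layer, 8 experts: all eight w1 modules must appear in balance_layers.
single = make_model(num_layers=1, num_experts=8)
assert len(resolve_balance_layers(single, r".*experts.*w1")) == 8

# Two layers, 4 experts each: 8 matches overall, and exactly 4 within layer 0.
multi = make_model(num_layers=2, num_experts=4)
assert len(resolve_balance_layers(multi, r".*experts.*w1")) == 8
assert len(resolve_balance_layers(multi, r"layers\.0\..*experts.*w1")) == 4
```

On the pre-fix first-match behavior, the first assertion would see a single-element list and fail, which is exactly how the new tests catch the regression.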
Test Results

All tests pass successfully:

$ python -m pytest tests/llmcompressor/modifiers/smoothquant/test_base.py -v
test_smooth_quant_is_registered ✅ PASSED
test_smooth_quant_defaults ✅ PASSED
test_override_defaults ✅ PASSED
test_moe_all_experts_smoothed ✅ PASSED
test_moe_multiple_layers_all_experts_smoothed ✅ PASSED
========================= 5 passed in 0.41s =========================

Before Fix (Tests Failed)
After Fix (Tests Pass)
Related Issues
Fixes the SmoothQuant MoE bug reported in the community discussion about MoE quantization support.