[Bugfix] Make compressed-tensors MoEs respect ignored layers #28878
mgoin merged 9 commits into vllm-project:main
Conversation
Code Review
This pull request addresses a bug where models with partially quantized Mixture-of-Experts (MoE) layers would fail to load. The fix involves refactoring the quantization scheme retrieval logic and explicitly handling unquantized MoE layers by introducing an UnquantizedFusedMoEMethod. The changes are logical and correctly solve the described problem. My main feedback is regarding the new logic for determining the MoE quantization scheme, which currently only checks the first expert and assumes all others are the same. This could lead to incorrect behavior for models with more complex or heterogeneous expert configurations. I've added a comment with a suggestion to make this more robust.
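The reviewer's concern about checking only the first expert could be guarded roughly like this. This is a hedged sketch, not vLLM's actual code: `resolve_moe_scheme` and its input shape are hypothetical, but it shows the homogeneity check the review suggests — fail loudly when experts resolve to different schemes instead of silently applying the first expert's scheme to all of them.

```python
from __future__ import annotations

def resolve_moe_scheme(expert_schemes: dict[str, str | None]) -> str | None:
    """Illustrative helper: expert_schemes maps each expert weight name to its
    matched quantization scheme, or None when the expert is unquantized
    (e.g. listed in the ignore list)."""
    schemes = set(expert_schemes.values())
    if len(schemes) > 1:
        # Heterogeneous experts within one fused MoE layer cannot share a
        # single fused kernel, so refuse rather than guess from expert 0.
        raise ValueError(
            f"Heterogeneous expert quantization is unsupported: {schemes}"
        )
    # All experts agree; a single scheme (possibly None) covers the layer.
    return schemes.pop()
```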
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
Force-pushed from 20d9034 to 4f7fca1
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
CC @mgoin
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
Can you update the title to be more clear?
Force-pushed from c1c625c to 7907935
Thanks for explaining, LGTM!
@HDCharles the test failure looks related, PTAL
FYI - the failure is because the config was generated using a newer compressed-tensors (ct) nightly, whereas ct 12.2 is used by vLLM. We should use 12.2 for test configs until support is upgraded in vLLM (or simply remove the scale_dtype / zp_dtype fields).
Force-pushed from ebc5ef4 to f91fd86
Applying quantization to some MoE layers but not others would cause model load errors, because vLLM assumed all layers were quantized and never checked the ignore list.

Changes:
- Added a helper function get_scheme_dict, used by get_scheme, so MoE and Linear share a single interface for matching layers
- MoE matching previously assumed the 'Linear' target applied to MoE; added a helper to add 'FusedMoE' to target_scheme_map and then match normally by either layer name or module type

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Force-pushed from 3e88c54 to 96dfdb4
# we can only upgrade after this is resolved
# TODO(jerryzh168): resolve the above comment
- uv pip install --system torchao==0.13.0 --index-url https://download.pytorch.org/whl/cu129
- uv pip install --system conch-triton-kernels
Why is this needed now if we didn't need this before? Is it needed for the new model somehow?
The new test needs this; otherwise there's no kernel for the tiny model I made for the test.
…oject#28878) Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
…oject#28878) Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com> Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
…oject#28878) Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Applying quantization to some MoE layers but not others would cause model load errors, because vLLM assumed all layers were quantized and never checked the ignore list.
Changes
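A rough sketch of the described ignore-list handling. The function and variable names below are illustrative, not the PR's actual helpers: the point is that a single lookup, shared by Linear and FusedMoE layers, consults the ignore list before matching `target_scheme_map`, so ignored layers fall back to an unquantized method instead of crashing.

```python
import fnmatch

def get_scheme_for_layer(layer_name, layer_type, target_scheme_map, ignore):
    """Illustrative only: return the matched scheme for a layer, or None
    when the layer should stay unquantized."""
    # The missing check: ignored layers are left unquantized, so the caller
    # can use UnquantizedFusedMoEMethod / UnquantizedLinearMethod for them.
    if any(fnmatch.fnmatch(layer_name, pat) for pat in ignore):
        return None
    # Match by module type first (e.g. "FusedMoE", "Linear"), so MoE layers
    # no longer need to piggyback on the "Linear" target...
    if layer_type in target_scheme_map:
        return target_scheme_map[layer_type]
    # ...then fall back to matching by layer-name pattern.
    for target, scheme in target_scheme_map.items():
        if fnmatch.fnmatch(layer_name, target):
            return scheme
    return None
```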
Test Plan
pytest tests/quantization/test_compressed_tensors.py::test_compressed_tensors_moe_ignore_with_model -vs -rs
Test Result
@kylesayrs @dsikka