[Bugfix] Make compressed-tensors MoEs respect ignored layers#28878

Merged
mgoin merged 9 commits into vllm-project:main from HDCharles:095_bug_ignore_moe
Nov 27, 2025
Conversation

@HDCharles (Contributor) commented Nov 17, 2025

Purpose

Applying quantization to some MoE layers but not others would cause model load errors, because vLLM assumed all layers were quantized and never checked the ignore list.

Changes

  • Added a helper function get_scheme_dict, used by get_scheme, so MoE and Linear layers share a single interface for matching layers.
  • MoE matching previously assumed the 'Linear' target applied to MoE layers. Added a helper that adds 'FusedMoE' to target_scheme_map, then matches normally by either layer name or module type.
  • Added a new test covering this.
  • Added conch-triton-kernels as a test dependency; without it the tiny test model has no available kernels.
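The matching flow in the bullets above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the function name `find_scheme`, the dict shapes, and the glob-style ignore matching are assumptions; the actual helper is `get_scheme_dict` inside vLLM's compressed-tensors integration.

```python
from fnmatch import fnmatch


def find_scheme(layer_name, module_type, target_scheme_map, ignore):
    """Return the quantization scheme for a layer, or None if the layer
    appears in the config's ignore list (i.e. it loads unquantized)."""
    # The ignore list wins: an ignored layer must not be quantized.
    if any(fnmatch(layer_name, pattern) for pattern in ignore):
        return None
    # Match by explicit layer-name target first, then by module type
    # (e.g. "Linear" or the newly added "FusedMoE" target).
    for target, scheme in target_scheme_map.items():
        if fnmatch(layer_name, target) or target == module_type:
            return scheme
    return None


# Example: MoE layers quantized everywhere except block 2's experts.
scheme_map = {"FusedMoE": {"weights": "int8"}}
ignore = ["model.layers.2.mlp.experts*"]
print(find_scheme("model.layers.0.mlp.experts", "FusedMoE", scheme_map, ignore))  # -> {'weights': 'int8'}
print(find_scheme("model.layers.2.mlp.experts", "FusedMoE", scheme_map, ignore))  # -> None
```

With a check like the first branch in place, an ignored MoE layer falls through to an unquantized method instead of vLLM assuming every layer is quantized.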

Test Plan

pytest tests/quantization/test_compressed_tensors.py::test_compressed_tensors_moe_ignore_with_model -vs -rs

Test Result

PASSED

=============================== warnings summary ===============================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================== 1 passed, 2 warnings in 25.78s ========================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

@kylesayrs @dsikka

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a bug where models with partially quantized Mixture-of-Experts (MoE) layers would fail to load. The fix involves refactoring the quantization scheme retrieval logic and explicitly handling unquantized MoE layers by introducing an UnquantizedFusedMoEMethod. The changes are logical and correctly solve the described problem. My main feedback is regarding the new logic for determining the MoE quantization scheme, which currently only checks the first expert and assumes all others are the same. This could lead to incorrect behavior for models with more complex or heterogeneous expert configurations. I've added a comment with a suggestion to make this more robust.
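The reviewer's concern about heterogeneous experts could be addressed with a guard along these lines. This is an illustrative sketch, not code from the PR; the function name and the scheme representation are assumptions:

```python
def resolve_moe_scheme(expert_schemes):
    """Given the scheme matched for each expert of one FusedMoE layer,
    return the shared scheme, or fail loudly if experts disagree,
    instead of silently using the first expert's scheme for all."""
    unique = {repr(s) for s in expert_schemes}
    if len(unique) > 1:
        raise ValueError(
            f"Experts matched different quantization schemes: {sorted(unique)}"
        )
    return expert_schemes[0]
```

A guard like this turns a silent mis-quantization of heterogeneous expert configs into an explicit load-time error.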

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@heheda12345 (Collaborator) commented:
CC @mgoin

@mgoin (Member) commented Nov 19, 2025

Can you update the title to be more clear?

@HDCharles changed the title from "[bugfix] ignore MoE layers" to "[bugfix] enable MoE quantized model loading to respect ignored layers" on Nov 19, 2025
@HDCharles requested a review from mgoin on Nov 19, 2025 19:44
@HDCharles force-pushed the 095_bug_ignore_moe branch 2 times, most recently from c1c625c to 7907935, on Nov 20, 2025 20:48
@HDCharles requested a review from kylesayrs on Nov 20, 2025 20:49
@mgoin changed the title from "[bugfix] enable MoE quantized model loading to respect ignored layers" to "[Bugfix] Make compressed-tensors MoEs respect ignored layers" on Nov 20, 2025
@mgoin added the "bug", "quantization", and "ready" labels on Nov 20, 2025
@mgoin (Member) commented Nov 20, 2025

Thanks for explaining, LGTM!

@mgoin (Member) commented Nov 21, 2025

@HDCharles the test failure looks related, PTAL:

[2025-11-21T00:24:56Z] FAILED quantization/test_compressed_tensors.py::test_compressed_tensors_moe_ignore_with_model - pydantic_core._pydantic_core.ValidationError: 2 validation errors for VllmConfig
[2025-11-21T00:24:56Z] scale_dtype
[2025-11-21T00:24:56Z]   Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
[2025-11-21T00:24:56Z]     For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
[2025-11-21T00:24:56Z] zp_dtype
[2025-11-21T00:24:56Z]   Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
[2025-11-21T00:24:56Z]     For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden

@dsikka (Contributor) commented Nov 21, 2025

FYI - the failure is because the config was generated using a newer compressed-tensors (ct) nightly, whereas vLLM uses ct 12.2.

We should use ct 12.2 for test configs until support is upgraded in vLLM (or simply remove the scale_dtype / zp_dtype fields).
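The second option, dropping the fields, could be done by stripping the offending keys from the serialized test config before vLLM parses it. A sketch (the helper name is made up; scale_dtype/zp_dtype come from the pydantic error log above):

```python
# Fields emitted by newer compressed-tensors nightlies that the pinned
# release rejects via pydantic's extra_forbidden check.
UNSUPPORTED_FIELDS = {"scale_dtype", "zp_dtype"}


def strip_unsupported(config):
    """Recursively drop config keys the older compressed-tensors
    release does not know about, so validation passes."""
    if isinstance(config, dict):
        return {
            k: strip_unsupported(v)
            for k, v in config.items()
            if k not in UNSUPPORTED_FIELDS
        }
    if isinstance(config, list):
        return [strip_unsupported(v) for v in config]
    return config


raw = {"quant": {"scale_dtype": None, "zp_dtype": None, "format": "int-quantized"}}
print(strip_unsupported(raw))  # -> {'quant': {'format': 'int-quantized'}}
```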

@mergify mergify bot added the ci/build label Nov 25, 2025
# we can only upgrade after this is resolved
# TODO(jerryzh168): resolve the above comment
- uv pip install --system torchao==0.13.0 --index-url https://download.pytorch.org/whl/cu129
- uv pip install --system conch-triton-kernels
A reviewer (Member) commented on the requirements change:
Why is this needed now if we didn't need this before? Is it needed for the new model somehow?

@HDCharles (Contributor, author) replied:
the new test needs this; otherwise there is no kernel available for the tiny model I made for the test

@mgoin mgoin merged commit df01eda into vllm-project:main Nov 27, 2025
55 checks passed
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…oject#28878)

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
…oject#28878)

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…oject#28878)

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>

Labels

bug, ci/build, ready


5 participants