[Bugfix] Make compressed-tensors MoEs respect ignored layers #28878
mgoin merged 9 commits into vllm-project:main
Conversation
Code Review
This pull request addresses a bug where models with partially quantized Mixture-of-Experts (MoE) layers would fail to load. The fix involves refactoring the quantization scheme retrieval logic and explicitly handling unquantized MoE layers by introducing an UnquantizedFusedMoEMethod. The changes are logical and correctly solve the described problem. My main feedback is regarding the new logic for determining the MoE quantization scheme, which currently only checks the first expert and assumes all others are the same. This could lead to incorrect behavior for models with more complex or heterogeneous expert configurations. I've added a comment with a suggestion to make this more robust.
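The reviewer's concern about checking only the first expert could be guarded roughly like this. This is a hedged sketch, not vLLM's actual code: `resolve_moe_scheme` and its input shape are hypothetical, but it shows the homogeneity check the review suggests — fail loudly when experts resolve to different schemes instead of silently applying the first expert's scheme to all of them.

```python
from __future__ import annotations

def resolve_moe_scheme(expert_schemes: dict[str, str | None]) -> str | None:
    """Illustrative helper: expert_schemes maps each expert weight name to its
    matched quantization scheme, or None when the expert is unquantized
    (e.g. listed in the ignore list)."""
    schemes = set(expert_schemes.values())
    if len(schemes) > 1:
        # Heterogeneous experts within one fused MoE layer cannot share a
        # single fused kernel, so refuse rather than guess from expert 0.
        raise ValueError(
            f"Heterogeneous expert quantization is unsupported: {schemes}"
        )
    # All experts agree; a single scheme (possibly None) covers the layer.
    return schemes.pop()
```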
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
Force-pushed from 20d9034 to 4f7fca1
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
CC @mgoin
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py
Can you update the title to be more clear?
Force-pushed from c1c625c to 7907935
Thanks for explaining, LGTM!
@HDCharles the test failure looks related, PTAL
FYI - the failure is because the config was generated using a newer compressed-tensors (ct) nightly, whereas ct 12.2 is used by vLLM. We should use 12.2 for test configs until support is upgraded in vLLM (or simply remove the scale_dtype / zp_dtype fields).
Force-pushed from ebc5ef4 to f91fd86
Applying quantization to some MoE layers but not others would cause model load errors, because vLLM assumed all layers were quantized and never checked the ignore list.

Changes:
- Added a helper function get_scheme_dict, used by get_scheme, so MoE and Linear share a single interface for matching layers
- MoE matching previously assumed the 'Linear' target applied to MoE; added a helper to add 'FusedMoE' to target_scheme_map and then match normally by either layer name or module type

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Force-pushed from 3e88c54 to 96dfdb4
# we can only upgrade after this is resolved
# TODO(jerryzh168): resolve the above comment
- uv pip install --system torchao==0.13.0 --index-url https://download.pytorch.org/whl/cu129
- uv pip install --system conch-triton-kernels
Why is this needed now if we didn't need this before? Is it needed for the new model somehow?
The new test needs this; otherwise there's no kernel for the tiny model I made for the test.
…oject#28878) Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
…oject#28878) Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com> Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
…oject#28878) Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Purpose
Applying quantization to some MoE layers but not others would cause model load errors, because vLLM assumed all layers were quantized and never checked the ignore list.
Changes
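A rough sketch of the described ignore-list handling. The function and variable names below are illustrative, not the PR's actual helpers: the point is that a single lookup, shared by Linear and FusedMoE layers, consults the ignore list before matching `target_scheme_map`, so ignored layers fall back to an unquantized method instead of crashing.

```python
import fnmatch

def get_scheme_for_layer(layer_name, layer_type, target_scheme_map, ignore):
    """Illustrative only: return the matched scheme for a layer, or None
    when the layer should stay unquantized."""
    # The missing check: ignored layers are left unquantized, so the caller
    # can use UnquantizedFusedMoEMethod / UnquantizedLinearMethod for them.
    if any(fnmatch.fnmatch(layer_name, pat) for pat in ignore):
        return None
    # Match by module type first (e.g. "FusedMoE", "Linear"), so MoE layers
    # no longer need to piggyback on the "Linear" target...
    if layer_type in target_scheme_map:
        return target_scheme_map[layer_type]
    # ...then fall back to matching by layer-name pattern.
    for target, scheme in target_scheme_map.items():
        if fnmatch.fnmatch(layer_name, target):
            return scheme
    return None
```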
Test Plan
pytest tests/quantization/test_compressed_tensors.py::test_compressed_tensors_moe_ignore_with_model -vs -rs
Test Result
@kylesayrs @dsikka