[Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) #25987
benchislett merged 7 commits into vllm-project:main from
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces a bugfix to allow skipping Mixture-of-Experts (MoE) layers during NVFP4 quantization, which is crucial for models like nvidia/DeepSeek-R1-FP4 when using Multi-Token Prediction (MTP).
The main changes are:
- In `vllm/model_executor/layers/quantization/modelopt.py`, `ModelOptNvFp4Config.get_quant_method` now checks whether an MoE layer is in the exclusion list and returns `None` if so.
- In `vllm/model_executor/layers/fused_moe/layer.py`, the `FusedMoE` layer's `__init__` method is updated to handle the `None` return from `get_quant_method` by falling back to the unquantized method, effectively skipping quantization for that layer.
- Several related changes in `deepseek_v2.py`, `deepseek_mtp.py`, and `deepseek_eagle.py` refactor how the model configuration is passed to `DeepseekV2DecoderLayer` to correctly support draft models in speculative decoding scenarios.
The changes are well-structured and correctly address the identified issue. The refactoring for config propagation is clean and necessary. The overall implementation looks solid.
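The skip-and-fallback pattern described above can be sketched as follows. This is an illustrative sketch, not the actual vLLM code: the class bodies, the `_matches` helper, and the string placeholders for quant methods are all hypothetical stand-ins for the real `ModelOptNvFp4Config`, `FusedMoE`, and quant-method objects.

```python
class NvFp4Config:
    """Hypothetical stand-in for ModelOptNvFp4Config (illustrative only)."""

    def __init__(self, exclude_modules):
        self.exclude_modules = exclude_modules

    def get_quant_method(self, prefix):
        # Return None when the layer matches an exclusion pattern,
        # signalling the caller to skip quantization for this layer.
        if any(self._matches(pat, prefix) for pat in self.exclude_modules):
            return None
        return "nvfp4_moe_method"  # placeholder for the real quant method

    @staticmethod
    def _matches(pattern, prefix):
        # hf_quant_config patterns may end in '*' (e.g. "model.layers.61*").
        if pattern.endswith("*"):
            return prefix.startswith(pattern[:-1])
        return prefix == pattern


class FusedMoE:
    """Hypothetical stand-in for the FusedMoE layer's __init__ handling."""

    def __init__(self, quant_config, prefix):
        method = quant_config.get_quant_method(prefix) if quant_config else None
        # Fall back to the unquantized path when the config skips this layer.
        self.quant_method = method if method is not None else "unquantized_method"


cfg = NvFp4Config(exclude_modules=["model.layers.61*"])
mtp_layer = FusedMoE(cfg, prefix="model.layers.61.mlp.experts")
dense_layer = FusedMoE(cfg, prefix="model.layers.10.mlp.experts")
print(mtp_layer.quant_method)    # excluded layer falls back to unquantized
print(dense_layer.quant_method)  # regular layer keeps the NVFP4 method
```

Before this fix, a `None` return at the `FusedMoE` construction site was not handled, so an excluded MoE layer (such as DeepSeek's MTP layer 61) had no usable quant method.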
mgoin left a comment:
Looks reasonable to me, thanks for the fix
@benchislett The basic model failure seems related
@benchislett please merge with main to fix the docker
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
There is no fallback in `ModelOptNvFp4Config.get_quant_method` for when the quant config should skip an MoE layer. This is a problem for nvidia/DeepSeek-R1-FP4 when running with MTP, since the entire MTP layer is left unquantized and should be skipped by quantization: https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json#L188
"exclude_modules": [
...
"model.layers.61*",
...
]
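The `"model.layers.61*"` entry uses a trailing wildcard. One simple way to test whether a module name matches such a pattern (illustrative only; not necessarily how ModelOpt or vLLM parse these patterns internally) is Python's standard-library `fnmatch`:

```python
from fnmatch import fnmatch

# Pattern list as it appears in hf_quant_config.json (abbreviated).
exclude_modules = ["model.layers.61*"]

def is_excluded(prefix: str) -> bool:
    # A module is excluded if any wildcard pattern matches its dotted name.
    return any(fnmatch(prefix, pat) for pat in exclude_modules)

print(is_excluded("model.layers.61.mlp.experts"))  # MTP layer: excluded
print(is_excluded("model.layers.10.mlp.experts"))  # regular layer: quantized
```

For DeepSeek-R1, layer 61 is the MTP head, so every submodule under `model.layers.61` (including its MoE experts) must be left unquantized.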
This PR includes some diff from #25953.
Testing
Evaluated in combination with #25984, see results there.