
[Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) #25987

Merged
benchislett merged 7 commits into vllm-project:main from CentML:fix-nvfp4-deepseek-mtp
Oct 6, 2025
Conversation

@benchislett (Collaborator) commented Sep 30, 2025

Purpose

There is no fallback in ModelOptNvFp4Config.get_quant_method for the case where the quant config should skip an MoE layer.

This is a problem for nvidia/DeepSeek-R1-FP4 when running with MTP, since the entire MTP layer is left unquantized and should therefore be excluded from quantization:

https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json#L188

"exclude_modules": [
    ...
    "model.layers.61*",
    ...
]
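For illustration, here is a minimal sketch of how a glob pattern like "model.layers.61*" can be matched against a module prefix using Python's fnmatch. The helper name is hypothetical, not vLLM's actual code:

```python
from fnmatch import fnmatch


def is_layer_excluded(prefix: str, exclude_modules: list[str]) -> bool:
    """Return True if the module prefix matches any exclusion glob."""
    return any(fnmatch(prefix, pattern) for pattern in exclude_modules)


exclude = ["model.layers.61*"]
# The MTP layer (layer 61) matches the glob; earlier layers do not.
print(is_layer_excluded("model.layers.61.mlp.experts", exclude))
print(is_layer_excluded("model.layers.60.mlp.experts", exclude))
```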

This PR includes part of the diff from #25953.

Testing

Evaluated in combination with #25984, see results there.

Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a bugfix to allow skipping Mixture-of-Experts (MoE) layers during NVFP4 quantization, which is crucial for models like nvidia/DeepSeek-R1-FP4 when using Multi-Token Prediction (MTP).

The main changes are:

  1. In vllm/model_executor/layers/quantization/modelopt.py, the ModelOptNvFp4Config.get_quant_method now checks if an MoE layer is in the exclusion list and returns None if so.
  2. In vllm/model_executor/layers/fused_moe/layer.py, the FusedMoE layer's __init__ method is updated to handle the None return from get_quant_method by falling back to the unquantized method, effectively skipping quantization for that layer.
  3. Several related changes in deepseek_v2.py, deepseek_mtp.py, and deepseek_eagle.py refactor how the model configuration is passed to DeepseekV2DecoderLayer to correctly support draft models in speculative decoding scenarios.

The changes are well-structured and correctly address the identified issue. The refactoring for config propagation is clean and necessary. The overall implementation looks solid.
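The two-part fix summarized above can be sketched roughly as follows. The class and method names follow the review summary, but the bodies and the placeholder method objects are illustrative assumptions, not vLLM's actual implementation:

```python
from fnmatch import fnmatch


class ModelOptNvFp4Config:
    """Illustrative stand-in for the real quant config class."""

    def __init__(self, exclude_modules: list[str]):
        self.exclude_modules = exclude_modules

    def _is_excluded(self, prefix: str) -> bool:
        return any(fnmatch(prefix, p) for p in self.exclude_modules)

    def get_quant_method(self, layer, prefix: str):
        # The fix: layers matching the exclusion list now return None
        # instead of unconditionally receiving an NVFP4 quant method.
        if self._is_excluded(prefix):
            return None
        return "nvfp4_moe_method"  # placeholder for the real quant method


class FusedMoE:
    """Illustrative: fall back to unquantized when get_quant_method is None."""

    def __init__(self, quant_config, prefix: str):
        method = None
        if quant_config is not None:
            method = quant_config.get_quant_method(self, prefix)
        # Fallback added by the PR: skip quantization for this layer.
        self.quant_method = method if method is not None else "unquantized_moe_method"


cfg = ModelOptNvFp4Config(["model.layers.61*"])
mtp_layer = FusedMoE(cfg, "model.layers.61.mlp.experts")    # excluded, unquantized
dense_layer = FusedMoE(cfg, "model.layers.10.mlp.experts")  # quantized as before
```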

@benchislett benchislett added the ready label Sep 30, 2025
@mgoin (Member) left a comment

Looks reasonable to me, thanks for the fix

@mgoin commented Oct 1, 2025

@benchislett The basic model failure seems related

[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1139, in <lambda>
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)     lambda prefix: DeepseekV2DecoderLayer(vllm_config, prefix,
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 997, in __init__
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)     config = config or vllm_config.model_config.hf_config
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_device.py", line 103, in __torch_function__
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)     return func(*args, **kwargs)
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151)            ^^^^^^^^^^^^^^^^^^^^^
[2025-10-01T00:28:07Z] (EngineCore_DP0 pid=151) RuntimeError: Boolean value of Tensor with more than one value is ambiguous
[2025-10-01T00:28:07Z] FAILED
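The failure comes from Python evaluating the truthiness of the config argument: `config or fallback` calls `bool(config)`, and a multi-element torch.Tensor raises rather than answer. A minimal reproduction without torch, using a stand-in class with the same ambiguous `__bool__`; an explicit `is None` check avoids the truthiness call entirely:

```python
class AmbiguousTensor:
    """Mimics torch.Tensor: bool() on a multi-element tensor raises."""

    def __bool__(self):
        raise RuntimeError(
            "Boolean value of Tensor with more than one value is ambiguous")


def pick_config_bad(config, fallback):
    return config or fallback  # calls bool(config): raises for tensors


def pick_config_safe(config, fallback):
    return config if config is not None else fallback  # identity check only


t = AmbiguousTensor()
try:
    pick_config_bad(t, "hf_config")
except RuntimeError as e:
    print("raised:", e)
print(pick_config_safe(t, "hf_config") is t)
```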

@benchislett benchislett added the bug Something isn't working label Oct 2, 2025
@mgoin commented Oct 2, 2025

@benchislett please merge with main to fix the docker

mergify bot commented Oct 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @benchislett.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 3, 2025
@mergify mergify bot removed the needs-rebase label Oct 3, 2025
@benchislett benchislett merged commit 2161efe into vllm-project:main Oct 6, 2025
62 checks passed
@benchislett benchislett deleted the fix-nvfp4-deepseek-mtp branch October 6, 2025 20:16
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025

Labels

bug (Something isn't working), deepseek (Related to DeepSeek models), ready (ONLY add when PR is ready to merge/full CI is needed), speculative-decoding
