[Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) #25987
benchislett merged 7 commits into vllm-project:main from
Conversation
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces a bugfix to allow skipping Mixture-of-Experts (MoE) layers during NVFP4 quantization, which is crucial for models like nvidia/DeepSeek-R1-FP4 when using Multi-Token Prediction (MTP).
The main changes are:
- In `vllm/model_executor/layers/quantization/modelopt.py`, `ModelOptNvFp4Config.get_quant_method` now checks whether an MoE layer is in the exclusion list and returns `None` if so.
- In `vllm/model_executor/layers/fused_moe/layer.py`, the `FusedMoE` layer's `__init__` method is updated to handle the `None` return from `get_quant_method` by falling back to the unquantized method, effectively skipping quantization for that layer.
- Several related changes in `deepseek_v2.py`, `deepseek_mtp.py`, and `deepseek_eagle.py` refactor how the model configuration is passed to `DeepseekV2DecoderLayer` to correctly support draft models in speculative decoding scenarios.
The changes are well-structured and correctly address the identified issue. The refactoring for config propagation is clean and necessary. The overall implementation looks solid.
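The skip-and-fallback pattern described above can be sketched as follows. This is an illustrative sketch, not the actual vLLM code: the class bodies, the `_matches` helper, and the string placeholders for quant methods are all hypothetical stand-ins for the real `ModelOptNvFp4Config`, `FusedMoE`, and quant-method objects.

```python
class NvFp4Config:
    """Hypothetical stand-in for ModelOptNvFp4Config (illustrative only)."""

    def __init__(self, exclude_modules):
        self.exclude_modules = exclude_modules

    def get_quant_method(self, prefix):
        # Return None when the layer matches an exclusion pattern,
        # signalling the caller to skip quantization for this layer.
        if any(self._matches(pat, prefix) for pat in self.exclude_modules):
            return None
        return "nvfp4_moe_method"  # placeholder for the real quant method

    @staticmethod
    def _matches(pattern, prefix):
        # hf_quant_config patterns may end in '*' (e.g. "model.layers.61*").
        if pattern.endswith("*"):
            return prefix.startswith(pattern[:-1])
        return prefix == pattern


class FusedMoE:
    """Hypothetical stand-in for the FusedMoE layer's __init__ handling."""

    def __init__(self, quant_config, prefix):
        method = quant_config.get_quant_method(prefix) if quant_config else None
        # Fall back to the unquantized path when the config skips this layer.
        self.quant_method = method if method is not None else "unquantized_method"


cfg = NvFp4Config(exclude_modules=["model.layers.61*"])
mtp_layer = FusedMoE(cfg, prefix="model.layers.61.mlp.experts")
dense_layer = FusedMoE(cfg, prefix="model.layers.10.mlp.experts")
print(mtp_layer.quant_method)    # excluded layer falls back to unquantized
print(dense_layer.quant_method)  # regular layer keeps the NVFP4 method
```

Before this fix, a `None` return at the `FusedMoE` construction site was not handled, so an excluded MoE layer (such as DeepSeek's MTP layer 61) had no usable quant method.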
mgoin left a comment:
Looks reasonable to me, thanks for the fix
@benchislett The basic model failure seems related
@benchislett please merge with main to fix the docker
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
There is no fallback in `ModelOptNvFp4Config.get_quant_method` for when the quant config should skip an MoE layer. This is a problem for nvidia/DeepSeek-R1-FP4 when running with MTP, since the entire MTP layer is left unquantized and should be skipped by quantization: https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json#L188
"exclude_modules": [
...
"model.layers.61*",
...
]
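The `"model.layers.61*"` entry uses a trailing wildcard. One simple way to test whether a module name matches such a pattern (illustrative only; not necessarily how ModelOpt or vLLM parse these patterns internally) is Python's standard-library `fnmatch`:

```python
from fnmatch import fnmatch

# Pattern list as it appears in hf_quant_config.json (abbreviated).
exclude_modules = ["model.layers.61*"]

def is_excluded(prefix: str) -> bool:
    # A module is excluded if any wildcard pattern matches its dotted name.
    return any(fnmatch(prefix, pat) for pat in exclude_modules)

print(is_excluded("model.layers.61.mlp.experts"))  # MTP layer: excluded
print(is_excluded("model.layers.10.mlp.experts"))  # regular layer: quantized
```

For DeepSeek-R1, layer 61 is the MTP head, so every submodule under `model.layers.61` (including its MoE experts) must be left unquantized.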
This PR includes some diff from #25953.
Testing
Evaluated in combination with #25984, see results there.