[Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue #24119
Conversation
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Code Review
This pull request correctly fixes a bug in the DeepseekV2MoE forward pass where routed_scaling_factor was being applied twice for non-fp16 data types. The self.experts layer (FusedMoE) already applies this scaling factor internally, so removing the redundant multiplication leaves the factor applied exactly once. I have also added a review comment highlighting a potential related scaling issue in the handling of shared_output that may require further investigation.
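For illustration, here is a minimal runnable sketch of the double application the review describes. The toy `fused_moe` stand-in and the factor value 2.5 are assumptions for demonstration, not vLLM's actual implementation:

```python
import torch

# Toy stand-in for FusedMoE: per the review, the fused layer already applies
# routed_scaling_factor internally. The identity "experts" are only for
# illustration; this is not vLLM's code.
def fused_moe(hidden_states: torch.Tensor, routed_scaling_factor: float) -> torch.Tensor:
    expert_out = hidden_states                    # pretend the routed experts are identity
    return expert_out * routed_scaling_factor     # the one intended application

routed_scaling_factor = 2.5                       # illustrative value
x = torch.randn(4, 8)

# Buggy pattern (non-fp16 path before the fix): the model forward multiplies again.
buggy = fused_moe(x, routed_scaling_factor) * routed_scaling_factor

# Fixed pattern: scale exactly once, inside the fused layer.
fixed = fused_moe(x, routed_scaling_factor)

print(torch.allclose(buggy, fixed * routed_scaling_factor))  # True: output was off by 2.5x
```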
final_hidden_states = self.experts(hidden_states=hidden_states,
                                   router_logits=router_logits)
This is handled here, right?
vllm/vllm/model_executor/layers/fused_moe/fused_moe.py
Lines 1014 to 1015 in 98aee61
Why does main not apply the routed_scaling_factor in the fp16 case? Is there an accuracy issue we need to handle?
Yes, so I switched to a different fix, since DeepSeek V2 has a lot of logic for `routed_scaling_factor` outside the experts layer.
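For context, a paraphrased sketch of the kind of model-side logic being referred to: the fp16 path avoids scaling the routed output directly (overflow risk) and compensates through the shared-expert branch instead. The details below are assumptions for illustration, not the exact DeepseekV2MoE source:

```python
from typing import Optional

import torch

def combine_moe_outputs(expert_out: torch.Tensor,
                        shared_output: Optional[torch.Tensor],
                        routed_scaling_factor: float) -> torch.Tensor:
    # Assumed sketch of dtype-dependent handling that lives outside FusedMoE.
    if expert_out.dtype == torch.float16 and shared_output is not None:
        # fp16: multiplying the routed output by the factor risks overflow, so the
        # shared branch is divided by it instead (relative weighting is preserved).
        shared_output = shared_output * (1.0 / routed_scaling_factor)
    # In the non-fp16 path the factor is already applied inside FusedMoE, so no
    # extra multiplication happens here (that extra multiply was the bug).
    return expert_out if shared_output is None else expert_out + shared_output
```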
Signed-off-by: yewentao256 <zhyanwentao@126.com>
sarckk left a comment
Thanks for the fix.
I think we also need this fix for dots1 and glm4_moe? Both use use_grouped_topk=True, so they should be affected by the double mul issue as well (see the sketch after the file references below):
vllm/vllm/model_executor/models/dots1.py
Line 163 in 98aee61
vllm/vllm/model_executor/models/glm4_moe.py
Line 189 in 98aee61
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@sarckk Nice find! Feel free to open a PR for this.
@yewentao256 Thanks for the fix. BTW, can you please add a test in our CI? I think we've seen similar accuracy issues 5+ times on DeepSeek...
Besides running the full DeepSeek model, is it possible to run a smaller model to detect such an issue, like DeepSeek V2, etc.?
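One possible shape for such a CI check, using lm-eval's Python API against a smaller served checkpoint; the model name, request settings, sample limit, and threshold below are illustrative assumptions, not an agreed CI configuration:

```python
import lm_eval

# Hypothetical CI-style accuracy check against a smaller MoE model served by vLLM.
# Model choice, concurrency, limit, and threshold are placeholders for illustration.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "base_url=http://127.0.0.1:9256/v1/completions,"
        "model=deepseek-ai/DeepSeek-V2-Lite,num_concurrent=32"
    ),
    tasks=["gsm8k"],
    limit=200,  # subset of GSM8K to keep CI time reasonable
)

acc = results["results"]["gsm8k"]["exact_match,strict-match"]
assert acc > 0.3, f"GSM8K accuracy regressed: {acc:.3f}"
```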
* 'main' of https://github.com/845473182/vllm: (457 commits)
  [BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models (vllm-project#24132)
  [Misc] Add check for dual_chunk_attention (vllm-project#24070)
  [Doc]: fix typos in Python comments (vllm-project#24115)
  [Doc]: fix typos in Python comments (vllm-project#24093)
  [Compile] Fix Compile Warning for `w4a8_mm_entry.cu` (vllm-project#23660)
  fix some typos (vllm-project#24071)
  [V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing (vllm-project#23656)
  Upgrade xgrammar to 0.1.23 (vllm-project#22988)
  Update release pipeline post PyTorch 2.8.0 update (vllm-project#24073)
  [XPU] Fix the bug of LoRA logits on the XPU platform (vllm-project#24081)
  [CI/Build] Disable SiluMul NVFP4 quant fusion tests (vllm-project#24121)
  [Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue (vllm-project#24119)
  [AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault (vllm-project#23692)
  [CI] Enable all hf transformers baselines in test_hybrid (vllm-project#23936)
  [Log] Only Print Profiler Results on Rank 0 (vllm-project#23370)
  Fix weights loading for Apertus (vllm-project#24100)
  [Metrics] Deprecate TPOT in favor of ITL (vllm-project#24110)
  [Bugfix] Fix packed_factor missing attribute error (vllm-project#23902)
  Run ruff format on a few files. (vllm-project#24075)
  [Bugfix] Fix transform_config parsing in Compressed Tensors (vllm-project#23945)
  ...
…-project#24119) Signed-off-by: yewentao256 <zhyanwentao@126.com>
Purpose
Fixes #24118, which was introduced by #23123.
Test
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-R1,num_concurrent=256" --tasks gsm8k

Origin
Now