[ROCm] Cast score correction bias tensor during model construction for DeepSeek/Kimi-K2#39999
Conversation
Code Review
This pull request improves the efficiency of the DeepSeek-V2 model by pre-casting the e_score_correction_bias during the initialization phase, which avoids repeated type conversions during each forward pass. Additionally, it adds assertions to the ROCm Aiter fused MoE layers to verify that the bias and gating output types match. Feedback highlights a potential issue where direct mutation of the parameter's data and the use of a static type attribute could cause the new assertions to fail if the model is cast to a different precision after initialization.
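To make the flagged failure mode concrete, here is a minimal toy sketch (the `Gate` class and its attributes are stand-ins for illustration, not the actual vLLM modules): once the bias has been cast in place at construction, casting the whole model to another precision moves the Parameter but not a statically stored `out_dtype`, so a dtype assertion comparing the two would trip.

```python
import torch
import torch.nn as nn

# Toy stand-in for a gate module that records its output dtype once at
# construction time (hypothetical names, not the actual vLLM classes).
class Gate(nn.Module):
    def __init__(self, out_dtype: torch.dtype = torch.float32) -> None:
        super().__init__()
        self.out_dtype = out_dtype  # static attribute, fixed at construction
        self.e_score_correction_bias = nn.Parameter(
            torch.zeros(8, dtype=torch.float32), requires_grad=False
        )

gate = Gate()

# Init-time cast: mutate .data so every holder of the Parameter sees it.
gate.e_score_correction_bias.data = gate.e_score_correction_bias.data.to(gate.out_dtype)

# If the model is later cast to a different precision, the Parameter follows
# the cast but the static out_dtype attribute does not...
gate.half()

# ...so an assertion like the ones added in the kernels would now fail.
assert gate.e_score_correction_bias.dtype == gate.out_dtype  # raises AssertionError
```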
@bnellnm could you take a final look?
```python
# Pre-cast the bias to match the gate output dtype so the
# conversion is not repeated on every forward pass. All
# downstream references (FusedMoE, router) share the same
# nn.Parameter object, so mutating .data propagates everywhere.
# Weight loading uses copy_(), which handles the dtype conversion.
# Only needed on ROCm where the aiter biased_grouped_topk kernel
# requires the bias dtype to match the gating output dtype.
if (
    self.is_rocm_aiter_moe_enabled
    and self.gate.e_score_correction_bias is not None
):
    self.gate.e_score_correction_bias.data = (
        self.gate.e_score_correction_bias.data.to(self.gate.out_dtype)
    )
```
I think this block of code could live in fused_moe/layer.py (with any additional appropriate checks, e.g. routing type)
I already explained in my previous comment why that's a harder change that I decided to skip. Let me elaborate with some details here:
Moving the bias pre-cast (lines 354-367) into `FusedMoE.__init__()` isn't standalone: it depends on `gate.set_out_dtype()`, which is called just above it, and that call relies on `self.experts.quant_method.is_monolithic` and `self.experts.routing_method_type`, both of which are only available after `FusedMoE.__init__()` completes. So both blocks (the `set_out_dtype()` call and the new bias dtype cast) would need to move together to the end of `FusedMoE.__init__()`.
The concern is that this becomes more invasive: every model passing `gate=` to `FusedMoE`, including qwen3_moe, qwen3_next, step3p5, and AXK1, would now have `set_out_dtype` called automatically in `FusedMoE.__init__()`, which changes their gate output dtype behavior even though they don't currently call `set_out_dtype` at all.
If this is not a big concern, I would like to leave this section as is to minimise the impact.
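For concreteness, a rough self-contained sketch of the relocation being discussed, using toy stand-ins (ToyGate, ToyFusedMoE, is_rocm_aiter_moe_enabled, and the set_out_dtype signature are simplified here and do not match the real vLLM code): both the out-dtype resolution and the bias cast would sit at the end of the FusedMoE-style constructor, which is why they would start running for every model that passes gate=.

```python
from typing import Optional

import torch
import torch.nn as nn

def is_rocm_aiter_moe_enabled() -> bool:
    # Placeholder for the real platform/feature check.
    return True

class ToyGate(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.out_dtype: Optional[torch.dtype] = None
        self.e_score_correction_bias = nn.Parameter(
            torch.zeros(8, dtype=torch.float32), requires_grad=False
        )

    def set_out_dtype(self, out_dtype: torch.dtype) -> None:
        self.out_dtype = out_dtype

class ToyFusedMoE(nn.Module):
    def __init__(self, gate: Optional[ToyGate] = None) -> None:
        super().__init__()
        # ... quant method / routing method would be set up here, so the
        # information needed by set_out_dtype() only exists from this point on.
        self.routing_method_type = "grouped_topk"
        if gate is not None:
            # Relocated block: runs for *every* model that passes gate=,
            # which is the blast-radius concern raised above.
            gate.set_out_dtype(torch.bfloat16)
            if is_rocm_aiter_moe_enabled() and gate.e_score_correction_bias is not None:
                gate.e_score_correction_bias.data = (
                    gate.e_score_correction_bias.data.to(gate.out_dtype)
                )

gate = ToyGate()
moe = ToyFusedMoE(gate=gate)
assert gate.e_score_correction_bias.dtype == torch.bfloat16
```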
bnellnm left a comment
LGTM but I think the initial casting code could probably live in layer.py since it seems like it is generally applicable to ROCm MoE and particular routing methods.
@tjtanaa there's a failing unit test, but from the looks of it I think it's unrelated to the changes in this PR. Let me know if I should investigate further.
Purpose
The MoE score correction bias tensor was being cast to the gate output dtype on every forward pass. The dtype this tensor needs to be cast to is known at model construction time and never changes afterwards, so the repeated cast is redundant work that launches an extra GPU kernel per MoE layer per forward call.
This PR moves the cast to model construction, eliminating the per-forward-pass overhead.
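As a rough, toy-level illustration of the before/after pattern (plain tensors standing in for the real router code, not the actual vLLM call sites): the per-call `.to()` launches an elementwise cast kernel on every forward, while casting once up front removes it from the hot path and leaves only a cheap dtype check.

```python
import torch

bias = torch.zeros(256, dtype=torch.float32)       # stand-in for e_score_correction_bias
gating_output = torch.randn(4, 256, dtype=torch.bfloat16)

# Before: the cast sits on the forward path, so an elementwise kernel is
# launched on every call.
def route_before(gating_output: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return gating_output + bias.to(gating_output.dtype)

# After: cast once, up front (in the PR this happens at model construction);
# the forward path only asserts that the dtypes already match.
bias = bias.to(torch.bfloat16)

def route_after(gating_output: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    assert bias.dtype == gating_output.dtype
    return gating_output + bias
```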
Summary
Before / After: profiler trace screenshots (in the original PR).
The trace shows the elementwise kernel responsible for this typecast running before the grouped-topk operation in every forward pass. With the following changes the typecast is moved to model construction, avoiding the call on every forward pass:
- `vllm/model_executor/models/deepseek_v2.py`: Pre-cast `e_score_correction_bias` to match `gate.out_dtype` during `DeepseekV2MoE` construction. Since all downstream consumers (FusedMoE, router) share the same `nn.Parameter` object, this single mutation propagates everywhere (a small sketch of this shared-Parameter behavior follows the list).
- `vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py`: Replace the runtime `.to()` cast with an assert that the bias dtype already matches the gating output dtype, catching any future regression where the init-time cast is missed.
- `vllm/model_executor/layers/fused_moe/router/fused_topk_bias_router.py`: Same change, replacing the runtime `.to()` cast with a matching assert.
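A tiny check of the shared-Parameter claim in the first bullet, using toy modules only (not the actual vLLM classes): because every holder references the same `nn.Parameter`, a single in-place `.data` mutation is visible to all of them.

```python
import torch
import torch.nn as nn

class Holder(nn.Module):
    """Toy module that simply holds a shared Parameter."""
    def __init__(self, bias: nn.Parameter) -> None:
        super().__init__()
        self.e_score_correction_bias = bias

shared_bias = nn.Parameter(torch.zeros(4, dtype=torch.float32), requires_grad=False)
gate, experts = Holder(shared_bias), Holder(shared_bias)

# One in-place mutation of .data ...
shared_bias.data = shared_bias.data.to(torch.bfloat16)

# ... is observed by every module holding the same Parameter object.
assert gate.e_score_correction_bias.dtype == torch.bfloat16
assert experts.e_score_correction_bias.dtype == torch.bfloat16
```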
Test Result
Accuracy
Performance
Test Plan