[ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling#31931
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request correctly addresses an accuracy regression on ROCm for MoE models by reordering operations in the fused_moe_kernel. The change ensures that router weight multiplication is performed on float32 accumulators before any precision conversion to compute_type (e.g., bf16/fp16). This prevents precision loss that was causing non-deterministic expert routing on ROCm. The logic is sound, well-explained in the PR description, and the change is correctly localized. The corresponding fused_moe_kernel_gptq_awq kernel already has the correct order of operations, so no changes are needed there. The updated comment for bias addition also improves clarity. The changes look good to me.
Thanks @AndreasKaratzas. I have a question: is this PR mostly for LoRA scenarios? In fact, the ordering: Note: platform is ROCm MI355.
@xuebwang-amd Could you help me with testing accuracy? Is there a command you're aware of?
I just tested on MI325 using
Let's prioritize ensuring accuracy. I have added a unit test
If your new PR gets merged, please test it before merging it with
Oh, I'll see if I can reproduce on MI355. If so, maybe we need more robust logic here (covering LoRA, gpt-oss, etc.).
On MI355, both
For MI325, I am looking for an available resource to run the tests. Will update them later.
I'm getting the feeling that this is architectural. I bisected the bug before I commented on your PR and made this bug fix. On MI325, the results were the exact opposite (except the gpt test, which I didn't have in mind). If you make another related PR, please CC me so that I can test on MI325 as well :) Also, for the #29008 PR, PM me so that we can coordinate an effort to run an eval with those changes on MI325 too. It's important that the CI is as green as possible.
Here is a new PR #31962 about the computation flow. I'd welcome your feedback and review. Thanks.
… weight scaling (vllm-project#31931) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
… weight scaling (vllm-project#31931) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Fixes MoE accuracy regression on ROCm introduced in #31676.
Problem
PR #31676 reordered post-accumulation operations to add bias after dequantization. However, this also moved `MUL_ROUTED_WEIGHT` after the `.to(compute_type)` precision conversion, causing router weight multiplication to occur in bf16/fp16 instead of float32.

This precision loss causes different outputs on ROCm due to differences in mixed-precision handling between the ROCm and CUDA Triton backends. The issue manifests as non-deterministic expert routing, particularly visible in LoRA mixed-adapter tests.
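To make the precision loss concrete, here is a minimal sketch (not vLLM code) that models the float32 → bf16 conversion as mantissa truncation; real conversions typically round to nearest-even, but the same class of discrepancy arises. The accumulator value and router weight are illustrative only:

```python
import struct

def to_bf16(x: float) -> float:
    # Model float32 -> bfloat16 by truncating the low 16 mantissa bits.
    # (Simplification: hardware conversions usually round-to-nearest-even.)
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

acc = 1.0077  # hypothetical float32 accumulator value
w = 100.0     # router weight, exactly representable in bf16

correct = to_bf16(acc * w)          # multiply in float32, then downcast
broken = to_bf16(to_bf16(acc) * w)  # downcast first, then multiply

print(correct, broken)  # 100.5 100.0
```

A single such mismatch is enough to flip a top-k comparison in the router, which is how the reduced-precision ordering shows up as non-deterministic expert selection.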
Root Cause
Before #31676 (correct):
After #31676 (broken on ROCm):
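In the absence of the original diffs, the two orderings can be sketched in plain Python (hypothetical helper names; `trunc_bf16` stands in for the kernel's `.to(compute_type)` step and uses truncation for illustration):

```python
import struct

def trunc_bf16(x: float) -> float:
    # Stand-in for .to(compute_type): drop the low 16 mantissa bits
    # of a float32 value (bf16 via truncation, for illustration only).
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def epilogue_broken(acc, routed_weight, bias):
    # After #31676: convert first, so the routed-weight multiply
    # happens in reduced precision.
    return trunc_bf16(acc) * routed_weight + bias

def epilogue_fixed(acc, routed_weight, bias):
    # This PR: multiply on the float32 accumulator, then convert,
    # keeping the bias add after the conversion as #31676 intended.
    return trunc_bf16(acc * routed_weight) + bias

# Illustrative values where the two orderings disagree:
print(epilogue_fixed(1.0077, 100.0, 0.0))   # 100.5
print(epilogue_broken(1.0077, 100.0, 0.0))  # 100.0
```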
Fix
Restore `MUL_ROUTED_WEIGHT` before the precision conversion while preserving the correct bias placement after dequantization.

Testing
`test_olmoe_lora_mixed` passes on ROCm.

Fixes CI regression from #31676.