
[ROCm][LoRA] Fix MoE accuracy regression by preserving float32 router weight scaling#31931

Merged
tjtanaa merged 1 commit into vllm-project:main from ROCm:akaratza_ci_lora
Jan 8, 2026

Conversation

@AndreasKaratzas (Collaborator) commented Jan 7, 2026

Fixes MoE accuracy regression on ROCm introduced in #31676.

Problem

PR #31676 reordered post-accumulation operations to add bias after dequantization. However, this also moved MUL_ROUTED_WEIGHT after the .to(compute_type) precision conversion, causing router weight multiplication to occur in bf16/fp16 instead of float32.

This precision loss causes different outputs on ROCm due to differences in mixed-precision handling between ROCm and CUDA Triton backends. The issue manifests as non-deterministic expert routing, particularly visible in LoRA mixed-adapter tests.

Root Cause

Before #31676 (correct):

```python
accumulator = accumulator * moe_weight      # float32
accumulator = accumulator.to(compute_type)  # float32 -> bf16
```

After #31676 (broken on ROCm):

```python
accumulator = accumulator.to(compute_type)  # float32 -> bf16
accumulator = accumulator * moe_weight      # bf16 x float32
```
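To see why the ordering matters, here is a small illustrative numpy sketch (not the Triton kernel; float16 stands in for bf16 because numpy has no native bfloat16, and all values are made up). Downcasting before the multiply rounds both operands and then the product, compounding the error relative to scaling in float32 first:

```python
import numpy as np

rng = np.random.default_rng(0)
acc = rng.uniform(0.5, 4.0, size=10_000).astype(np.float32)  # stand-in accumulator values
w = rng.uniform(0.1, 1.0, size=10_000).astype(np.float32)    # stand-in router weights

ref = acc.astype(np.float64) * w.astype(np.float64)          # high-precision reference

# Order restored by this PR: multiply in float32, then downcast once.
good = (acc * w).astype(np.float16)

# Order after #31676 (broken on ROCm): downcast first, then multiply in low precision.
bad = acc.astype(np.float16) * w.astype(np.float16)

good_err = np.abs(good.astype(np.float64) - ref).mean()
bad_err = np.abs(bad.astype(np.float64) - ref).mean()
# Typically good_err < bad_err: the early downcast introduces extra rounding
# before the multiply even happens.
```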

Fix

Restore MUL_ROUTED_WEIGHT before the precision conversion while preserving the correct bias placement after dequantization:

```python
# Router weights in float32 (before conversion)
if MUL_ROUTED_WEIGHT:
    accumulator = accumulator * moe_weight[:, None]

# Dequantization + precision conversion
accumulator = accumulator.to(compute_type)

# Bias after dequantization (unchanged from #31676)
if HAS_BIAS:
    accumulator = accumulator + bias
```
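As a rough numpy mock-up of the fixed epilogue order (illustrative only: float16 stands in for bf16, and the tile shape, accumulator, moe_weight, and bias values are invented for the sketch, not taken from the kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 4, 8                                                    # made-up tile shape
accumulator = rng.standard_normal((M, N)).astype(np.float32)   # stand-in accumulator
moe_weight = rng.uniform(0.1, 1.0, size=M).astype(np.float32)  # per-row router weights
bias = rng.standard_normal(N).astype(np.float16)               # stand-in bias
compute_type = np.float16                                      # stands in for bf16

# 1. Router weight scaling while the accumulator is still float32
accumulator = accumulator * moe_weight[:, None]

# 2. Precision conversion (mirrors .to(compute_type) in the kernel)
accumulator = accumulator.astype(compute_type)

# 3. Bias added after the conversion, in compute_type
accumulator = accumulator + bias
```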

Testing

  • test_olmoe_lora_mixed passes on ROCm

Fixes CI regression from: #31676

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas
Copy link
Collaborator Author

cc @tjtanaa @xuebwang-amd

@mergify (bot) added the `rocm` (Related to AMD ROCm) label Jan 7, 2026
@gemini-code-assist (bot) left a comment


Code Review

This pull request correctly addresses an accuracy regression on ROCm for MoE models by reordering operations in the fused_moe_kernel. The change ensures that router weight multiplication is performed on float32 accumulators before any precision conversion to compute_type (e.g., bf16/fp16). This prevents precision loss that was causing non-deterministic expert routing on ROCm. The logic is sound, well-explained in the PR description, and the change is correctly localized. The corresponding fused_moe_kernel_gptq_awq kernel already has the correct order of operations, so no changes are needed there. The updated comment for bias addition also improves clarity. The changes look good to me.

@tjtanaa (Collaborator) left a comment


LGTM

@tjtanaa added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label Jan 8, 2026
@tjtanaa enabled auto-merge (squash) January 8, 2026 02:23
@xuebwang-amd (Contributor) commented Jan 8, 2026

Thanks @AndreasKaratzas. I have a question: is this PR mostly for LoRA scenarios?

In fact, the ordering MUL_ROUTED_WEIGHT -> to(compute_type) -> HAS_BIAS (bias add) was tested on my end on the model amd/gpt-oss-20b-WFP8-AFP8-KVFP8, giving all-zero accuracies.
While the ordering to(compute_type) -> HAS_BIAS (bias add) -> MUL_ROUTED_WEIGHT gives reasonable accuracy numbers, as listed in PR #31676.

Note: platform is ROCm MI355.

@AndreasKaratzas (Collaborator, Author) commented Jan 8, 2026

@xuebwang-amd Could you help me with testing accuracy? Is there a command that you may be aware of?

> Thanks @AndreasKaratzas. I have a question: is this PR mostly for LoRA scenarios?
>
> In fact, the ordering MUL_ROUTED_WEIGHT -> to(compute_type) -> HAS_BIAS (bias add) was tested on my end on the model amd/gpt-oss-20b-WFP8-AFP8-KVFP8, giving all-zero accuracies. While the ordering to(compute_type) -> HAS_BIAS (bias add) -> MUL_ROUTED_WEIGHT gives reasonable accuracy numbers, as listed in PR #31676.
>
> Note: platform is ROCm MI355.

I just tested on MI325 using `pytest -s -v lora/test_olmoe_tp.py::test_olmoe_lora_mixed`, which tests for accuracy, and the ordering you propose does not pass. So maybe it's a gpt-oss or MI355 fluke. Is there any other model we could test?

@xuebwang-amd (Contributor)

> @xuebwang-amd Could you help me with testing accuracy? Is there a command that you may be aware of? Or shall I simply change the order as you suggest?

Let's prioritize ensuring accuracy. I have added a unit test, tests/models/quantization/test_gpt_oss.py, which is coming soon (in PR #29008).

@tjtanaa merged commit c4041f3 into vllm-project:main Jan 8, 2026
57 checks passed
@AndreasKaratzas deleted the akaratza_ci_lora branch January 8, 2026 04:18
@AndreasKaratzas (Collaborator, Author) commented Jan 8, 2026

> @xuebwang-amd Could you help me with testing accuracy? Is there a command that you may be aware of? Or shall I simply change the order as you suggest?
>
> Let's prioritize ensuring accuracy. I have added a unit test, tests/models/quantization/test_gpt_oss.py, which is coming soon (in PR #29008).

If your new PR gets merged, please test it with `pytest -s -v lora/test_olmoe_tp.py::test_olmoe_lora_mixed` before merging. Also ensure that tests pass on MI325 as well. I can help you with that.

@xuebwang-amd (Contributor)

> @xuebwang-amd Could you help me with testing accuracy? Is there a command that you may be aware of?
>
> Thanks @AndreasKaratzas. I have a question: is this PR mostly for LoRA scenarios?
> In fact, the ordering MUL_ROUTED_WEIGHT -> to(compute_type) -> HAS_BIAS (bias add) was tested on my end on the model amd/gpt-oss-20b-WFP8-AFP8-KVFP8, giving all-zero accuracies. While the ordering to(compute_type) -> HAS_BIAS (bias add) -> MUL_ROUTED_WEIGHT gives reasonable accuracy numbers, as listed in PR #31676.
> Note: platform is ROCm MI355.
>
> I just tested on MI325 using `pytest -s -v lora/test_olmoe_tp.py::test_olmoe_lora_mixed`, which tests for accuracy, and the ordering you propose does not pass. So maybe it's a gpt-oss or MI355 fluke. Is there any other model we could test?

Oh, I'll see if I can reproduce on MI355. If so, maybe we need more robust logic here (covering LoRA, gpt-oss, etc.).

@xuebwang-amd (Contributor)

> @xuebwang-amd Could you help me with testing accuracy? Is there a command that you may be aware of? Or shall I simply change the order as you suggest?
>
> Let's prioritize ensuring accuracy. I have added a unit test, tests/models/quantization/test_gpt_oss.py, which is coming soon (in PR #29008).
>
> If your new PR gets merged, please test it with `pytest -s -v lora/test_olmoe_tp.py::test_olmoe_lora_mixed` before merging. Also ensure that tests pass on MI325 as well. I can help you with that.

On MI355, for both `pytest -s -v lora/test_olmoe_tp.py::test_olmoe_lora_mixed` and the accuracy test on amd/gpt-oss-20b-WFP8-AFP8-KVFP8:

  • the order in this PR, MUL_ROUTED_WEIGHT -> to(compute_type) -> HAS_BIAS (bias add): failed
  • the order in PR #31676, to(compute_type) -> HAS_BIAS (bias add) -> MUL_ROUTED_WEIGHT: passed

For MI325, I am looking for an available resource to run the tests. Will update later.

@AndreasKaratzas (Collaborator, Author)

> @xuebwang-amd Could you help me with testing accuracy? Is there a command that you may be aware of? Or shall I simply change the order as you suggest?
>
> Let's prioritize ensuring accuracy. I have added a unit test, tests/models/quantization/test_gpt_oss.py, which is coming soon (in PR #29008).
>
> If your new PR gets merged, please test it with `pytest -s -v lora/test_olmoe_tp.py::test_olmoe_lora_mixed` before merging. Also ensure that tests pass on MI325 as well. I can help you with that.
>
> On MI355, for both `pytest -s -v lora/test_olmoe_tp.py::test_olmoe_lora_mixed` and the accuracy test on amd/gpt-oss-20b-WFP8-AFP8-KVFP8:
>
>   • the order in this PR, MUL_ROUTED_WEIGHT -> to(compute_type) -> HAS_BIAS (bias add): failed
>   • the order in PR #31676, to(compute_type) -> HAS_BIAS (bias add) -> MUL_ROUTED_WEIGHT: passed
>
> For MI325, I am looking for an available resource to run the tests. Will update later.

I'm getting the feeling that this is architectural. I bisected the bug before I commented on your PR and made this bug fix. On MI325, the results were the exact opposite (excluding the gpt-oss test, which I didn't have in mind). If you make another related PR, please CC me so that I can test on MI325 as well :)

Also for PR #29008, PM me so that we can coordinate an effort to run an eval with those changes on MI325 too. It's important that the CI is as green as possible.

@xuebwang-amd (Contributor)

> I'm getting the feeling that this is architectural. I bisected the bug before I commented on your PR and made this bug fix. On MI325, the results were the exact opposite (excluding the gpt-oss test, which I didn't have in mind). If you make another related PR, please CC me so that I can test on MI325 as well :)
>
> Also for PR #29008, PM me so that we can coordinate an effort to run an eval with those changes on MI325 too. It's important that the CI is as green as possible.

Here is a new PR #31962 about the computation flow. I'd welcome your feedback and review. Thanks.

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
… weight scaling (vllm-project#31931)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
… weight scaling (vllm-project#31931)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
… weight scaling (vllm-project#31931)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
… weight scaling (vllm-project#31931)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm
