
[ROCm][Bugfix] Fix accuracy issue on fmoe when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS enabled#31523

Merged
tjtanaa merged 5 commits into vllm-project:main from ROCm:ganyi/fix_fmoe_acc
Dec 30, 2025

Conversation

ganyi1996ppo (Contributor) commented Dec 30, 2025

Purpose

This PR fixes the accuracy issue introduced by #31221, which caused every model with an MoE structure to suffer a serious accuracy regression when shared-experts fusion is enabled. This PR adds `+grouped_topk` to `custom_ops` so that the dispatch actually happens.
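For context, the failure mode here is that a custom op only dispatches to its backend-specific forward when it is opted in via the `custom_ops` list; otherwise it silently falls back to the native path, which in this case produced wrong MoE routing. The sketch below illustrates that list-driven dispatch pattern with a hypothetical class; it is not vLLM's actual `CustomOp` implementation.

```python
# Minimal sketch of list-driven custom-op dispatch (hypothetical class,
# not vLLM's actual CustomOp implementation).

class GroupedTopK:
    def __init__(self, custom_ops):
        # The op only takes the ROCm/HIP path when it is opted in via
        # "+grouped_topk" in the custom_ops list.
        self.use_custom = "+grouped_topk" in custom_ops

    def forward_native(self, x):
        return ("native", x)

    def forward_hip(self, x):
        return ("hip", x)

    def forward(self, x):
        # Without the opt-in, the HIP path is silently skipped.
        if self.use_custom:
            return self.forward_hip(x)
        return self.forward_native(x)

# Without "+grouped_topk", the AITER/HIP path is never taken:
assert GroupedTopK([]).forward(1)[0] == "native"
assert GroupedTopK(["+grouped_topk"]).forward(1)[0] == "hip"
```

The fix in this PR amounts to making the `"+grouped_topk"` opt-in the default on ROCm whenever AITER fused MoE is enabled.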

Test Plan

Test DeepSeek-R1 on GSM8K.

Test Result

Current vLLM main branch:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0099|±  |0.0027|
|     |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

With this fix:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9477|±  |0.0061|
|     |       |strict-match    |     5|exact_match|↑  |0.9454|±  |0.0063|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: ganyi <ygan@amd.com>
mergify bot added the `rocm` (Related to AMD ROCm) label on Dec 30, 2025
gemini-code-assist bot left a comment:

Code Review

This pull request addresses a critical accuracy issue on ROCm for MoE models when VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS is enabled. The fix involves enabling the grouped_topk custom op to ensure correct kernel dispatch. While the approach is correct, the implementation could be more robust. I've suggested an improvement to handle cases where a user might have explicitly disabled this custom op, which would otherwise cause the fix to fail silently.

chatgpt-codex-connector bot left a comment:

💡 Codex Review

Here are some automated review suggestions for this pull request.

Signed-off-by: ganyi <ygan@amd.com>
tjtanaa (Collaborator) commented Dec 30, 2025

@ganyi1996ppo can you check whether, when AITER fused MoE is enabled, we are using `rocm_aiter_grouped_topk` by default? When VLLM_ROCM_USE_AITER=1, we always consider using `forward_hip` from AITER. The custom op PR might have made it so that when `+grouped_topk` is not added to the custom op list, then even with VLLM_ROCM_USE_AITER=1 it most likely will still not use `forward_hip`. Can you validate this and try to fix it in this PR as well?

This is the reason I am very reluctant to change everything to custom ops. Since the default behaviour is now controlled through the custom_op list, we will need to add explicit custom op unit tests to catch this: regular unit tests still pass on the wrong code path (`forward_native`).

@ProExpertProg @MengqingCao @xinyu-intel Can we ask developers who are migrating ops to custom ops in their PRs to write unit tests for the custom op classes on all platforms? Regular unit tests still pass on the wrong code path (`forward_native`) because they only test for correctness.
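As a sketch of the kind of dispatch-path unit test being suggested here: assert which `forward_*` variant is selected, not just that the output is numerically correct. All names below are hypothetical; a real test would construct the actual custom op class and check its selected implementation on each platform.

```python
# Sketch of a dispatch-path unit test (hypothetical op class): verify
# WHICH forward variant is chosen, not just numerical correctness.

class FakeGroupedTopKOp:
    def __init__(self, custom_ops):
        self.enabled = "+grouped_topk" in custom_ops

    def selected_impl(self):
        # Name of the forward variant that dispatch would pick.
        return "forward_hip" if self.enabled else "forward_native"

def test_grouped_topk_uses_hip_path_when_enabled():
    assert FakeGroupedTopKOp(["+grouped_topk"]).selected_impl() == "forward_hip"

def test_grouped_topk_falls_back_without_flag():
    # A correctness-only test would pass either way; this assertion is
    # what catches the silent fallback to forward_native.
    assert FakeGroupedTopKOp([]).selected_impl() == "forward_native"

test_grouped_topk_uses_hip_path_when_enabled()
test_grouped_topk_falls_back_without_flag()
```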

ganyi1996ppo (Author) commented Dec 30, 2025

@ganyi1996ppo can you check if when aiter fused moe is enabled, are we using rocm_aiter_grouped_topk by default? Because when VLLM_ROCM_USE_AITER=1, we always consider using the forward_hip from AITER. The custom op PR might have made it so that when +grouped_topk is not added to the custom op list, so even if we add VLLM_ROCM_USE_AITER=1, it most likely will still not use the forward_hip. Can you validate this and try to fix it in this PR as well?

This is the reason why I am very reluctant to change all to custom ops. The default behaviour now is controlled through custom_op list.

@tjtanaa We might still fall back to the native path when fused shared experts are not enabled, but what you said is exactly what I thought: we should add `+grouped_topk` whenever AITER is enabled.

Signed-off-by: ganyi <ygan@amd.com>
ganyi1996ppo (Author) commented:

@tjtanaa please take another look; I have made `+grouped_topk` a default configuration.

compilation_config = vllm_config.compilation_config
parallel_config = vllm_config.parallel_config
is_eager_execution = compilation_config.cudagraph_mode == CUDAGraphMode.NONE
use_aiter = rocm_aiter_ops.is_enabled()
tjtanaa (Collaborator) commented Dec 30, 2025

Let's call `rocm_aiter_ops.is_fused_moe_enabled()` directly, as described in the docstring of `_aiter_ops` (https://github.com/vllm-project/vllm/blob/51085c2aebe7df2c53dbe9a44d89bc3ee761793f/vllm/_aiter_ops.py#L772C8-L772C42): top-k routing functions are part of the MoE-related ops.

ganyi1996ppo (Author) replied:

Sure

use_aiter_rms_norm = rocm_aiter_ops.is_rmsnorm_enabled()
use_aiter_fp8_linear = rocm_aiter_ops.is_linear_fp8_enabled()
use_aiter_shared_expert = (
rocm_aiter_ops.is_fused_moe_enabled()
tjtanaa (Collaborator) commented:
We don't need the `and` because `rocm_aiter_ops.is_fusion_moe_shared_experts_enabled()` already checks for both, as shown in this line:

return cls.is_fused_moe_enabled() and cls._MOE_SHARED_EXPERTS_ENABLED

ganyi1996ppo (Author) replied:
Sure, done.

Signed-off-by: ganyi <ygan@amd.com>
tjtanaa (Collaborator) left a comment:

LGTM.

tjtanaa added the `ready` (ONLY add when PR is ready to merge/full CI is needed) label on Dec 30, 2025
@tjtanaa tjtanaa enabled auto-merge (squash) December 30, 2025 08:28
tjtanaa merged commit 1a834df into vllm-project:main on Dec 30, 2025
46 of 47 checks passed
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Jan 9, 2026
…USION_SHARED_EXPERTS` enabled (vllm-project#31523)

Signed-off-by: ganyi <ygan@amd.com>
akh64bit pushed a commit to akh64bit/vllm that referenced this pull request Jan 16, 2026
…USION_SHARED_EXPERTS` enabled (vllm-project#31523)

Signed-off-by: ganyi <ygan@amd.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
…USION_SHARED_EXPERTS` enabled (vllm-project#31523)

Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…USION_SHARED_EXPERTS` enabled (vllm-project#31523)

Signed-off-by: ganyi <ygan@amd.com>

Labels

`ready` (ONLY add when PR is ready to merge/full CI is needed), `rocm` (Related to AMD ROCm)

2 participants