[ROCm][Bugfix]: Only save unpadded sizes for shared_experts in MoERunner to fix rmsnorm pad fusion#34636

Merged
vllm-bot merged 5 commits intovllm-project:mainfrom
ROCm:fix_fused_rmsnorm_pad
Feb 21, 2026
Conversation


@Rohan138 Rohan138 commented Feb 16, 2026

Purpose

#32344 introduced a silent regression for gpt-oss on ROCm by disabling the pattern matching for the RMSNorm+padding fusion introduced in #30976.

Specifically, passing original_hidden_states into the moe_forward custom op breaks pattern matching for AddAiterRMSNormPadPattern, since there is an additional user (the auto_functionalized(moe_forward) node) of the unpadded hidden-states output from the RMSNorm:

constant_pad_nd(getitem, [0, 192], 0.0) multiple_users pattern CallFunction(operator.getitem, CallFunction(vllm.rocm_aiter_rmsnorm2d_fwd_with_add.default, KeywordArg('input'), KeywordArg('residual'), KeywordArg('weight'), *, _users=2), 0, _users=2)
does not match node %getitem : [num_users=3] = call_function[target=operator.getitem](args = (%rocm_aiter_rmsnorm2d_fwd_with_add, 0), kwargs = {})
with users {constant_pad_nd: None, auto_functionalized: None, rocm_unquantized_gemm_1: None} 

Full pattern:
MultiOutputPattern([CallFunction(aten.constant_pad_nd.default, CallFunction(operator.getitem, CallFunction(vllm.rocm_aiter_rmsnorm2d_fwd_with_add.default, KeywordArg('input'), KeywordArg('residual'), KeywordArg('weight'), *, _users=2), 0, _users=2), *, *), CallFunction(operator.getitem, CallFunction(vllm.rocm_aiter_rmsnorm2d_fwd_with_add.default, KeywordArg('input'), KeywordArg('residual'), KeywordArg('weight'), *, _users=2), 1), CallFunction(vllm.rocm_unquantized_gemm.default, CallFunction(operator.getitem, CallFunction(vllm.rocm_aiter_rmsnorm2d_fwd_with_add.default, KeywordArg('input'), KeywordArg('residual'), KeywordArg('weight'), *, _users=2), 0, _users=2), KeywordArg('router_weight'), KeywordArg('router_bias'))])
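To see why the extra user defeats the match, note that the captured pattern pins the RMSNorm getitem output to exactly two consumers (_users=2): constant_pad_nd and the router GEMM. A torch-free toy sketch of that user-count check (Node and the matcher below are illustrative stand-ins for torch.fx concepts, not the real torch._inductor matcher):

```python
class Node:
    """Toy stand-in for a torch.fx graph node."""

    def __init__(self, name):
        self.name = name
        self.users = []  # nodes that consume this node's output


def pattern_matches(rmsnorm_out, expected_users=2):
    # The captured pattern requires the RMSNorm output to feed exactly
    # two consumers: constant_pad_nd and rocm_unquantized_gemm.
    return len(rmsnorm_out.users) == expected_users


out = Node("getitem(rocm_aiter_rmsnorm2d_fwd_with_add, 0)")
out.users = [Node("constant_pad_nd"), Node("rocm_unquantized_gemm")]
ok_before = pattern_matches(out)  # True: exactly the two expected users

# Passing original_hidden_states to moe_forward adds a third consumer:
out.users.append(Node("auto_functionalized(moe_forward)"))
ok_after = pattern_matches(out)  # False: num_users=3 defeats the match
```

This mirrors the log above: the matcher reports num_users=3 with users {constant_pad_nd, auto_functionalized, rocm_unquantized_gemm_1}, so the fusion never fires.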

IMO this is a cleaner fix than rewriting the whole pass, especially since, even without the fusion, keeping original_hidden_states alive for the moe_forward op is memory overhead we don't actually need. Alternatively, we could give the moe_forward and moe_forward_shared ops different call signatures, as before:


        if self.shared_experts is None:
            fused_output = torch.ops.vllm.moe_forward(
                hidden_states, router_logits, encode_layer_name()
            )
            return reduce_output(fused_output)[..., :transformed_hidden_dim]
        else:
            # We pass the original tensor for shared experts (not transformed)
            shared_output, fused_output = torch.ops.vllm.moe_forward_shared(
                hidden_states,
                router_logits,
                encode_layer_name(),
                original_hidden_states,
            )
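The merged change instead keeps a single call signature and simply avoids creating the extra reference when no shared experts exist. A torch-free sketch of that conditional (class and method names here are hypothetical, not the actual vLLM code):

```python
class MoERunnerSketch:
    """Toy stand-in for the MoE layer; models only the reference logic."""

    def __init__(self, shared_experts=None):
        self.shared_experts = shared_experts

    def op_inputs(self, hidden_states, original_hidden_states):
        # Only save the unpadded tensor when shared experts consume it;
        # otherwise the RMSNorm output keeps its expected user count and
        # the rmsnorm+pad fusion pattern can still match.
        if self.shared_experts is None:
            return (hidden_states,)
        return (hidden_states, original_hidden_states)


no_shared = MoERunnerSketch()
with_shared = MoERunnerSketch(shared_experts=object())
n_plain = len(no_shared.op_inputs("h_padded", "h_orig"))     # 1 input
n_shared = len(with_shared.op_inputs("h_padded", "h_orig"))  # 2 inputs
```

The design point is that the conditional lives on the Python side, before tracing, so the compiled graph for the no-shared-experts case never records the extra user at all.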

Test Plan

Ran vllm serve openai/gpt-oss-120b --attention-backend ROCM_AITER_UNIFIED_ATTN and checked perf and traces. I'll follow up with expanding our e2e fusion tests on ROCm in a separate PR.

Test Result

Before:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  118.22    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.71      
Output token throughput (tok/s):         2500.21   
Total Token throughput (tok/s):          4992.51

After:

============ Serving Benchmark Result ============
Successful requests:                     320       
Benchmark duration (s):                  117.02    
Total input tokens:                      294646    
Total generated tokens:                  295581    
Request throughput (req/s):              2.73      
Output token throughput (tok/s):         2525.90   
Total Token throughput (tok/s):          5043.80  

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
@mergify mergify bot added rocm Related to AMD ROCm bug Something isn't working labels Feb 16, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 16, 2026

@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a silent regression on ROCm where an RMSNorm+padding fusion was being disabled. The root cause was an unconditional creation of a reference to hidden_states before a transformation, which added an extra user to the tensor and broke the fusion pattern.

The fix is to only create this reference (original_hidden_states) when it's actually needed, i.e., when shared_experts are present. This is done by making the assignment conditional. Additionally, a related condition was refactored for better clarity, changing isinstance(fused_output, tuple) to the more explicit self.shared_experts is not None.

The changes are correct, well-targeted, and effectively resolve the issue. The code is now more robust and the fusion is re-enabled as intended.

@ProExpertProg ProExpertProg enabled auto-merge (squash) February 17, 2026 17:29
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 17, 2026
@ProExpertProg (Collaborator) commented:

Triggered the MoE integration tests manually; please re-enable them if you push again.

@Rohan138 (Contributor, Author) commented:

Failures seem unrelated? cc @ProExpertProg

@DarkLight1337 (Member) commented:

Can you merge from main to fix the CI failures?

@vllm-bot vllm-bot merged commit ded333f into vllm-project:main Feb 21, 2026
56 of 60 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 21, 2026
dosubot bot commented Feb 21, 2026

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.


@Rohan138 Rohan138 deleted the fix_fused_rmsnorm_pad branch February 21, 2026 08:02
Commits referencing this pull request — all cherry-picks of "…ner to fix rmsnorm pad fusion (vllm-project#34636)", Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>:
  • DarkLight1337 pushed a commit to DarkLight1337/vllm — Feb 21, 2026
  • joeqzzuo pushed a commit to joeqzzuo/vllm — Feb 21, 2026 (also Signed-off-by: joezuo <qianzhou.zuo@gmail.com>)
  • yugong333 pushed a commit to yugong333/vllm — Feb 22, 2026
  • jmamou pushed a commit to jmamou/vllm — Feb 23, 2026
  • llsj14 pushed a commit to llsj14/vllm — Mar 1, 2026
  • tunglinwood pushed a commit to tunglinwood/vllm — Mar 4, 2026
  • askliar pushed a commit to askliar/vllm — Mar 9, 2026 (also Signed-off-by: Andrii Skliar <askliar@nvidia.com>)
  • Copilot AI pushed a commit to machov/vllm — Mar 10, 2026

Labels

  • bug — Something isn't working
  • ready — ONLY add when PR is ready to merge/full CI is needed
  • rocm — Related to AMD ROCm

Projects

Status: Done


5 participants