
[ROCm] Optimize redundant d2d copy of MoE #38597

Closed
benenzhu wants to merge 4 commits into vllm-project:main from benenzhu:amd/opt_moe

Conversation

@benenzhu
Contributor

@benenzhu benenzhu commented Mar 31, 2026

Purpose

When running MiniMax M2.5, the MoE kernel's output currently incurs two d2d copies, which can be optimized away.

Test Plan

Run MiniMax M2.5 inference on MI355.
Confirm output correctness matches the non-AITER fallback path by comparing model accuracy.

Benchmarking

export VLLM_ROCM_USE_AITER=1
vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-batched-tokens 196608 \
    --max-model-len=10240 \
    --max-num-seqs 512 \
    --block-size=32 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --port=30000
vllm bench serve \
  --model MiniMaxAI/MiniMax-M2.5 \
  --dataset-name sharegpt \
  --dataset-path /A/datasets/ShareGPT_V3_unfiltered_cleaned_split.json \
  --sharegpt-output-len 300 \
  --port 30000 \
  --max-concurrency 8 \
  --num-prompts 1000 \
  --num-warmups 50 \
  --ignore-eos \
  --temperature 0

Before:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  442.59    
Total input tokens:                      225133    
Total generated tokens:                  300000    
Request throughput (req/s):              2.26      
Output token throughput (tok/s):         677.83    
Peak output token throughput (tok/s):    704.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1186.49   
---------------Time to First Token----------------
Mean TTFT (ms):                          57.49     
Median TTFT (ms):                        50.00     
P99 TTFT (ms):                           106.37    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.65     
Median TPOT (ms):                        11.65     
P99 TPOT (ms):                           11.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.65     
Median ITL (ms):                         11.54     
P99 ITL (ms):                            13.49     
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  427.85    
Total input tokens:                      225133    
Total generated tokens:                  300000    
Request throughput (req/s):              2.34      
Output token throughput (tok/s):         701.18    
Peak output token throughput (tok/s):    736.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1227.38   
---------------Time to First Token----------------
Mean TTFT (ms):                          57.12     
Median TTFT (ms):                        47.68     
P99 TTFT (ms):                           107.23    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.25     
Median TPOT (ms):                        11.26     
P99 TPOT (ms):                           11.54     
---------------Inter-token Latency----------------
Mean ITL (ms):                           11.25     
Median ITL (ms):                         11.12     
P99 ITL (ms):                            14.14     
==================================================

End-to-end TPOT drops by about 3.4%.

Accuracy Testing

python3 -m lm_eval --model local-completions \
  --model_args model=MiniMaxAI/MiniMax-M2.5,base_url=http://127.0.0.1:30000/v1/completions,num_concurrent=64 \
  --tasks gsm8k

main:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9234|±  |0.0073|
|     |       |strict-match    |     5|exact_match|↑  |0.9212|±  |0.0074|

PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9318|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.9280|±  |0.0071|

There is no significant difference in accuracy.

cc @gshtras @chunfangamd
prev: #38346


@mergify mergify Bot added the rocm Related to AMD ROCm label Mar 31, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 31, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces optimizations to the MoE layers by replacing tensor copies with pointer reassignments or conditional copies. However, the review identifies critical safety concerns: reassigning the output to a transient workspace buffer in modular_kernel.py could lead to data corruption as the memory may be overwritten by subsequent operations, and using .data reassignment in rocm_aiter_fused_moe.py is unsafe for CUDA graph capture and bypasses proper buffer management.

Comment on lines +1380 to +1385
if (
not self.inplace
and fused_out.shape == output.shape
and fused_out.is_contiguous()
):
output = fused_out
Contributor

critical

Reassigning output = fused_out is highly dangerous because fused_out is a view into the transient workspace memory (allocated via current_workspace_manager). Returning this tensor means the MoE layer's output can be silently overwritten by any subsequent operation that requests workspace buffers, leading to data corruption or non-deterministic results. The copy from the workspace buffer to the persistent output tensor (allocated at line 1350) is mandatory to ensure the result persists correctly throughout the model's execution.

Collaborator

Reusing fused_out for output is dangerous beyond the fact that it might point to the temporary buffers. Doing this basically forces all the finalize methods to operate inplace which may or may not be supported.

Contributor Author

Added a platform.is_rocm() check, so it only takes effect in ROCm's finalize.

Contributor

@Rohan138 Rohan138 Apr 1, 2026

I think this check should be rocm_aiter_ops.is_fused_moe_enabled(). For other MOE backends, ROCm devices that don't support AITER, AITER fused_moe explicitly disabled, etc. we still want to keep the default behavior.

Contributor Author

Yeah, thanks. It should only take effect when AITER is enabled.

output_dtype=output.dtype,
)
output.copy_(result)
output.data = result
Contributor

high

Using output.data = result is unsafe and discouraged. This shallow pointer swap is incompatible with CUDA graphs, as the graph capture records the memory address of the tensor's storage. If result is a new allocation (which it appears to be, as rocm_aiter_fused_experts returns a new tensor), its memory address will change in subsequent iterations, invalidating the captured graph. Additionally, it detaches the tensor from the pre-allocated buffer provided by the modular kernel's workspace management. To safely avoid a copy, the underlying AITER kernel should be modified to accept the destination buffer as an argument and write into it directly.

Contributor Author

AITER uses torch.empty to create a new tensor, and I think it's compatible with CUDA graphs.

@Rohan138
Contributor

Rohan138 commented Mar 31, 2026

@benenzhu can you please run pre-commit? We'd like to merge this and get it cherry-picked into 0.19.0 to fix ROCm regressions on gpt-oss and deepseek, cc @zyongye

@bnellnm
Collaborator

bnellnm commented Mar 31, 2026

@benenzhu can you please run pre-commit? We'd like to merge this and get it cherry-picked into 0.19.0 to fix ROCm regressions on gpt-oss and deepseek, cc @zyongye

This will likely break things if it gets merged.

@benenzhu
Contributor Author

benenzhu commented Apr 1, 2026

@benenzhu can you please run pre-commit? We'd like to merge this and get it cherry-picked into 0.19.0 to fix ROCm regressions on gpt-oss and deepseek, cc @zyongye

This will likely break things if it gets merged.

@bnellnm Thanks, I added a platform.is_rocm() check, so only ROCm's AITER MoE takes this path, and it always uses torch.empty() as the output buffer.
I also check that the shapes are equal, so it should only apply to TopKWeightAndReduceNoOp; the other ops will have different shapes.

@benenzhu
Contributor Author

benenzhu commented Apr 1, 2026

@Rohan138 The pre-commit check failed because I don't have the 'ready' label or at least 4 merged PRs.

@gshtras
Collaborator

gshtras commented Apr 1, 2026

@benenzhu can you please run pre-commit? We'd like to merge this and get it cherry-picked into 0.19.0 to fix ROCm regressions on gpt-oss and deepseek, cc @zyongye

This will likely break things if it gets merged.

@bnellnm Thanks, I added a platform.is_rocm() for it. So Rocm uses aiter's moe. And it will always use torch.empty() as the output buffer. Also I checked the shape is equal, so it should only be the TopKWeightAndReduceNoOp, other ops will have different shapes.

Are you saying it's tailored to work just with AITER MoE?

@benenzhu
Contributor Author

benenzhu commented Apr 1, 2026

@benenzhu can you please run pre-commit? We'd like to merge this and get it cherry-picked into 0.19.0 to fix ROCm regressions on gpt-oss and deepseek, cc @zyongye

This will likely break things if it gets merged.

@bnellnm Thanks, I added a platform.is_rocm() for it. So Rocm uses aiter's moe. And it will always use torch.empty() as the output buffer. Also I checked the shape is equal, so it should only be the TopKWeightAndReduceNoOp, other ops will have different shapes.

Are you saying it's tailored to work just with AITER MoE?

@gshtras Yeah, the other backends allocate the MoE output from here: https://github.com/vllm-project/vllm/blob/v0.19.0rc0/vllm/model_executor/layers/fused_moe/modular_kernel.py#L1009-L1070
AITER's MoE doesn't use this one; it creates the output inside the kernel with torch.empty(), and we point the output at AITER's tensor with output.data = result: https://github.com/vllm-project/vllm/blob/v0.19.0rc0/vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py#L425-L439

So I think it's tailored for AITER to safely skip this copy.

Member

@zyongye zyongye left a comment


LGTM. We can merge this, and I can take a look at the NV side so we can fix it for all backends.

@zyongye zyongye added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 1, 2026
benenzhu added 3 commits April 1, 2026 16:18
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
Signed-off-by: zhutaoyu <zhutaoyu97@gmail.com>
Collaborator

@robertgshaw2-redhat robertgshaw2-redhat left a comment


why cant we update aiter to optionally accept an output buffer like every other kernel we have?

then aiter would fit into the structure that we have and we could avoid these confusing, bug-prone edge cases

@benenzhu
Contributor Author

benenzhu commented Apr 3, 2026

why cant we update aiter to optionally accept an output buffer like every other kernel we have?

then aiter would fit into the structure that we have and we could avoid these confusing, bug-prone edge cases

Yeah, thanks, the copy inside vllm/model_executor/layers/fused_moe/rocm_aiter_fused_moe.py can be moved into AITER.
But the second one, in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py, can only be changed in vLLM, I think.

nholmber added a commit to nholmber/vllm that referenced this pull request Apr 5, 2026
Cherry-pick of vllm-project#38597 onto v0.19.0.
Eliminates two device-to-device memory copies in the AITER MoE path:
1. Replace output.copy_(result) with output.data = result in rocm_aiter_fused_moe.py
2. Skip copy in TopKWeightAndReduceNoOP when output already points to fused_out
3. Add conditional in modular_kernel.py to reuse fused_out as output when shapes match
tpopp added a commit to amdsiloai/vllm that referenced this pull request Apr 9, 2026
Skip unnecessary d2d copies in the MoE path when using AITER:
- modular_kernel: skip copy when AITER output is contiguous
- rocm_aiter_fused_moe: use output.data assignment instead of copy_
- topk_weight_and_reduce: guard copy_ with data_ptr check

Signed-off-by: Tres Popp <tres.popp@amd.com>
Made-with: Cursor
@tpopp
Contributor

tpopp commented Apr 21, 2026

How important is removing allocation overhead, which is the goal of pre-creating the workspaces? This is only a concern for non-CUDA-graph, eager-mode executions, where workspace allocation is a small overhead relative to the rest. The buffer allocation and subsequent call are only used in a single location rather than being a larger interface consideration, so this forces a specific calling convention on libraries (and even constrains how many workspace buffers can be used) just to remove a couple of allocation calls in a less performant mode for certain backends.

@frida-andersson
Contributor

Missed this PR, but I hit the same redundant copies independently and have a draft at #41020 that does what @robertgshaw2-redhat is asking for. I have an AITER-side change that adds output_buffer_override (ROCm/aiter@da318d0) so it writes directly into the caller's buffer. @benenzhu, let me know if it makes sense to combine the two; the AITER-side changes referenced in #41020 should address the review concerns here.

@benenzhu
Contributor Author

Missed this PR but I hit the same redundant copies independently and have a draft at #41020 that does what @robertgshaw2-redhat is asking for. I have an AITER-side change that adds output_buffer_override (ROCm/aiter@da318d0) so it writes directly into the caller's buffer. @benenzhu let me know if it makes sense to combine the two, the AITER-side changes referenced in #41020 should address the review concerns here

Yeah, I will close this for now. The AITER-side change should be better.


Labels

ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

Projects

Status: Done


8 participants