[ROCM] Optimize redundant d2d copy of MoE #38597
Conversation
Code Review
This pull request introduces optimizations to the MoE layers by replacing tensor copies with pointer reassignments or conditional copies. However, the review identifies critical safety concerns: reassigning the output to a transient workspace buffer in modular_kernel.py could lead to data corruption as the memory may be overwritten by subsequent operations, and using .data reassignment in rocm_aiter_fused_moe.py is unsafe for CUDA graph capture and bypasses proper buffer management.
```python
if (
    not self.inplace
    and fused_out.shape == output.shape
    and fused_out.is_contiguous()
):
    output = fused_out
```
Reassigning output = fused_out is highly dangerous because fused_out is a view into the transient workspace memory (allocated via current_workspace_manager). Returning this tensor means the MoE layer's output can be silently overwritten by any subsequent operation that requests workspace buffers, leading to data corruption or non-deterministic results. The copy from the workspace buffer to the persistent output tensor (allocated at line 1350) is mandatory to ensure the result persists correctly throughout the model's execution.
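For intuition, a minimal sketch of the hazard (the `WorkspaceManager` here is a hypothetical stand-in, not vLLM's actual workspace API): any later request for workspace memory can clobber an output that was merely reassigned to a workspace view.

```python
import torch
from math import prod

# Hypothetical stand-in for a workspace manager that serves views of one arena.
class WorkspaceManager:
    def __init__(self, nbytes: int):
        self.arena = torch.empty(nbytes, dtype=torch.uint8, device="cuda")

    def get_buffer(self, shape, dtype):
        # Every request is served from the start of the same arena (simplified).
        n = prod(shape) * dtype.itemsize
        return self.arena[:n].view(dtype).view(shape)

ws = WorkspaceManager(1 << 20)
fused_out = ws.get_buffer((4, 8), torch.float32)
fused_out.fill_(1.0)
output = fused_out                              # pointer reassignment instead of copy_

scratch = ws.get_buffer((4, 8), torch.float32)
scratch.fill_(0.0)                              # a later op requests workspace...
print(output)                                   # ...and silently clobbers "output"
```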
Reusing fused_out for output is dangerous beyond the fact that it might point to the temporary buffers. Doing this basically forces all the finalize methods to operate in-place, which may or may not be supported.
Added a platform.is_rocm() check; it only takes effect in ROCm's finalize.
I think this check should be rocm_aiter_ops.is_fused_moe_enabled(). For other MoE backends, ROCm devices that don't support AITER, cases where AITER fused_moe is explicitly disabled, etc., we still want to keep the default behavior.
Yeah, thanks. It should only apply when AITER is enabled.
```diff
     output_dtype=output.dtype,
 )
-output.copy_(result)
+output.data = result
```
Using output.data = result is unsafe and discouraged. This shallow pointer swap is incompatible with CUDA graphs, as the graph capture records the memory address of the tensor's storage. If result is a new allocation (which it appears to be, as rocm_aiter_fused_experts returns a new tensor), its memory address will change in subsequent iterations, invalidating the captured graph. Additionally, it detaches the tensor from the pre-allocated buffer provided by the modular kernel's workspace management. To safely avoid a copy, the underlying AITER kernel should be modified to accept the destination buffer as an argument and write into it directly.
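For contrast, a sketch of the out-parameter pattern this comment suggests (`fused_experts` here is a hypothetical wrapper, not AITER's real signature): the caller keeps ownership of the buffer, so a captured graph always sees the same address.

```python
import torch

def fused_experts(x: torch.Tensor, out: torch.Tensor | None = None) -> torch.Tensor:
    # Hypothetical wrapper: write into the caller's buffer when provided,
    # so any CUDA graph capturing this call records a stable address.
    if out is None:
        out = torch.empty_like(x)  # fresh allocation: address may change per call
    torch.mul(x, 2.0, out=out)     # stand-in for the real fused-MoE kernel
    return out

output = torch.empty(4, 8, device="cuda")  # persistent buffer owned by the modular kernel
fused_experts(torch.randn(4, 8, device="cuda"), out=output)
```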
AITER will use torch.empty to create a new tensor, and it's compatible with CUDA graphs I think.
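For reference: an allocation inside a captured region can replay correctly because capture-time allocations come from the graph's private memory pool and keep their addresses across replays. A generic illustration following the standard PyTorch capture recipe (not the AITER path itself):

```python
import torch

x = torch.randn(4, 8, device="cuda")

# Warm up on a side stream before capture, per the PyTorch CUDA graph recipe.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    y = torch.empty_like(x)
    torch.mul(x, 2.0, out=y)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = torch.empty_like(x)   # served from the graph's private memory pool
    torch.mul(x, 2.0, out=y)

x.fill_(3.0)
g.replay()                    # y's storage address is unchanged across replays
torch.cuda.synchronize()
print(y)                      # reflects the updated x
```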
@bnellnm Thanks, I added a
@Rohan138 The pre-commit failed because I don't have the 'ready' label, or because I don't have at least 4 merged PRs.
Are you saying it's tailored to work just with AITER MoE?
@gshtras Yeah, others allocate the MoE output from here: https://github.com/vllm-project/vllm/blob/v0.19.0rc0/vllm/model_executor/layers/fused_moe/modular_kernel.py#L1009-L1070 So I think it's tailored for AITER to safely skip this copy.
zyongye left a comment:
LGTM. We can merge this, and I can take a look at the NV side so we can fix it for all backends.
Yeah, thanks, the copy inside
Cherry-pick of vllm-project#38597 onto v0.19.0. Eliminates two device-to-device memory copies in the AITER MoE path:
1. Replace `output.copy_(result)` with `output.data = result` in rocm_aiter_fused_moe.py
2. Skip the copy in TopKWeightAndReduceNoOP when `output` already points to `fused_out`
3. Add a conditional in modular_kernel.py to reuse `fused_out` as `output` when shapes match
Skip unnecessary d2d copies in the MoE path when using AITER:
- modular_kernel: skip the copy when the AITER output is contiguous
- rocm_aiter_fused_moe: use `output.data` assignment instead of `copy_`
- topk_weight_and_reduce: guard `copy_` with a `data_ptr` check (see the sketch below)

Signed-off-by: Tres Popp <tres.popp@amd.com>
Made-with: Cursor
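The third item amounts to a guard of roughly this shape (a sketch, not the exact vLLM code):

```python
import torch

def weight_and_reduce_noop(output: torch.Tensor, fused_out: torch.Tensor) -> torch.Tensor:
    # Skip the d2d copy when output already aliases fused_out
    # (e.g. after the modular-kernel reassignment above).
    if output.data_ptr() != fused_out.data_ptr():
        output.copy_(fused_out)
    return output
```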
How important is removing allocation overhead, which is the goal of pre-creating the workspaces? This is only a concern for non-CUDAGraph, eager-mode executions, where the workspace allocation is a small overhead relative to the rest. The buffer allocation and subsequent call are only used in a single location rather than being a larger interface consideration, so this forces a specific calling convention on libraries, and even constrains how many workspace buffers can be used, just to remove a couple of allocation calls in a less performant mode for certain backends.
Missed this PR but I hit the same redundant copies independently and have a draft at #41020 that does what @robertgshaw2-redhat is asking for. I have an AITER-side change that adds output_buffer_override (ROCm/aiter@da318d0) so it writes directly into the caller's buffer. @benenzhu let me know if it makes sense to combine the two; the AITER-side changes referenced in #41020 should address the review concerns here.
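Conceptually, that AITER-side override lets the caller supply the destination, along these lines (hypothetical sketch; see ROCm/aiter@da318d0 for the real signature):

```python
import torch

def rocm_aiter_fused_experts(hidden_states: torch.Tensor,
                             output_buffer_override: torch.Tensor | None = None) -> torch.Tensor:
    # Sketch: with a caller-provided buffer the kernel writes in place --
    # no torch.empty inside AITER and no d2d copy back in vLLM.
    out = output_buffer_override
    if out is None:
        out = torch.empty_like(hidden_states)
    # ... launch the fused-MoE kernel writing into `out` ...
    return out
```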
Yeah, I will close this for now. The AITER-side change should be better.

Purpose
When running MiniMax M2.5, the MoE kernel's output currently incurs two d2d copies, which can be optimized.

Test Plan
Run MiniMax M2.5 inference on MI355.
Confirm output correctness matches the non-AITER fallback path by comparing model accuracy.
Benchmarking
```bash
export VLLM_ROCM_USE_AITER=1
vllm serve MiniMaxAI/MiniMax-M2.5 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --max-num-batched-tokens 196608 \
    --max-model-len=10240 \
    --max-num-seqs 512 \
    --block-size=32 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --port=30000
```
Before:
After:
e2e TPOT drops by about 3.4%.
Accuracy Testing
main:
PR:
There is no significant difference in accuracy.
cc @gshtras @chunfangamd
prev: #38346