Skip to content

[CK] fix moe memset size which is bigger than alloc#5225

Merged
lalala-sh merged 4 commits intodevelopfrom
wjx/fix_moe_memset_size
Mar 16, 2026
Merged

[CK] fix moe memset size which is bigger than alloc#5225
lalala-sh merged 4 commits intodevelopfrom
wjx/fix_moe_memset_size

Conversation

@lalala-sh
Copy link
Copy Markdown
Contributor

@lalala-sh lalala-sh commented Mar 9, 2026

Motivation

Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M * arg.N as the byte count. However, arg.M is the padded sorted-token-id length from MOE routing, which can be much larger than the actual output allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write beyond the buffer, and the silently-swallowed HIP error propagates to the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset matches the actual output size.

Technical Details

Test Plan

Test Result

Submission Checklist

@lalala-sh lalala-sh requested review from DDEle and asleepzzz March 13, 2026 06:37
Copy link
Copy Markdown
Contributor

@asleepzzz asleepzzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lalala-sh lalala-sh merged commit 880166b into develop Mar 16, 2026
30 checks passed
@lalala-sh lalala-sh deleted the wjx/fix_moe_memset_size branch March 16, 2026 09:30
assistant-librarian Bot pushed a commit to ROCm/composable_kernel that referenced this pull request Mar 16, 2026
[CK] fix moe memset size which is bigger than alloc

## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
jovanau pushed a commit to jovanau/rocm-libraries that referenced this pull request Mar 19, 2026
## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
johannes-graner pushed a commit that referenced this pull request Mar 20, 2026
## Motivation
Fix an out-of-bounds hipMemsetAsync in DeviceMoeGemmBlockScale that
crashes split-K MOE GEMM with "HIP runtime error: invalid argument".
When KBatch > 1, the invoker zeroes the output buffer using arg.M *
arg.N as the byte count. However, arg.M is the padded sorted-token-id
length from MOE routing, which can be much larger than the actual output
allocation (NumTokens * TopK * N). This causes hipMemsetAsync to write
beyond the buffer, and the silently-swallowed HIP error propagates to
the subsequent kernel launch via hipGetLastError().
This patch replaces arg.M with arg.NumTokens * arg.TopK so the memset
matches the actual output size.

## Technical Details

<!-- Explain the changes along with any relevant GitHub links. -->

## Test Plan

<!-- Explain any relevant testing done to verify this PR. -->

## Test Result

<!-- Briefly summarize test outcomes. -->

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants