
(alternative to #2508) Fix/splitk tmp_out undersized buffer, avoid double-zeroing #2551

Merged
rbrugaro-amd merged 8 commits into ROCm:main from rbrugaro-amd:fix/splitk-tmpout-and-ck-memset
Mar 31, 2026

Conversation

Contributor

@rbrugaro-amd rbrugaro-amd commented Mar 31, 2026

The C++ side of this crash was fixed in ROCm/rocm-libraries#5225, which corrected the hipMemsetAsync size from arg.M * arg.N to arg.NumTokens * arg.TopK * arg.N. This PR fixes the Python side to match.

Technical Details

Root Cause

In ck_moe_stage1(), the tmp_out buffer was allocated as (token_num, topk, w1.shape[1]) which is undersized when splitK > 1. The CK kernel operates on sorted_size = min(token_num * topk * block_m, sorted_token_ids.shape[0]) rows, so the buffer must be (sorted_size, w1.shape[1]).

For DeepSeek V3 decode (token_num=1, topk=8, block_m=16):

  • Old Python buffer: 1 * 8 * 4096 * 4 = 128 KB
  • CK kernel expects: 128 * 2048 * 4 * 2 = 2 MB
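The size mismatch above can be reproduced with a short sketch. Here `splitk_tmp_out_rows` is a hypothetical helper mirroring the `sorted_size` formula from the root-cause description; `N = 4096` columns, fp32 elements, and a `sorted_token_ids` length of 1024 are illustrative assumptions, not values read from the kernel source.

```python
# Hedged sketch: deriving the split-K tmp_out row count.
def splitk_tmp_out_rows(token_num, topk, block_m, sorted_token_ids_len):
    # The CK kernel operates on up to block_m padded rows per
    # (token, expert) pair, capped by the sorted_token_ids length.
    return min(token_num * topk * block_m, sorted_token_ids_len)

# DeepSeek V3 decode shape from the PR description.
token_num, topk, block_m, N, elem_bytes = 1, 8, 16, 4096, 4

old_rows = token_num * topk  # undersized allocation (ignores block_m)
new_rows = splitk_tmp_out_rows(token_num, topk, block_m, 1024)

print(old_rows * N * elem_bytes)  # 131072 bytes = 128 KB
print(new_rows * N * elem_bytes)  # 2097152 bytes = 2 MB
```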

Fix

  1. ck_moe_stage1: Allocate tmp_out with sorted_size rows using torch.empty (CK kernel zeros the buffer via hipMemsetAsync, avoiding redundant double-zeroing). After the kernel, slice valid_out = tmp_out[:token_num * topk, :] before silu_and_mul/gelu_and_mul.

  2. cktile_moe_stage1: Added warning comment flagging the same undersized buffer pattern. The code is left unchanged since fp32 splitk is not yet active (see existing TODO: support fp32 splitk), but the comment documents the fix to apply when that path is enabled.
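A minimal sketch of fix #1, the steps being allocate `sorted_size` rows uninitialized, let the kernel zero them, then slice the valid region. Plain Python lists stand in for torch tensors, a fill loop stands in for the kernel's `hipMemsetAsync`, and `ck_moe_stage1_fixed` and its arguments are illustrative names rather than the actual aiter signature.

```python
# Hedged sketch of the ck_moe_stage1 buffer fix.
def ck_moe_stage1_fixed(token_num, topk, block_m, N, sorted_token_ids_len):
    sorted_size = min(token_num * topk * block_m, sorted_token_ids_len)
    # torch.empty analogue: uninitialized rows, no Python-side zeroing.
    tmp_out = [[None] * N for _ in range(sorted_size)]
    # The CK kernel zeroes the whole buffer (hipMemsetAsync when
    # KBatch > 1), then accumulates partial results into it.
    for row in tmp_out:
        for j in range(N):
            row[j] = 0.0
    # Only the first token_num * topk rows hold valid activations;
    # slice them before silu_and_mul/gelu_and_mul.
    valid_out = tmp_out[: token_num * topk]
    return tmp_out, valid_out

tmp_out, valid_out = ck_moe_stage1_fixed(1, 8, 16, 4, 1024)
print(len(tmp_out), len(valid_out))  # 128 8
```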

Why torch.empty instead of torch.zeros

With ROCm/rocm-libraries#5225 merged, the CK kernel correctly zeros the buffer via hipMemsetAsync when KBatch > 1. Using torch.empty avoids double-zeroing (once by Python, once by CK), eliminating a redundant GPU kernel launch.
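The redundancy can be made concrete by counting zeroing passes. In this sketch `host_zero` stands in for the fill done by `torch.zeros` and `kernel_memset` for CK's `hipMemsetAsync` when `KBatch > 1`; both names are illustrative, not aiter/CK API.

```python
# Hedged sketch: torch.zeros + kernel memset does two full passes
# over the buffer; torch.empty + kernel memset does one.
passes = []

def host_zero(buf):          # torch.zeros analogue
    passes.append("host")
    for i in range(len(buf)):
        buf[i] = 0.0

def kernel_memset(buf):      # hipMemsetAsync analogue (KBatch > 1)
    passes.append("kernel")
    for i in range(len(buf)):
        buf[i] = 0.0

old_buf = [None] * 8         # old path: torch.zeros, then kernel zeroes again
host_zero(old_buf)
kernel_memset(old_buf)

new_buf = [None] * 8         # new path: torch.empty, kernel zeroes once
kernel_memset(new_buf)

print(passes)  # ['host', 'kernel', 'kernel']
```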

Test Result

DeepSeek-R1-0528 (FP8, 8xMI355X): runtime error resolved, splitK active with correct results.


@rbrugaro-amd rbrugaro-amd requested a review from a team March 31, 2026 07:56
@rbrugaro-amd rbrugaro-amd changed the title Fix/splitk tmpout and ck memset (alternative to #2508) Fix/splitk tmp_out undersized buffer Mar 31, 2026
@rbrugaro-amd rbrugaro-amd changed the title (alternative to #2508) Fix/splitk tmp_out undersized buffer (alternative to #2508) Fix/splitk tmp_out undersized buffer avoid double-zeroing Mar 31, 2026
@rbrugaro-amd rbrugaro-amd merged commit e47cc0e into ROCm:main Mar 31, 2026
38 of 39 checks passed
daydayup-lh pushed a commit that referenced this pull request Apr 1, 2026
…ble-zeroing (#2551)

* fix: align ck_moe_stage1 split-K tmp_out buffer with CK kernel

* Update fused_moe.py

* tmp_out to use torch.empty vs. torch.zeros to avoid double zeroing

Signed-off-by: rbrugaro <rita.brugarolasbrufau@amd.com>

* tighten valid_out slice: drop redundant .contiguous() 

Signed-off-by: rbrugaro <rita.brugarolasbrufau@amd.com>

* restore .view(dtypes.fp32) on valid_out for silu_and_mul/gelu_and_mul

---------

Signed-off-by: rbrugaro <rita.brugarolasbrufau@amd.com>
Co-authored-by: Karan Verma <karan.verma@amd.com>
LJ-underdog pushed a commit that referenced this pull request Apr 24, 2026
Verified via canary tests + tp=2 inference that the three +2 row
padding fixes from commit 68fc7d48b are not needed for the BF16
no-quant path on gfx950. CK kernels skip the entire block when
expert_id is the sentinel (=E), so sorted_ids sentinel (K<<24)|T
never triggers OOB scatter to a2[T*K+K] / moe_out[M].

Reverted three locations:
- L339-345: drop moe_out_padded, pass moe_buf directly
- L1262-1264: zeros((token_num+2, ...)) -> empty((token_num, ...))
- L1349: a2.view(token_num+2, ...) -> a2.view(token_num, ...)

Verification:
- /tmp/test_moe_canary.py: a2[T*K+K] pristine after stage1
- /tmp/test_moe_canary_stage2.py: moe_out[M] pristine after stage2
- tp=2 Step-3.5-Flash inference: 4 prompts complete normally,
  no NaN, no crash, latency 1.97s/req
- Required fixes still in place: V1->V3 force (block_m=128) and
  shuffle_weight() preprocessing

Note: PR #2551 +2 padding is only required for split-K + per_1x128
quant path, which is a different code branch.
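The canary check described in this commit can be sketched as follows. The scatter loop is a stand-in for the CK stage1 kernel, `E` is the expert count reused as the sentinel `expert_id`, and the buffer layout and canary value are illustrative; only the skip-on-sentinel behavior is taken from the commit message.

```python
# Hedged sketch of the canary test: plant a sentinel pattern in the
# row just past the valid region, run the simulated scatter, and
# verify the canary row was never written.
def scatter_stage1(a2, assignments, num_experts):
    # assignments: (row_index, expert_id) pairs; the kernel skips the
    # whole block when expert_id is the sentinel (=E), so padded
    # entries never scatter out of bounds.
    for row, expert in assignments:
        if expert == num_experts:  # sentinel block: skipped entirely
            continue
        a2[row] = 1.0              # stand-in for the real accumulation

T, K, E = 1, 8, 4
CANARY = 12345.0
a2 = [0.0] * (T * K) + [CANARY]    # one extra canary row past the end

# Valid rows map to real experts; the padded entry points past the
# valid region but carries the sentinel expert_id.
assignments = [(i, i % E) for i in range(T * K)] + [(T * K, E)]
scatter_stage1(a2, assignments, E)

assert a2[T * K] == CANARY         # canary pristine: no OOB scatter
```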