fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_… by Liang-jianhao97 · Pull Request #2482 · ROCm/aiter

Liang-jianhao97 · 2026-03-26T08:07:03Z

Motivation

fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_a16w16

Technical Details

The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel
recompilation when M/N change, but it also removes tt.divisibility=16 from
M/N. Without that hint, AxisInfo loses contiguity through RemOp and CmpOp,
which degrades buffer_store from vectorized dwordx2 (vec=4) to scalar short
(vec=1).
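To make the contiguity argument concrete, here is a small pure-Python model. It is not Triton's AxisInfo code, and the helper name `groups_stay_contiguous` is hypothetical. It only illustrates the arithmetic: when offsets are wrapped with `% n` and nothing is known about the divisibility of `n`, the wrap point can fall in the middle of a group of 4 adjacent offsets, so a compiler cannot assume each group is one contiguous run.

```python
def groups_stay_contiguous(start: int, block: int, n: int, vec: int) -> bool:
    """Return True if every aligned group of `vec` wrapped offsets is a contiguous run."""
    # Offsets of one tile after the modulo wrap, as in `offs % n`.
    offs = [(start + i) % n for i in range(block)]
    for g in range(0, block, vec):
        group = offs[g:g + vec]
        # A group is vectorizable only if consecutive offsets differ by exactly 1.
        if any(b - a != 1 for a, b in zip(group, group[1:])):
            return False
    return True


if __name__ == "__main__":
    # n divisible by vec: the wrap lands on a group boundary.
    print(groups_stay_contiguous(1008, 32, 1024, 4))  # True
    # n not divisible by vec: the wrap splits a group (1020, 1021, 0, 1).
    print(groups_stay_contiguous(1008, 32, 1022, 4))  # False
```

With a divisibility hint on `n` the first case can be proven for every tile; without it, the second case has to be assumed possible, which is consistent with the fallback to vec=1 described above.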

Introduce an EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and
N%BLOCK_SIZE_N==0 and passes the result to the kernel as a compile-time
constant. When it is true, skip the modulo wrap on offs_am/offs_bn and use an
unmasked tl.store, restoring contiguity for vectorized memory operations
without sacrificing the recompilation benefit.
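The following is a minimal sketch of the idea in plain Python, assuming one program per BLOCK_SIZE tile along each dimension. The names `even_mn` and `tile_offsets` are illustrative and are not taken from the kernel. In Triton a flag like this is typically supplied through the `triton.heuristics` decorator and declared as `tl.constexpr`; the exact wiring used in this PR is not reproduced here.

```python
def even_mn(M: int, N: int, BLOCK_SIZE_M: int, BLOCK_SIZE_N: int) -> bool:
    """Host-side predicate: True when both dimensions tile exactly."""
    return M % BLOCK_SIZE_M == 0 and N % BLOCK_SIZE_N == 0


def tile_offsets(pid: int, size: int, block: int, even: bool) -> list[int]:
    """Offsets touched by program `pid` along one dimension."""
    offs = [pid * block + i for i in range(block)]
    if even:
        return offs                   # fast path: no modulo wrap
    return [o % size for o in offs]   # general path: wrap out-of-range offsets


if __name__ == "__main__":
    M, BLOCK_SIZE_M = 256, 64
    # For an evenly tiled dimension the wrap is a no-op on every tile,
    # so skipping it does not change which elements are addressed.
    same = all(
        tile_offsets(p, M, BLOCK_SIZE_M, True) == tile_offsets(p, M, BLOCK_SIZE_M, False)
        for p in range(M // BLOCK_SIZE_M)
    )
    print(even_mn(256, 512, 64, 128), same)
```

Because the flag takes only two values, a kernel specialized on it would need at most two variants for this parameter, rather than one per distinct M/N.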

Test Plan

bench_gemm_a16w16.py

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…a16w16 The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel recompilation when M/N change, but removes tt.divisibility=16 from M/N, causing AxisInfo to lose contiguity through RemOp and CmpOp, which degrades buffer_store from vectorized dwordx2 (vec=4) to scalar short (vec=1). Introduce EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0 at compile time. When true, skip the modulo wrap on offs_am/offs_bn and use unmasked tl.store, restoring contiguity for vectorized memory operations without sacrificing the recompilation benefit. Made-with: Cursor

github-actions · 2026-03-26T08:07:22Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-355` | Run Triton tests on MI355 in addition to MI325 |
| `ci:sglang` | SGLang integration tests |
| `ci:atom` | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| `ci:vllm` | vLLM benchmark |
| `ci:all` | All of the above |

Add labels via the sidebar or `gh pr edit 2482 --add-label <label>`

Same fix as gemm_a16w16: the do_not_specialize=["M","N"] removes tt.divisibility=16, breaking vectorized store/load. Add EVEN_MN constexpr heuristic to conditionally skip modulo and mask when M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0, restoring contiguity. Made-with: Cursor
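As a rough illustration of why the mask can be dropped when both dimensions tile exactly, the sketch below builds the per-tile bounds mask in plain Python and checks whether it is ever false. The batch dimension is omitted, and `store_mask` and `mask_is_redundant` are hypothetical helper names, not code from the kernel.

```python
def store_mask(pid_m, pid_n, M, N, BLOCK_SIZE_M, BLOCK_SIZE_N):
    """Bounds mask for one output tile: True where (row, col) lies inside the M x N matrix."""
    rows = range(pid_m * BLOCK_SIZE_M, (pid_m + 1) * BLOCK_SIZE_M)
    cols = range(pid_n * BLOCK_SIZE_N, (pid_n + 1) * BLOCK_SIZE_N)
    return [[r < M and c < N for c in cols] for r in rows]


def mask_is_redundant(M, N, BLOCK_SIZE_M, BLOCK_SIZE_N):
    """True if the mask is all-True on every tile of the launch grid."""
    tiles_m = -(-M // BLOCK_SIZE_M)  # ceiling division
    tiles_n = -(-N // BLOCK_SIZE_N)
    return all(
        all(all(row) for row in store_mask(pm, pn, M, N, BLOCK_SIZE_M, BLOCK_SIZE_N))
        for pm in range(tiles_m)
        for pn in range(tiles_n)
    )


if __name__ == "__main__":
    print(mask_is_redundant(128, 64, 32, 32))  # True: both dimensions tile exactly
    print(mask_is_redundant(100, 64, 32, 32))  # False: the last row tile is partial
```

When the mask is all-True on every tile it carries no information, so an unmasked load or store addresses exactly the same elements.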

Dewei-Wang-sh

Overall, LGTM.
A further question: what if M/N are not aligned to 8/16? Can we do better in that case, for example by splitting the loop into an aligned loop and a tail loop?
Also, would "do_not_specialize_on_alignment" help?
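One possible reading of the aligned-loop-plus-tail suggestion, sketched in plain Python under the assumption that the aligned part can use fixed-width vector accesses while the remainder falls back to scalar accesses. The helpers `split_aligned` and `copy_with_tail` are hypothetical; whether this split is practical for this kernel is not settled in this thread.

```python
def split_aligned(n: int, vec: int) -> tuple[int, int]:
    """Split a length into an aligned prefix (multiple of `vec`) and a tail."""
    aligned = (n // vec) * vec
    return aligned, n - aligned


def copy_with_tail(src: list, vec: int) -> list:
    """Copy `src` using `vec`-wide chunks for the aligned part and single elements for the tail."""
    out = []
    aligned, tail = split_aligned(len(src), vec)
    for i in range(0, aligned, vec):
        out.extend(src[i:i + vec])   # stands in for one vectorized store
    for i in range(aligned, aligned + tail):
        out.append(src[i])           # stands in for one scalar store
    return out


if __name__ == "__main__":
    print(split_aligned(37, 8))  # (32, 5)
    print(copy_with_tail(list(range(37)), 8) == list(range(37)))  # True
```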

* add fused_qknorm hip kernel (#2442)
  * base impl
  * improve big M
  * optimize v2
  * fallback split norm for bigger M
  * fix rebase failed
  * code format
  * fix include error
  * support outplace works for torch compile
  * add output args

  ---------

  Co-authored-by: Guanbao Yu <gyu@amd.com>
* feat: add fast gelu (#2220)
  * add fast gelu
  * fix ut
  * refactor
  * add more log
  * change to empty to avoid perf downgrade
  * make black lint happy
  * make ruff happy
  * fix import error
* Fix build CK pipeline (#2399)
  * Fix renamed CK pipeline
  * Fix CK version
  * Rename eight_warps to eight_waves for consistency

  ---------

  Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* [OPUS] gfx1250 support for opus wmma scale and moe_sorting kernel (#2449)
  * [OPUS] Add gfx1250 support for opus moe_sorting kernel

    Add gfx1250 (MI450) to the GPU architecture maps and fix the opus moe_sorting kernel to use the correct wait counter instruction on gfx12 ISA (s_wait_dscnt instead of s_waitcnt_lgkmcnt).
  * [OPUS] Add WMMA scale instruction support for gfx1250

    Add scaled WMMA dispatch (BX32 int / BX16 long) to struct wmma for:
    - wmma_scale[16]_f32_16x16x128_f8f6f4 (fp8 fmt=0, fp4 fmt=4)
    - wmma_scale[16]_f32_32x16x128_f4 (dedicated fp4)

    Key changes in opus.hpp:
    - Add fmt_a/fmt_b format codes and per-lane E8M0 scale operator() overloads with compile-time scale_sel (OPSEL) for lane group selection
    - Fix vtype for packed fp4: use reg_bytes_a/b to compute correct hardware register size (elem * bits / 8 bytes, not elem * sizeof)
    - Add zero-pad helper for fp4-via-f8f6f4 (i32x8 -> i32x16)
    - Add scaled overloads to wmma_adaptor_swap_ab

    Device tests (op_tests/opus/device/test_wmma_scale.cu):
    - Raw warp-level: 6 variants (fp8/fp4 x BX32/BX16 + 32x16 fp4)
    - Tiled MMA: 1x1, 2x2, 4x1 wave configurations via make_tiled_mma
    - Per-lane random scale: E8M0 exponents in [122..133], bitwise exact
    - All tests PASS on gfx1250, SKIP on gfx942/gfx950
  * [OPUS] Fix black/ruff lint: rename ambiguous var `l` to `lane`
* reduce wasted get_module overhead for module with custom module name (#2455)
* fix moe gemm tuned config (#2463)
  * fix
  * add e=256 k=8 tuned config
  * revert t=1024,2048 back to ck
  * update
  * update
* Skip `test_metadata_redirect.py` on archs other than `gfx942` (#2456)

  This test was designed to run only on `gfx942`. We want to run all Triton tests on `gfx950` as well - this is the reason behind the proposed changes.
* [Triton] Fix bench_mha (#2317)
  * fix bench_mha
  * fix bench_mha parsing logic of bench_attn_models
  * fix bench_mla_decode parsing logic of bench_attn_models

  ---------

  Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
* test_mla_persistent.py split kv reference fix max_seq_q != 1 error (#2363)
* CI: fix dubious ownership for sglang checkout (#2477)
  * CI: mark sglang checkout paths as safe directories

    Add safe.directory entries before dependency installation for both host and container checkout paths to avoid git dubious ownership failures on /sglang-checkout.
  * CI: align sglang safe.directory fix with upstream

    Move safe.directory configuration to run right after container startup and only mark /sglang-checkout, matching upstream sglang handling for dubious ownership.
* CI: use pip editable install and safe.directory in runtime CI (#2474)
* Fix CKTile blockscale GEMM to read strides from tensor metadata (#2466)

  The CKTile blockscale GEMM wrapper hardcoded leading-dimension strides as stride_A = K, stride_B = K, stride_C = N, assuming fully contiguous row-major layout. This produced silently wrong results when input tensors had non-standard strides.

  In vLLM on ROCm, _maybe_pad_fp8_weight pads FP8 weight tensors for alignment and then creates a narrowed view, producing tensors whose logical shape is [N, K] but whose physical stride is [K+pad, 1]. The hardcoded stride_B = K caused the kernel to read from wrong memory offsets, leading to garbage output.

  Fix: read leading-dimension strides from the PyTorch tensor metadata (XQ.stride(0), WQ.stride(0), etc.) instead of assuming dense layout. Add TORCH_CHECK assertions to verify inner-dimension contiguity (stride == 1), which is required by the CKTile kernel.

  The old CK backend (gemm_a8w8_blockscale_common.cuh) already reads strides from tensor metadata, which is why it was unaffected.

  Made-with: Cursor
* Add LSE output support for MLA decode qseqlen=1 persistent kernel (gf… (#2440)
  * Add LSE output support for MLA decode qseqlen=1 persistent kernel (gfx950)
  * Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16
  * Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16

  ---------

  Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>
* tuned qwen3.5 gemm (#2485)

  Signed-off-by: Guanbao Yu <gyu@amd.com>
  Co-authored-by: Guanbao Yu <gyu@amd.com>
* [Triton] Flash Attention Triton Windows build support (#2433)
  * Initial FA-2 Triton Windows build support
  * Continue Work lint minimize diff windows smoke test address copilot fixes remove is_windows error improve windows message

  ---------

  Co-authored-by: 0xDELUXA <djernovevo@gmail.com>
* fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm(#2482)
  * fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_a16w16

    The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel recompilation when M/N change, but removes tt.divisibility=16 from M/N, causing AxisInfo to lose contiguity through RemOp and CmpOp, which degrades buffer_store from vectorized dwordx2 (vec=4) to scalar short (vec=1).

    Introduce EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0 at compile time. When true, skip the modulo wrap on offs_am/offs_bn and use unmasked tl.store, restoring contiguity for vectorized memory operations without sacrificing the recompilation benefit.

    Made-with: Cursor
  * fix(gemm): add EVEN_MN heuristic to batched_gemm_a8w8 kernel

    Same fix as gemm_a16w16: the do_not_specialize=["M","N"] removes tt.divisibility=16, breaking vectorized store/load. Add EVEN_MN constexpr heuristic to conditionally skip modulo and mask when M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0, restoring contiguity.

    Made-with: Cursor

  ---------

  Co-authored-by: jianlian <jianlian@amd.com>
* rm gemm_common bind (#2425)
  * rm gemm_commona= and quant type bind
  * retune failed shape in a8w8_bpreshuffle_tuned_gemm.csv
  * recover enum
* assert when found duplicated tuned shape (#2376)
  * assert when found duplicated tuned shape
  * rm duplicated tuned shape and update tuned file in model_configs
  * fix lint

---------

Signed-off-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: ChenYou <youchen@amd.com>
Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Elton <zhimding@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: fangche123 <Fang.Che@amd.com>
Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>
Co-authored-by: Pleaplusone <ygan@amd.com>
Co-authored-by: 0xDELUXA <djernovevo@gmail.com>
Co-authored-by: jianhao <Jianhao.Liang@amd.com>
Co-authored-by: jianlian <jianlian@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>

…in gemm(ROCm#2482)"

This reverts commit 61804c6.

This is an experiment to double check the following failures in `op_tests/triton_tests/gemm/basic/test_gemm_a16w16.py`:

* test_gemm_a16_w16[True-dtype0-1024-1024-1024]
* test_gemm_a16_w16[True-dtype0-2048-2048-2048]
* test_gemm_a16_w16[True-dtype0-3072-3072-3072]
* test_gemm_a16_w16[True-dtype0-4096-4096-4096]
* test_gemm_a16_w16[True-dtype0-5120-5120-5120]
* test_gemm_a16_w16[True-dtype0-6144-6144-6144]
* test_gemm_a16_w16[True-dtype0-7168-7168-7168]
* test_gemm_a16_w16[True-dtype0-8192-8192-8192]
* test_gemm_a16_w16[True-dtype0-4864-4096-8192]
* test_gemm_a16_w16[True-dtype0-9728-8192-65536]
* test_gemm_a16_w16[True-dtype0-4864-8192-4160]
* test_gemm_a16_w16[True-dtype0-64-1280-8192]
* test_gemm_a16_w16[True-dtype0-128-1280-8192]
* test_gemm_a16_w16[True-dtype0-192-1280-8192]
* test_gemm_a16_w16[True-dtype0-256-1280-8192]
* test_gemm_a16_w16[True-dtype0-320-1280-8192]
* test_gemm_a16_w16[True-dtype0-512-1280-8192]
* test_gemm_a16_w16[True-dtype0-1024-1280-8192]
* test_gemm_a16_w16[True-dtype0-2048-1280-8192]
* test_gemm_a16_w16[True-dtype0-4096-1280-8192]
* test_gemm_a16_w16[True-dtype0-8192-1280-8192]
* test_gemm_a16_w16[True-dtype0-16384-1280-8192]
* test_gemm_a16_w16[True-dtype0-1-8192-1024]
* test_gemm_a16_w16[True-dtype0-512-8192-1024]
* test_gemm_a16_w16[True-dtype0-1024-8192-1024]
* test_gemm_a16_w16[True-dtype0-2048-8192-1024]
* test_gemm_a16_w16[True-dtype0-4096-8192-1024]
* test_gemm_a16_w16[True-dtype0-8192-8192-1024]
* test_gemm_a16_w16[True-dtype0-16384-8192-1024]

Liang-jianhao97 requested a review from a team March 26, 2026 08:07

Dewei-Wang-sh approved these changes Mar 26, 2026

View reviewed changes

valarLip approved these changes Mar 26, 2026

View reviewed changes

Liang-jianhao97 merged commit 61804c6 into main Mar 27, 2026
38 checks passed

Liang-jianhao97 deleted the fix/gemm_a16w16_even_mn_vectorize branch March 27, 2026 01:52

Liang-jianhao97 merged 2 commits into main from fix/gemm_a16w16_even_mn_vectorize
