fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_…#2482
Merged
Liang-jianhao97 merged 2 commits into main on Mar 27, 2026
…a16w16

The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel recompilation when M/N change, but removes tt.divisibility=16 from M/N, causing AxisInfo to lose contiguity through RemOp and CmpOp, which degrades buffer_store from vectorized dwordx2 (vec=4) to scalar short (vec=1).

Introduce EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0 at compile time. When true, skip the modulo wrap on offs_am/offs_bn and use unmasked tl.store, restoring contiguity for vectorized memory operations without sacrificing the recompilation benefit.

Made-with: Cursor
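The `vec=4` and `vec=1` figures in the message above follow from the store width divided by the element width. The arithmetic can be sketched as below, assuming a 16-bit output element (consistent with the `a16w16` name and the "scalar short" store). The helper names and the 16-elements-per-thread figure are hypothetical and only there for illustration.

```python
DWORD_BITS = 32
ELEMENT_BITS = 16  # assumption: 16-bit output elements (e.g. bf16/fp16)


def elements_per_store(store_bits: int, element_bits: int) -> int:
    """Number of elements written by one store instruction of the given width."""
    return store_bits // element_bits


def stores_needed(n_elements: int, vec: int) -> int:
    """Store instructions needed to write n_elements at vector width vec (ceiling division)."""
    return -(-n_elements // vec)


vec_dwordx2 = elements_per_store(2 * DWORD_BITS, ELEMENT_BITS)  # 64-bit store
vec_short = elements_per_store(16, ELEMENT_BITS)                # 16-bit store

# Hypothetical workload: one thread owns 16 output elements.
print(vec_dwordx2, stores_needed(16, vec_dwordx2))
print(vec_short, stores_needed(16, vec_short))
```

Under these assumptions the scalar path issues four times as many store instructions for the same data, which is the regression the commit is addressing.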
Same fix as gemm_a16w16: the do_not_specialize=["M","N"] removes tt.divisibility=16, breaking vectorized store/load. Add EVEN_MN constexpr heuristic to conditionally skip modulo and mask when M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0, restoring contiguity. Made-with: Cursor
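Skipping the modulo and the mask is only safe because both are no-ops when the dimension is a whole multiple of the block size. A plain-Python check of that claim is sketched below; `tile_offsets` and `check_even` are hypothetical helpers written for this illustration and do not appear in the kernels.

```python
def tile_offsets(pid, block, dim, wrap):
    """Indices touched by tile `pid`; `wrap` applies the modulo used to stay in bounds."""
    offs = [pid * block + i for i in range(block)]
    return [o % dim for o in offs] if wrap else offs


def check_even(dim, block):
    """For dim % block == 0, verify the wrap changes nothing and every lane is in bounds."""
    assert dim % block == 0
    for pid in range(dim // block):
        plain = tile_offsets(pid, block, dim, wrap=False)
        assert plain == tile_offsets(pid, block, dim, wrap=True)  # modulo is a no-op
        assert all(o < dim for o in plain)                        # mask would be all-true
    return True


print(check_even(dim=1280, block=128))

# Ragged case (dim not a multiple of block): the last tile really does need the wrap or a mask.
ragged_plain = tile_offsets(pid=1, block=4, dim=6, wrap=False)
ragged_wrapped = tile_offsets(pid=1, block=4, dim=6, wrap=True)
print(ragged_plain, ragged_wrapped)
```

In the ragged case the unwrapped offsets run past the end of the dimension, so the modulo and mask must stay on that path.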
Dewei-Wang-sh approved these changes Mar 26, 2026
valarLip approved these changes Mar 26, 2026
azaidy added a commit that referenced this pull request Mar 27, 2026
brunomazzottiamd added commits to brunomazzottiamd/aiter that referenced this pull request on Mar 31, Apr 1 (three times), Apr 2, and Apr 6 (three times), 2026. Each carries the same message:

…in gemm(ROCm#2482)" This reverts commit 61804c6.

This is an experiment to double check the following failures in `op_tests/triton_tests/gemm/basic/test_gemm_a16w16.py`:

* test_gemm_a16_w16[True-dtype0-1024-1024-1024]
* test_gemm_a16_w16[True-dtype0-2048-2048-2048]
* test_gemm_a16_w16[True-dtype0-3072-3072-3072]
* test_gemm_a16_w16[True-dtype0-4096-4096-4096]
* test_gemm_a16_w16[True-dtype0-5120-5120-5120]
* test_gemm_a16_w16[True-dtype0-6144-6144-6144]
* test_gemm_a16_w16[True-dtype0-7168-7168-7168]
* test_gemm_a16_w16[True-dtype0-8192-8192-8192]
* test_gemm_a16_w16[True-dtype0-4864-4096-8192]
* test_gemm_a16_w16[True-dtype0-9728-8192-65536]
* test_gemm_a16_w16[True-dtype0-4864-8192-4160]
* test_gemm_a16_w16[True-dtype0-64-1280-8192]
* test_gemm_a16_w16[True-dtype0-128-1280-8192]
* test_gemm_a16_w16[True-dtype0-192-1280-8192]
* test_gemm_a16_w16[True-dtype0-256-1280-8192]
* test_gemm_a16_w16[True-dtype0-320-1280-8192]
* test_gemm_a16_w16[True-dtype0-512-1280-8192]
* test_gemm_a16_w16[True-dtype0-1024-1280-8192]
* test_gemm_a16_w16[True-dtype0-2048-1280-8192]
* test_gemm_a16_w16[True-dtype0-4096-1280-8192]
* test_gemm_a16_w16[True-dtype0-8192-1280-8192]
* test_gemm_a16_w16[True-dtype0-16384-1280-8192]
* test_gemm_a16_w16[True-dtype0-1-8192-1024]
* test_gemm_a16_w16[True-dtype0-512-8192-1024]
* test_gemm_a16_w16[True-dtype0-1024-8192-1024]
* test_gemm_a16_w16[True-dtype0-2048-8192-1024]
* test_gemm_a16_w16[True-dtype0-4096-8192-1024]
* test_gemm_a16_w16[True-dtype0-8192-8192-1024]
* test_gemm_a16_w16[True-dtype0-16384-8192-1024]
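When triaging the failing IDs above, it can help to sort them by whether a given shape would take the EVEN_MN path. The sketch below does that. It rests on two assumptions not confirmed by the page: that the last three fields of each test ID are M, N, K in that order, and that the block sizes passed in are whatever the selected kernel config uses (the 64x64 used here is purely hypothetical).

```python
import re

# Assumed ID layout: test_name[<flag>-<dtype>-<M>-<N>-<K>]
ID_PATTERN = re.compile(r"\[(\w+)-(\w+)-(?P<M>\d+)-(?P<N>\d+)-(?P<K>\d+)\]$")


def parse_shape(test_id):
    """Extract (M, N, K) from a pytest ID, under the assumed field order."""
    match = ID_PATTERN.search(test_id)
    if match is None:
        raise ValueError(f"unrecognized test id: {test_id}")
    return int(match["M"]), int(match["N"]), int(match["K"])


def takes_even_path(test_id, block_m, block_n):
    """True if this shape would set EVEN_MN for the given (hypothetical) block sizes."""
    m, n, _ = parse_shape(test_id)
    return m % block_m == 0 and n % block_n == 0


sample_ids = [
    "test_gemm_a16_w16[True-dtype0-1024-1024-1024]",
    "test_gemm_a16_w16[True-dtype0-4864-8192-4160]",
    "test_gemm_a16_w16[True-dtype0-1-8192-1024]",
]
for test_id in sample_ids:
    print(test_id, takes_even_path(test_id, block_m=64, block_n=64))
```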
Motivation
fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_a16w16
Technical Details
The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel
recompilation when M/N change, but removes tt.divisibility=16 from M/N,
causing AxisInfo to lose contiguity through RemOp and CmpOp, which degrades
buffer_store from vectorized dwordx2 (vec=4) to scalar short (vec=1).
Introduce EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and
N%BLOCK_SIZE_N==0 at compile time. When true, skip the modulo wrap on
offs_am/offs_bn and use unmasked tl.store, restoring contiguity for
vectorized memory operations without sacrificing the recompilation benefit.
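The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the aiter kernel: the kernel name `gemm_sketch`, the helper `even_mn`, the 2D launch grid, and the K-loop details are all assumptions. The `try`/`except` around the import is only there so the host-side helper can be read and exercised on a machine without Triton.

```python
try:
    import triton
    import triton.language as tl
except ImportError:  # host-side helper below still works without Triton
    triton = None


def even_mn(args):
    """Heuristic: True when M and N are whole multiples of their block sizes."""
    return args["M"] % args["BLOCK_SIZE_M"] == 0 and args["N"] % args["BLOCK_SIZE_N"] == 0


if triton is not None:

    @triton.heuristics({"EVEN_MN": even_mn})
    @triton.jit(do_not_specialize=["M", "N"])
    def gemm_sketch(a_ptr, b_ptr, c_ptr, M, N, K,
                    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                    BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                    BLOCK_SIZE_K: tl.constexpr, EVEN_MN: tl.constexpr):
        pid_m = tl.program_id(0)
        pid_n = tl.program_id(1)
        offs_am = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
        offs_bn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
        if not EVEN_MN:
            # Ragged shapes: wrap out-of-range rows/columns back into bounds.
            offs_am = offs_am % M
            offs_bn = offs_bn % N
        offs_k = tl.arange(0, BLOCK_SIZE_K)
        a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
        b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn

        acc = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
            k_mask = offs_k < K - k * BLOCK_SIZE_K
            a = tl.load(a_ptrs, mask=k_mask[None, :], other=0.0)
            b = tl.load(b_ptrs, mask=k_mask[:, None], other=0.0)
            acc += tl.dot(a, b)
            a_ptrs += BLOCK_SIZE_K * stride_ak
            b_ptrs += BLOCK_SIZE_K * stride_bk

        c = acc.to(c_ptr.dtype.element_ty)
        offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
        offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
        c_ptrs = c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
        if EVEN_MN:
            # Every tile is full, so no mask is needed.
            tl.store(c_ptrs, c)
        else:
            c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
            tl.store(c_ptrs, c, mask=c_mask)
```

A launch would pass the block sizes as keyword arguments with a grid of `(triton.cdiv(M, BLOCK_SIZE_M), triton.cdiv(N, BLOCK_SIZE_N))`. Since `EVEN_MN` is a constexpr, each block configuration should compile to at most two variants (one per value) instead of one per distinct M/N, which is how the recompilation benefit is kept.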
Test Plan
bench_gemm_a16w16.py
Test Result
previous

current
