rm gemm_common bind by yzhou103 · Pull Request #2425 · ROCm/aiter

yzhou103 · 2026-03-23T09:08:14Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR removes two small pybind-based helper modules (module_gemm_common bindings and module_aiter_enum) and replaces them with (1) a ctypes-exposed getPaddedM function for padded-M calculation and (2) Python IntEnum definitions for ActivationType/QuantType. It also adjusts the ctypes JIT caller to better match function signatures/return types.
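As a rough illustration of the enum half of this change, the sketch below defines `ActivationType` and `QuantType` as Python `IntEnum` classes. The member names and integer values are hypothetical placeholders, not taken from the PR; the real values would have to mirror `csrc/include/aiter_enum.h`.

```python
from enum import IntEnum


class ActivationType(IntEnum):
    # Hypothetical members and values, for illustration only.
    # The real definitions must stay in sync with csrc/include/aiter_enum.h.
    No = -1
    Silu = 0
    Gelu = 1


class QuantType(IntEnum):
    # Hypothetical members and values, for illustration only.
    No = 0
    per_Tensor = 1
    per_Token = 2


# IntEnum members are ints, so they can be handed to a C function expecting an int
# without needing a compiled binding module to define the type.
quant_as_int = int(QuantType.per_Token)
activation_from_int = ActivationType(1)
```

Because the classes are pure Python, importing them does not require a JIT build step, which appears to be the point of dropping the pybind enum module.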

Changes:

Drop module_aiter_enum and module_gemm_common pybind glue, shifting getPaddedM to a ctypes-callable exported C symbol and enums to Python IntEnum.
Enhance compile_ops(..., ffi_type="ctypes") calling path to infer restype (int/float) and to only append hipStream_t when tensor arguments exist.
Update build/prebuild configuration and tuned GEMM CSV entries accordingly.
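The ctypes calling-path change in the list above can be sketched as two small helpers: one that infers a `restype` from a Python return annotation, and one that appends a stream handle only when a tensor argument is present. This is a minimal sketch under stated assumptions, not the code in `aiter/jit/core.py`: `FakeTensor`, `infer_restype`, `build_c_args`, the stub functions, and the annotation-to-ctypes mapping are all hypothetical.

```python
import ctypes
import inspect


class FakeTensor:
    """Hypothetical stand-in for a device tensor; only data_ptr() matters here."""

    def __init__(self, ptr):
        self._ptr = ptr

    def data_ptr(self):
        return self._ptr


# Assumed mapping from Python return annotations to ctypes result types.
_RESTYPE_BY_ANNOTATION = {int: ctypes.c_int, float: ctypes.c_float}


def infer_restype(py_func):
    """Pick a ctypes restype from the signature; None stands for a void return."""
    annotation = inspect.signature(py_func).return_annotation
    return _RESTYPE_BY_ANNOTATION.get(annotation)


def build_c_args(args, stream):
    """Convert arguments for a C call, appending the stream only if a tensor is present."""
    c_args = []
    has_tensor = False
    for arg in args:
        if hasattr(arg, "data_ptr"):
            has_tensor = True
            c_args.append(ctypes.c_void_p(arg.data_ptr()))
        else:
            c_args.append(arg)
    if has_tensor:
        # The stream is modeled as an opaque integer handle in this sketch.
        c_args.append(ctypes.c_void_p(stream))
    return c_args


# Hypothetical Python-side stubs whose annotations drive the inference.
def padded_m_stub(m: int) -> int: ...
def scale_stub(x: float) -> float: ...
def launch_stub(t): ...


scalar_call = build_c_args([3, 4], stream=0x1234)
tensor_call = build_c_args([FakeTensor(16), 2], stream=0x1234)
```

A scalar-only helper such as a padded-M calculation then receives exactly its declared arguments, while kernel launches that take tensors still get the trailing stream.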

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| setup.py | Adjusts prebuild module filtering (removes `module_aiter_enum` special-casing). |
| csrc/pybind/gemm_common_pybind.cu | Removed pybind module for GEMM common helper. |
| csrc/pybind/aiter_enum_pybind.cu | Removed pybind module for enum exposure. |
| csrc/py_itfs_cu/gemm_common.cu | Exports `getPaddedM` as a default-visible `extern "C"` symbol for ctypes use. |
| csrc/include/rocm_ops.hpp | Removes pybind macro helpers for enums and `get_padded_m`. |
| csrc/include/gemm_common.h | Makes `getPaddedM` a plain C ABI header (no Torch include). |
| csrc/include/aiter_enum.h | Adds a source-of-truth comment for enums. |
| aiter/ops/gemm_op_common.py | Switches `get_padded_m` to `compile_ops(..., ffi_type="ctypes", fc_name="getPaddedM")`. |
| aiter/ops/enum.py | Replaces JIT/pybind-derived enum types with Python `IntEnum` values. |
| aiter/jit/optCompilerConfig.json | Removes `module_aiter_enum` and removes gemm_common pybind source from `module_gemm_common`. |
| aiter/jit/core.py | Updates ctypes call wrapper: return type inference + conditional stream argument + returns underlying C result. |
| aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv | Removes a subset of tuned entries (notably kernelId 161 rows shown in diff). |
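The `gemm_common.cu` and `gemm_op_common.py` entries describe exporting a plain `extern "C"` symbol and calling it through ctypes. The sketch below shows that general pattern using the C runtime's `abs` as a stand-in, since the built aiter shared object is not available here; it assumes a POSIX system, and nothing in it reflects the actual signature of `getPaddedM`.

```python
import ctypes
import ctypes.util

# Stand-in library: the C runtime. With aiter, this would be the JIT-built shared
# object that exports getPaddedM. If find_library returns None, CDLL(None) falls
# back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature explicitly so ctypes converts arguments and the
# return value correctly instead of relying on its defaults.
c_abs = libc.abs
c_abs.argtypes = [ctypes.c_int]
c_abs.restype = ctypes.c_int

result = c_abs(-7)
print(result)
```

The same two declarations (`argtypes` and `restype`) are what a wrapper needs to get right for any exported C symbol; a symbol compiled with hidden visibility or C++ name mangling would not be found by attribute lookup like this.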

* add fused_qknorm hip kernel (#2442)
  * base impl
  * improve big M
  * optimize v2
  * fallback split norm for bigger M
  * fix rebase failed
  * code format
  * fix include error
  * support outplace works for torch compile
  * add output args

  Co-authored-by: Guanbao Yu <gyu@amd.com>

* feat: add fast gelu (#2220)
  * add fast gelu
  * fix ut
  * refactor
  * add more log
  * change to empty to avoid perf downgrade
  * make black lint happy
  * make ruff happy
  * fix import error

* Fix build CK pipeline (#2399)
  * Fix renamed CK pipeline
  * Fix CK version
  * Rename eight_warps to eight_waves for consistency

  Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* [OPUS] gfx1250 support for opus wmma scale and moe_sorting kernel (#2449)
  * [OPUS] Add gfx1250 support for opus moe_sorting kernel

    Add gfx1250 (MI450) to the GPU architecture maps and fix the opus moe_sorting kernel to use the correct wait counter instruction on gfx12 ISA (s_wait_dscnt instead of s_waitcnt_lgkmcnt).

  * [OPUS] Add WMMA scale instruction support for gfx1250

    Add scaled WMMA dispatch (BX32 int / BX16 long) to struct wmma for:
    - wmma_scale[16]_f32_16x16x128_f8f6f4 (fp8 fmt=0, fp4 fmt=4)
    - wmma_scale[16]_f32_32x16x128_f4 (dedicated fp4)

    Key changes in opus.hpp:
    - Add fmt_a/fmt_b format codes and per-lane E8M0 scale operator() overloads with compile-time scale_sel (OPSEL) for lane group selection
    - Fix vtype for packed fp4: use reg_bytes_a/b to compute correct hardware register size (elem * bits / 8 bytes, not elem * sizeof)
    - Add zero-pad helper for fp4-via-f8f6f4 (i32x8 -> i32x16)
    - Add scaled overloads to wmma_adaptor_swap_ab

    Device tests (op_tests/opus/device/test_wmma_scale.cu):
    - Raw warp-level: 6 variants (fp8/fp4 x BX32/BX16 + 32x16 fp4)
    - Tiled MMA: 1x1, 2x2, 4x1 wave configurations via make_tiled_mma
    - Per-lane random scale: E8M0 exponents in [122..133], bitwise exact
    - All tests PASS on gfx1250, SKIP on gfx942/gfx950

  * [OPUS] Fix black/ruff lint: rename ambiguous var `l` to `lane`

* reduce wasted get_module overhead for module with custom module name (#2455)

* fix moe gemm tuned config (#2463)
  * fix
  * add e=256 k=8 tuned config
  * revert t=1024,2048 back to ck
  * update
  * update

* Skip `test_metadata_redirect.py` on archs other than `gfx942` (#2456)

  This test was designed to run only on `gfx942`. We want to run all Triton tests on `gfx950` as well - this is the reason behind the proposed changes.

* [Triton] Fix bench_mha (#2317)
  * fix bench_mha
  * fix bench_mha parsing logic of bench_attn_models
  * fix bench_mla_decode parsing logic of bench_attn_models

  Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>

* test_mla_persistent.py split kv reference fix max_seq_q != 1 error (#2363)

* CI: fix dubious ownership for sglang checkout (#2477)
  * CI: mark sglang checkout paths as safe directories

    Add safe.directory entries before dependency installation for both host and container checkout paths to avoid git dubious ownership failures on /sglang-checkout.

  * CI: align sglang safe.directory fix with upstream

    Move safe.directory configuration to run right after container startup and only mark /sglang-checkout, matching upstream sglang handling for dubious ownership.

* CI: use pip editable install and safe.directory in runtime CI (#2474)

* Fix CKTile blockscale GEMM to read strides from tensor metadata (#2466)

  The CKTile blockscale GEMM wrapper hardcoded leading-dimension strides as stride_A = K, stride_B = K, stride_C = N, assuming fully contiguous row-major layout. This produced silently wrong results when input tensors had non-standard strides.

  In vLLM on ROCm, _maybe_pad_fp8_weight pads FP8 weight tensors for alignment and then creates a narrowed view, producing tensors whose logical shape is [N, K] but whose physical stride is [K+pad, 1]. The hardcoded stride_B = K caused the kernel to read from wrong memory offsets, leading to garbage output.

  Fix: read leading-dimension strides from the PyTorch tensor metadata (XQ.stride(0), WQ.stride(0), etc.) instead of assuming dense layout. Add TORCH_CHECK assertions to verify inner-dimension contiguity (stride == 1), which is required by the CKTile kernel.

  The old CK backend (gemm_a8w8_blockscale_common.cuh) already reads strides from tensor metadata, which is why it was unaffected.

  Made-with: Cursor

* Add LSE output support for MLA decode qseqlen=1 persistent kernel (gf… (#2440)
  * Add LSE output support for MLA decode qseqlen=1 persistent kernel (gfx950)
  * Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16
  * Add qseqlen fold for MLA on gfx950: use qh64 kernel instead of qh16

  Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>

* tuned qwen3.5 gemm (#2485)

  Signed-off-by: Guanbao Yu <gyu@amd.com>
  Co-authored-by: Guanbao Yu <gyu@amd.com>

* [Triton] Flash Attention Triton Windows build support (#2433)
  * Initial FA-2 Triton Windows build support
  * Continue Work lint minimize diff windows smoke test address copilot fixes remove is_windows error improve windows message

  Co-authored-by: 0xDELUXA <djernovevo@gmail.com>

* fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm(#2482)
  * fix(gemm): add EVEN_MN heuristic to restore vectorized store in gemm_a16w16

    The do_not_specialize=["M","N"] change (commit 3170a51) prevents kernel recompilation when M/N change, but removes tt.divisibility=16 from M/N, causing AxisInfo to lose contiguity through RemOp and CmpOp, which degrades buffer_store from vectorized dwordx2 (vec=4) to scalar short (vec=1).

    Introduce EVEN_MN constexpr heuristic that checks M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0 at compile time. When true, skip the modulo wrap on offs_am/offs_bn and use unmasked tl.store, restoring contiguity for vectorized memory operations without sacrificing the recompilation benefit.

    Made-with: Cursor

  * fix(gemm): add EVEN_MN heuristic to batched_gemm_a8w8 kernel

    Same fix as gemm_a16w16: the do_not_specialize=["M","N"] removes tt.divisibility=16, breaking vectorized store/load. Add EVEN_MN constexpr heuristic to conditionally skip modulo and mask when M%BLOCK_SIZE_M==0 and N%BLOCK_SIZE_N==0, restoring contiguity.

    Made-with: Cursor

  Co-authored-by: jianlian <jianlian@amd.com>

* rm gemm_common bind (#2425)
  * rm gemm_commona= and quant type bind
  * retune failed shape in a8w8_bpreshuffle_tuned_gemm.csv
  * recover enum

* assert when found duplicated tuned shape (#2376)
  * assert when found duplicated tuned shape
  * rm duplicated tuned shape and update tuned file in model_configs
  * fix lint

Signed-off-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: XiaobingZhang <xiaobingzhangupc@gmail.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: ChenYou <youchen@amd.com>
Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: Elton <zhimding@amd.com>
Co-authored-by: Bruno Mazzotti <bruno.mazzotti@amd.com>
Co-authored-by: Michael Melesse <micmelesse@gmail.com>
Co-authored-by: minmengdie <memin@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Sami Remes <samremes@amd.com>
Co-authored-by: fangche123 <Fang.Che@amd.com>
Co-authored-by: root <root@gbt350-odcdh2-a11-1.png-odc.dcgpu>
Co-authored-by: Pleaplusone <ygan@amd.com>
Co-authored-by: 0xDELUXA <djernovevo@gmail.com>
Co-authored-by: jianhao <Jianhao.Liang@amd.com>
Co-authored-by: jianlian <jianlian@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
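The CKTile stride fix described in the commit message above (#2466) comes down to offset arithmetic: a narrowed view of a padded buffer has logical shape [N, K] but a row stride of K + pad. The sketch below models that with a flat Python list; the sizes and the helper names `element_offset` and `read` are made up for illustration.

```python
def element_offset(row, col, row_stride):
    """Flat offset of element (row, col) when the inner stride is 1."""
    return row * row_stride + col


N, K, pad = 3, 4, 2
physical_row_stride = K + pad

# Flat storage for a padded [N, K + pad] buffer; padding slots hold -1.
storage = []
for n in range(N):
    storage.extend([n * 10 + k for k in range(K)] + [-1] * pad)


def read(row, col, row_stride):
    return storage[element_offset(row, col, row_stride)]


# Using the stride recorded in the tensor metadata lands on the right element.
correct = read(1, 0, physical_row_stride)
# Assuming a dense layout (row stride == K) lands inside row 0's padding instead.
wrong = read(1, 0, K)
```

Here `correct` is the first element of row 1, while `wrong` reads a padding slot, which is the kind of silently wrong result the commit message attributes to the hardcoded `stride_B = K`.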

rm gemm_commona= and quant type bind

b6d6e0d

yzhou103 requested review from a team and Copilot March 23, 2026 09:08

Copilot started reviewing on behalf of yzhou103 March 23, 2026 09:09

retune failed shape in a8w8_bpreshuffle_tuned_gemm.csv

1d3f455

Copilot AI reviewed Mar 23, 2026

Comment thread csrc/include/aiter_enum.h

Comment thread setup.py Outdated

yzhou103 changed the title ~~rm gemm_commona= and quant type bind~~ rm gemm_common and quant type bind Mar 23, 2026

yzhou103 added 3 commits March 23, 2026 17:26

Merge branch 'main' into refactor_gemm_common_quantype_bind

d747886

recover enum

60f06c3

Merge branch 'main' into refactor_gemm_common_quantype_bind

4a10863

yzhou103 changed the title ~~rm gemm_common and quant type bind~~ rm gemm_common bind Mar 23, 2026

valarLip added the ci:all label Mar 23, 2026

yzhou103 added 2 commits March 26, 2026 10:01

Merge branch 'main' into refactor_gemm_common_quantype_bind

ba21c86

Merge branch 'main' into refactor_gemm_common_quantype_bind

5669ebd

valarLip approved these changes Mar 27, 2026

valarLip merged commit acb21e5 into main Mar 27, 2026
30 checks passed

valarLip deleted the refactor_gemm_common_quantype_bind branch March 27, 2026 02:08

valarLip merged 7 commits into main from refactor_gemm_common_quantype_bind


3 participants
