add gen_fake for 4 gemm operators by mqhc2020 · Pull Request #1456 · ROCm/aiter

mqhc2020 · 2025-11-21T03:40:49Z

Motivation

Enable torch compile for four Triton GEMM operators.

Technical Details

Add a torch compile guard and a gen_fake function to each operator.
Serialize and deserialize the config dictionary passed as an argument, because a dictionary argument blocks torch compile.
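The serialize/deserialize round trip can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the names `serialize_dict` and `deserialize_string` are taken from the review summary, but the JSON encoding, the empty-string convention for `None`, and the config keys shown here are assumptions made for the example.

```python
import json
from typing import Any, Dict, Optional


def serialize_dict(config: Optional[Dict[str, Any]]) -> str:
    """Encode a config dict as a deterministic string (assumed JSON format)."""
    if config is None:
        # Assumption: an empty string stands for "no config supplied".
        return ""
    # sort_keys makes the output independent of insertion order.
    return json.dumps(config, sort_keys=True, separators=(",", ":"))


def deserialize_string(text: str) -> Optional[Dict[str, Any]]:
    """Decode a string produced by serialize_dict back into a dict."""
    if not text:
        return None
    return json.loads(text)


# Hypothetical kernel config; the keys are illustrative only.
config = {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "num_warps": 4}
encoded = serialize_dict(config)
decoded = deserialize_string(encoded)
```

A string is hashable and has a stable representation, so it can cross an operator boundary that does not accept a plain dictionary. The operator body then calls `deserialize_string` to recover the original config.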

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

This reverts commit d0c445d.

Copilot

Pull Request Overview

This PR enables torch compile support for 4 GEMM operations by introducing serialization/deserialization utilities for configuration dictionaries and adding torch compile guards with fake tensor generation functions.

Key Changes:

Added serialize_dict and deserialize_string utilities to handle config dictionaries for torch compile compatibility
Refactored 4 GEMM functions to use torch compile guards with separate fake tensor implementations
Updated dtype references in fused_moe_bf16_asm.py to use the unified dtypes.fp8 instead of hardcoded torch.float8_e4m3fnuz
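To illustrate what a fake tensor implementation is for, here is a small sketch that computes only the output metadata of a GEMM and never touches data. It uses a plain dataclass in place of a real tensor so that it runs without PyTorch. `TensorMeta`, `gemm_fake`, the `(N, K)` weight layout, and the default output dtype are all assumptions for the example; the `y is None` branch mirrors the case named in this PR's commit history.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TensorMeta:
    """Stand-in for a tensor that carries shape and dtype but no data."""
    shape: Tuple[int, ...]
    dtype: str


def gemm_fake(
    x: TensorMeta,
    w: TensorMeta,
    y: Optional[TensorMeta] = None,
    out_dtype: str = "bfloat16",
) -> TensorMeta:
    """Return the metadata of x @ w.T without doing any computation."""
    m, k = x.shape
    n, k_w = w.shape  # assumed layout: weights stored as (N, K)
    if k != k_w:
        raise ValueError(f"inner dimensions differ: {k} vs {k_w}")
    if y is not None:
        # A preallocated output was supplied; it must already have shape (M, N).
        if y.shape != (m, n):
            raise ValueError(f"output shape {y.shape} != {(m, n)}")
        return y
    # No output buffer supplied: describe the one the real kernel would allocate.
    return TensorMeta((m, n), out_dtype)


out = gemm_fake(TensorMeta((8, 32), "bfloat16"), TensorMeta((16, 32), "bfloat16"))
```

A tracing compiler can call a function like this to learn output shapes and dtypes while it builds a graph, without launching the real kernel. In PyTorch the equivalent function would typically return an uninitialized tensor of the right shape and dtype.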

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

File	Description
aiter/ops/triton/utils/common_utils.py	Adds serialization utilities for converting config dicts to hashable strings and back for torch compile support
aiter/ops/triton/gemm_afp4wfp4_pre_quant_atomic.py	Refactors to add torch compile guard, fake tensor function, and config serialization
aiter/ops/triton/gemm_afp4wfp4.py	Refactors to add torch compile guard, fake tensor function, and config serialization
aiter/ops/triton/gemm_a16w16_atomic.py	Refactors to add torch compile guard, fake tensor function, and config serialization
aiter/ops/triton/batched_gemm_afp4wfp4_pre_quant.py	Refactors to add torch compile guard, fake tensor function, and config serialization
aiter/fused_moe_bf16_asm.py	Updates dtype references to use unified `dtypes.fp8` constant with improved error message
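The last row replaces a hardcoded FP8 type with a single shared constant. A possible reason for doing so is that the native FP8 format differs between GPU generations. The sketch below shows that idea with plain strings. The mapping, the helper names `resolve_fp8` and `check_fp8`, and the error wording are hypothetical and are not taken from the aiter `dtypes` module.

```python
# Hypothetical mapping from GPU architecture to its native FP8 format.
# Assumption: gfx942 uses the FNUZ variant and gfx950 uses the OCP variant.
_FP8_BY_ARCH = {
    "gfx942": "float8_e4m3fnuz",
    "gfx950": "float8_e4m3fn",
}


def resolve_fp8(arch: str) -> str:
    """Return the FP8 dtype name assumed for the given architecture."""
    try:
        return _FP8_BY_ARCH[arch]
    except KeyError:
        raise ValueError(f"no FP8 dtype known for architecture {arch!r}") from None


def check_fp8(dtype_name: str, arch: str) -> None:
    """Raise a descriptive error if a tensor's dtype is not the expected FP8 type."""
    expected = resolve_fp8(arch)
    if dtype_name != expected:
        raise TypeError(
            f"expected FP8 dtype {expected} on {arch}, got {dtype_name}"
        )


fp8_on_gfx950 = resolve_fp8("gfx950")
```

Call sites that compare against one resolved constant stay correct when the target architecture changes. A hardcoded type would have to be edited at every call site.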

* fix sink error for asm fmha (#1652) Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * add guard in case pynccl init failed (#1671) * One shot pa (#1670) * add one shot pa kernel * fix buffer load in sliding window kernel * fix typo * revert --------- Co-authored-by: root <root@hjbog-srdc-24.amd.com> * fix(pa_ps): fix pa_ps_asm .co for gfx950 (#1669) Signed-off-by: Double Young <yang.yang2@amd.com> * modify test_bf16gemm_test (#1678) * Fix Ruff command in pre-checks (#1675) * fix mha bwd golden perf issue (#1666) * topk uplift v1 (#1662) /lgtm The customer has tested the code. It can work. * topk uplift v1 * topk add api for choose topk_v1 or topk_v2 --------- Co-authored-by: yonshuai <yonshuai@amd.com> Co-authored-by: yongshuai <yongshuai@amd.com> * fix missing return in mha_bwd (#1688) * Remove the input parameter "out" in gemm_a4w4 (#1679) * Remove the input parameter "out" in gemm_a4w4 * update * format --------- Co-authored-by: valarLip <Lingpeng.Jin@amd.com> * fwd v3 hd192 optimize inst alignment for causal mode (#1663) Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * fix swa case mismatch (#1694) * fixing the fp4 gemm tune script Exception caused by tile_m name inconsistency (#1686) * CI: Migrate Triton tests to aiter-1gpu-runner (#1690) * add ntile 128 for a8 blkQ moe 1 stage (#1695) * add fmoe co with tilesize 32x128 * add ps co * fix pertoken co bug * add co to csv * add 128ntile logic for one stage asm * fix mem fault during perf turn * en vs for pertoken kernel --------- Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: zufayu <zufayu@amd.com> * Optimize RoPE in the cases that hdim is small. (#1698) * Introduce new grid config strategy for compatibility with cases that hdim is small. 
* add launch bound to make sure that occu is always 8 * follow Copilot the suggestions * rm garbage from whl (#1696) * enhance prebuild logic (#1672) * enhance prebuild logic * ATen.h build issues * bug fix * bug fix II * bug fix III --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * LLfp4 qr cap for atom (#1673) * QR cap implemented to limit QR to prefill * test git config * Fix to genericize qr comm cap * Incorrect cap number * [MLA] MLA conditions rewrite (#1665) * open mla mtp and remove some logs * fix qlen dense 128,N * fix hint * support sparse qlen input = 1 * change default splits * fix dp causal (#1677) * add two fp4 tune shapes and tuned config (#1687) * add two fp4 tune shapes and tuned config * change 32800 to 65536 to cover all cases between 32768 to 65536 as per feedback * Dev/a8w4 and a8w8splitk (#1667) * support moe a8w8 splitk (#1654) * Add support to a8w8_ck_moe_blk_gemm1 splitk * add switch and add some logging * tiny fix * update ck 3rd party and add some logging * add AITER_HEURISTIC_ONLY env * update ck * add condition to bypass tuned cfg * change bypass type * fix * fix removed log * upate ck submodule * fix lint * force to run tests --------- Co-authored-by: oscar <huaiguxu@amd.com> * Zan/moe a8w4 (#1655) * update * update * update quant * ut ready * update quant type * compile pass * python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready * update aiter dipatcher for bf16&fp8 * support a16 a8 dispatch * finish quant & sort * update aiter framework for a8w4 moe * update ck * update * update * update for atom * update --------- Co-authored-by: Zzz9990 <Zzz9990> Co-authored-by: root <root@hjbog-srdc-24.amd.com> * update ck * fix dispatch * fix too much logging * update * update ck * update ck * fix ruff code style * revert aiter-test yaml * fix ci * fix ci * fix ci * add mocked tuned result and decoding cfg token to next power of 2 * Update 
tuned_fmoe.csv remove duplicate * remove hack dtype * fix black * unique index * add empty arg to ck_moe_stage1 * resolve bias into lru cache * rename bypass cfg to AITER_BYPASS_TUNE_CONFIG --------- Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: Zzz9990 <zanzhang@amd.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: felix <felix.li@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> * bf16_gemm_clean_in_kl (#1700) * bf16_gemm_clean_in_kl * update * update * update * update * fix tuner (#1701) * fix tuner * Update gradlib/gradlib/GemmTuner.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add gen_fake for 4 gemm operators (#1456) Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com> * fix llvm issue (#1703) * fix llvm issue * fix copilot * feat: Adaptive topk algorithm selection based on input characteristics (#1578) * Add radix-base selection * Remove explicit template * Update the selected k condition * remove pos < k guard * code format * Update csrc/include/rocm_ops.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/kernels/topk_per_row_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update csrc/kernels/topk_plain_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update test_topk_plain.py * Update TODO message * Update csrc/kernels/topk_per_row_kernels.cu Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update op_tests/test_topk_plain.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * format test_topk_plain.py with black * Disable triton test for a resonalbe execution time * add explicit template instantiation * fix 
explicit template instantiation * add explicit template instantiation * Add bf16 support * Fix linter * Fix build errors * Fix condition * Fix build and test * Update conditions --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: MHYang <meng-hsuan.yang@amd.com> * fix mha bwd build error (#1705) * fix moe bug when pipever=v1 and nblk=64 (#1707) * fix bug * update * fix (#1710) * fix * update lint * [PA] Optimize PA Decode Gluon Performance for BF16/FP16 with KV_BLOCK_SIZE=64 and Fix ROCm 7.0 AOT Compilation (#1691) * Optimize pa_decode_gluon f16/bf16 perf for KV_BLOCK_SIZE=64 & fix ROCm 7.0 AOT - Add dedicated blocked layouts for f16/bf16 compute types - Add local AOT compile tool to fix ROCm 7.0 compatibility * black format file * format file to pass the ruff check * fix error in gfx950 * Fix argument parsing logic when AITER_JIT_DIR is set (#1715) When AITER_JIT_DIR is defined the enum module is loaded as "module_aiter_enum" rather than "aiter.jit.module_aiter_enum". This caused the docstring cleanup of enums to not work properly, causing a NameError exception in check_args. 
* fix topk deocde bug in logit value is same (#1716) Co-authored-by: yonshuai <yonshuai@amd.com> * add fp32 input (#1706) * add fp32 input * format code * perf bug fix * logic fix : out type != input type * bug fix * format code * remove dtype convert before act_and_mul in fused_moe --------- Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: chenjun <junchen2@amd.com> * add sampling aot (#1711) * add sampling aot * simple compile * fix compile bugs * fix a bug * revert changes --------- Co-authored-by: root <root@hjbog-srdc-24.amd.com> * update * bugfix * update * update --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Signed-off-by: Double Young <yang.yang2@amd.com> Co-authored-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com> Co-authored-by: who who who <fsx950223@outlook.com> Co-authored-by: root <root@hjbog-srdc-24.amd.com> Co-authored-by: Double Young <yang.yang2@amd.com> Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com> Co-authored-by: Satya Nikhil Kodukula <nikhil.kodukula@gmail.com> Co-authored-by: JaxChen29 <jichen@amd.com> Co-authored-by: steamedMantou <82486092+steamedMantou@users.noreply.github.com> Co-authored-by: yonshuai <yonshuai@amd.com> Co-authored-by: yongshuai <yongshuai@amd.com> Co-authored-by: Yu Guo <82124926+yuguo68@users.noreply.github.com> Co-authored-by: la <46212055+junhaha666@users.noreply.github.com> Co-authored-by: valarLip <Lingpeng.Jin@amd.com> Co-authored-by: shay-li77 <xiangxli@amd.com> Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com> Co-authored-by: Xin Huang <Xin.Huang@amd.com> Co-authored-by: zufayu <zufa.yu@amd.com> Co-authored-by: feifei14119 <feiw@amd.com> Co-authored-by: zufayu <zufayu@amd.com> Co-authored-by: ruanjm <jiming.ruan@amd.com> Co-authored-by: amirumoAMD <Amelia.Moore@amd.com> Co-authored-by: yadaish <yadai@amd.com> Co-authored-by: oscar <huaiguxu@amd.com> Co-authored-by: felix 
<felix.li@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: mqhc2020 <marvin.tsai@amd.com> Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com> Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com> Co-authored-by: MHYang <meng-hsuan.yang@amd.com> Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com> Co-authored-by: yanguahe <yanguahe@amd.com> Co-authored-by: omoisis-dn <omoisis@drivenets.com> Co-authored-by: chenjun <junchen2@amd.com>

This reverts commit ac6142e.

Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>

sogalin and others added 6 commits June 9, 2025 10:00

Add gfx950 support in asm_moe

d0c445d

Merge branch 'ROCm:main' into main

2c54ba2

Merge branch 'ROCm:main' into main

c5ec187

fix triton compile issue

c9ecccc

keep config

95924b3

add for other 3 functions

109e297

Copilot AI review requested due to automatic review settings November 21, 2025 03:40

Copilot started reviewing on behalf of mqhc2020 November 21, 2025 03:41

Revert "Add gfx950 support in asm_moe"

c7bda54

This reverts commit d0c445d.

Copilot finished reviewing on behalf of mqhc2020 November 21, 2025 03:43

Copilot AI reviewed Nov 21, 2025

for consistency

dbf6932

mqhc2020 requested review from ZhangLirong-amd and valarLip November 21, 2025 03:55

mqhc2020 added 5 commits November 21, 2025 11:27

fix errors

a0d98b8

fix gen_fake issues considering y is None case

4a9d464

fix error

82e523f

Merge branch 'main' into marv/gemm_torch_compile

bda6f57

fix conflicts

ae2022a

mqhc2020 requested a review from a team December 18, 2025 05:51

fix error

24c2ada

mqhc2020 changed the title ~~Enable torch compile for 4 gemms~~ add gen_fake for 4 gemm operators Dec 18, 2025

mqhc2020 added 4 commits December 18, 2025 02:18

simplify serialization

a596a62

for consistency

ce07338

fix black failure

90e00e4

fix ruff problems

16749d8

mqhc2020 force-pushed the marv/gemm_torch_compile branch from 78a031e to 16749d8 Compare December 19, 2025 08:36

Merge branch 'main' into marv/gemm_torch_compile

1df8aab

ZhangLirong-amd approved these changes Dec 21, 2025

mqhc2020 merged commit ac6142e into ROCm:main Dec 21, 2025
21 checks passed

azaidy added a commit that referenced this pull request Dec 27, 2025

Revert "add gen_fake for 4 gemm operators (#1456)"

8a0e41f

This reverts commit ac6142e.

azaidy mentioned this pull request Dec 27, 2025

Revert "add gen_fake for 4 gemm operators" #1746

Closed

ZhangLirong-amd pushed a commit that referenced this pull request Dec 29, 2025

add gen_fake for 4 gemm operators (#1456)

ba1acfc

Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>

farlukas pushed a commit that referenced this pull request Jan 5, 2026

add gen_fake for 4 gemm operators (#1456)

971495b

Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>

zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026

add gen_fake for 4 gemm operators (#1456)

0d829a5

Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>

valarLip pushed a commit that referenced this pull request Mar 18, 2026

add gen_fake for 4 gemm operators (#1456)

0acba73

Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>

valarLip pushed a commit that referenced this pull request Mar 18, 2026

add gen_fake for 4 gemm operators (#1456)

8f74222

Co-authored-by: Lin, Soga <soga.lin@amd.com> Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>

mqhc2020 merged 19 commits into ROCm:main from sogalin:marv/gemm_torch_compile

4 participants
