Wjx/ck tile moe merge#1729

Merged
Zzz9990 merged 41 commits into wjx/ck_tile_moe from wjx/ck_tile_moe_merge on Dec 25, 2025

@Zzz9990 Zzz9990 commented Dec 25, 2025

Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

LJ-underdog and others added 30 commits December 17, 2025 16:28
Signed-off-by: Linjun-AMD <Jun.Lin@amd.com>
* add one shot pa kernel

* fix buffer load in sliding window kernel

* fix typo

* revert

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Signed-off-by: Double Young <yang.yang2@amd.com>
/lgtm

The customer has tested the code. It can work.

* topk uplift v1

* topk add api for choose topk_v1 or topk_v2

---------

Co-authored-by: yonshuai <yonshuai@amd.com>
Co-authored-by: yongshuai <yongshuai@amd.com>
* Remove the input parameter "out" in gemm_a4w4

* update

* format

---------

Co-authored-by: valarLip <Lingpeng.Jin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* add fmoe co with tilesize 32x128

* add ps co

* fix pertoken co bug

* add co to csv

* add 128ntile logic for one stage asm

* fix mem fault during perf tune

* en vs for pertoken kernel

---------

Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: zufayu <zufayu@amd.com>
* Introduce new grid config strategy for compatibility with cases that hdim is small.

* add launch bound to make sure that occu is always 8

* follow the Copilot suggestions
* enhance prebuild logic

* ATen.h build issues

* bug fix

* bug fix II

* bug fix III

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* QR cap implemented to limit QR to prefill

* test git config

* Fix to genericize qr comm cap

* Incorrect cap number
* open mla mtp and remove some logs

* fix qlen dense 128,N

* fix hint

* support sparse qlen input = 1

* change default splits
* add two fp4 tune shapes and tuned config

* change 32800 to 65536 to cover all cases between 32768 and 65536 as per feedback
* support moe a8w8 splitk  (#1654)

* Add support to a8w8_ck_moe_blk_gemm1 splitk

* add switch and add some logging

* tiny fix

* update ck 3rd party and add some logging

* add AITER_HEURISTIC_ONLY env

* update ck

* add condition to bypass tuned cfg

* change bypass type

* fix

* fix removed log

* update ck submodule

* fix lint

* force to run tests

---------

Co-authored-by: oscar <huaiguxu@amd.com>

* Zan/moe a8w4 (#1655)

* update

* update

* update quant

* ut ready

* update quant type

* compile pass

* python3 op_tests/test_moe_2stage.py -t 16 -e 1 -k 1 -dim 256,256 ready

* update aiter dispatcher for bf16&fp8

* support a16 a8 dispatch

* finish quant & sort

* update aiter framework for a8w4 moe

* update ck

* update

* update

* update for atom

* update

---------

Co-authored-by: Zzz9990 <Zzz9990>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* update ck

* fix dispatch

* fix too much logging

* update

* update ck

* update ck

* fix ruff code style

* revert aiter-test yaml

* fix ci

* fix ci

* fix ci

* add mocked tuned result and decoding cfg token to next power of 2

* Update tuned_fmoe.csv

remove duplicate

* remove hack dtype

* fix black

* unique index

* add empty arg to ck_moe_stage1

* resolve bias into lru cache

* rename bypass cfg to AITER_BYPASS_TUNE_CONFIG

---------

Co-authored-by: oscar <huaiguxu@amd.com>
Co-authored-by: Zzz9990 <zanzhang@amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: felix <felix.li@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
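
One commit above mentions rounding the decoding config token count "to next power of 2". The snippet below is only a minimal sketch of that rounding step, not the code from this PR; `next_power_of_two`, `lookup_tuned_config`, and the dictionary layout are hypothetical names chosen for illustration.

```python
def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (1 for n <= 1)."""
    if n <= 1:
        return 1
    # (n - 1).bit_length() is the number of bits needed for n - 1,
    # so shifting 1 by that amount gives the next power of two.
    return 1 << (n - 1).bit_length()


def lookup_tuned_config(tuned: dict, token: int, default=None):
    """Hypothetical lookup: key the tuned table by the rounded token count."""
    return tuned.get(next_power_of_two(token), default)


# Illustrative table only; the values are placeholders, not real tuning data.
tuned = {64: "cfg_64", 128: "cfg_128"}
print(lookup_tuned_config(tuned, 65))  # 65 rounds up to 128
```

Rounding this way lets every token count in a range share one tuned entry, at the cost of using a config tuned for a slightly larger shape.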
* bf16_gemm_clean_in_kl

* update

* update

* update

* update
* fix tuner

* Update gradlib/gradlib/GemmTuner.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: amd-ruitang3 <145657428+amd-ruitang3@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lin, Soga <soga.lin@amd.com>
Co-authored-by: sogalin <39478626+sogalin@users.noreply.github.com>
* fix llvm issue

* fix copilot
#1578)

* Add radix-based selection

* Remove explicit template

* Update the selected k condition

* remove pos < k guard

* code format

* Update csrc/include/rocm_ops.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/kernels/topk_per_row_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update csrc/kernels/topk_plain_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update test_topk_plain.py

* Update TODO message

* Update csrc/kernels/topk_per_row_kernels.cu

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update op_tests/test_topk_plain.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* format test_topk_plain.py with black

* Disable triton test for a reasonable execution time

* add explicit template instantiation

* fix explicit template instantiation

* add explicit template instantiation

* Add bf16 support

* Fix linter

* Fix build errors

* Fix condition

* Fix build and test

* Update conditions

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: MHYang <meng-hsuan.yang@amd.com>
valarLip and others added 9 commits December 22, 2025 20:34
* fix

* update lint
…_SIZE=64 and Fix ROCm 7.0 AOT Compilation (#1691)

* Optimize pa_decode_gluon f16/bf16 perf for KV_BLOCK_SIZE=64 & fix ROCm 7.0 AOT

- Add dedicated blocked layouts for f16/bf16 compute types
- Add local AOT compile tool to fix ROCm 7.0 compatibility

* black format file

* format file to pass the ruff check

* fix error in gfx950
When AITER_JIT_DIR is defined, the enum module is loaded as "module_aiter_enum" rather than "aiter.jit.module_aiter_enum".
This prevented the docstring cleanup of enums from working properly, which caused a NameError exception in check_args.
Co-authored-by: yonshuai <yonshuai@amd.com>
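
The commit message above describes a check that failed because the same module can be loaded under two different names. One way to make such a check robust, sketched here under the assumption that comparing the last dotted component is enough, is shown below; `is_enum_module` is a hypothetical helper and is not claimed to be the fix applied in this PR.

```python
ENUM_MODULE_LEAF = "module_aiter_enum"  # leaf name quoted in the commit message


def is_enum_module(module_name: str) -> bool:
    """Match the enum module whether or not it carries a package prefix.

    Accepts both "module_aiter_enum" and "aiter.jit.module_aiter_enum"
    by comparing only the last dotted component of the name.
    """
    return module_name.rsplit(".", 1)[-1] == ENUM_MODULE_LEAF


for name in ("module_aiter_enum", "aiter.jit.module_aiter_enum", "aiter.jit.other"):
    print(name, is_enum_module(name))
```

Comparing the leaf name avoids hard-coding the package path, which changes depending on how the module was loaded.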
* add fp32 input

* format code

* perf bug fix

* logic fix : out type != input type

* bug fix

* format code

* remove dtype convert before act_and_mul in fused_moe

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
* add sampling aot

* simple compile

* fix compile bugs

* fix a bug

* revert changes

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
@Zzz9990 Zzz9990 force-pushed the wjx/ck_tile_moe_merge branch from 122fc80 to c5aa662 Compare December 25, 2025 03:35
@Zzz9990 Zzz9990 marked this pull request as ready for review December 25, 2025 10:16
@Zzz9990 Zzz9990 merged commit 532945a into wjx/ck_tile_moe Dec 25, 2025
4 checks passed
@Zzz9990 Zzz9990 deleted the wjx/ck_tile_moe_merge branch December 25, 2025 10:16
@Zzz9990 Zzz9990 restored the wjx/ck_tile_moe_merge branch December 25, 2025 10:18