
[Update] update dev/perf branch to main branch commit 9c2df5 and cherry-pick mla_fix, mmd/fix/fwd_v3 #1467

Merged
wuhuikx merged 115 commits into dev/perf from yuhua/dev/perf on Nov 24, 2025
Conversation

@zhuyuhua-v (Contributor) commented Nov 22, 2025

Motivation

  1. [Update] update dev/perf branch to main branch commit 9c2df5a
     Author: TennyWang1223
     Date: Fri Nov 21 09:46:03 2025 +0800
     [fix]: add ar switch (#1376)
  2. Cherry-pick mla_fix ("remove maxsize for mla lru cache", #1445) and mmd/fix/fwd_v3 ("fix fwd_v3 output/lse is nan when kseq=0 and fix qseq >> kseq error", #1442) to fix the DeepSeek-R1 accuracy issue.

Test Plan

gsm8k

Test Result

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9477 | ± 0.0061 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.9477 | ± 0.0061 |

lalala-sh and others added 29 commits November 23, 2025 22:24
* integrate m grouped gemm

* update ck

* add limit for 950

* rename deepgemm
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* Enable large batch size and optimization of non-Ragged batching

* Add RAGGED_BATCH to test_la.py and bench_la.py
* CI: Optimize autotuning pipeline and initialize the docs
* initial commit for topk per row kernel

* topk per row kernel initial commit

* Fix the typo issue

* Add the topk per row kernel

* optimizations for topk_per_row kernel

* fix overflow

* add unit test for topk_per_row_decode

* update test for decode

* apply vector dispatch from carlus

---------

Co-authored-by: root <root@smci355-ccs-aus-m06-29.cs-aus.dcgpu>
Co-authored-by: valarLip <340077269@qq.com>
* fix aot

* remove other kernels path

* fix aot

* format code

---------

Co-authored-by: root <root@hjbog-srdc-24.amd.com>
* Fix fp8 issue in torch compile

* use less code
* CI: Optimize autotuning pipeline docs

* Update docs/autotuning_pipeline.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix fwd perf calc error

* black aiter/ops/triton/_triton_kernels/gemm_afp4wfp4.py
* add asm kernel performance data for the attention forward and attention backward passes

* modify perf data

* fix perf data

* add a16 perf data
* Add topk softmax

* Add test for topk sigmoid

* register the op properly

* apply black

* don't use constexpr with std::string

* bump ck to include topk sigmoid commit

* hipify

* add argparse to the topk sigmoid test, also add pytest

* use own module instead of asm moe

* black formatting

* add missing file

* revert changes to module_moe_asm
* [fea]: add fused allreduce rmsnorm kernel

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix: fuse ar rms interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* fix ck branch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* update ar interface

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
* add indexer_k_quant_and_cache & cp_gather_indexer_k_quant_cache

* indexer_k_quant_and_cache opt kernel and add test

* update

* update2

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Currently it will look like this:
```log
File "TransformerEngine/3rdparty/aiter/aiter/jit/utils/chip_info.py", line 77, in get_gfx_custom_op_core
  raise KeyError(
KeyError: 'Unknown GPU architecture: . Supported architectures: [0, 1, 2, 3, 4, 5, 6, 7, 8]'
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
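
The improvement here is presumably to make the architecture lookup fail with a readable message even when the detected arch string is empty (the original message printed table indices instead of names). A minimal sketch of that kind of guarded lookup; the table contents and return values are hypothetical, not aiter's actual code:

```python
# Hypothetical sketch of the lookup in chip_info.py's get_gfx_custom_op_core;
# the real table and return values may differ.
SUPPORTED = {"gfx942": 0, "gfx950": 1}

def get_gfx_custom_op_core(arch: str) -> int:
    try:
        return SUPPORTED[arch]
    except KeyError:
        # List the supported architecture *names*, not dict values, so the
        # message stays readable even when `arch` is an empty string.
        raise KeyError(
            f"Unknown GPU architecture: {arch!r}. "
            f"Supported architectures: {sorted(SUPPORTED)}"
        ) from None
```
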
* using standalone pybind

* fix

* update
* update codegen.py

* update kernels & kernel launch

* fix fa bwd dq_acc shape

* remove mask in python api
Creates a unified pre-checks.yaml workflow that runs Black, Ruff, and dependency checks, uploading success/failure signal artifacts
The other heavy jobs download and verify the signal artifacts: if verification succeeds, the heavy job continues running; if it fails, the heavy job exits immediately.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
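
The gating step described above can be sketched as follows; the artifact directory and signal file names are assumptions, not the workflow's actual names:

```python
from pathlib import Path

def prechecks_passed(artifact_dir: str) -> bool:
    """Gate for heavy CI jobs.

    The pre-checks workflow is assumed to upload a file named 'success'
    as its signal artifact; both names here are hypothetical.
    """
    return (Path(artifact_dir) / "success").is_file()

# A heavy job would bail out immediately when the signal is absent:
# if not prechecks_passed("pre-checks-signal"):
#     raise SystemExit("pre-checks failed or missing; skipping heavy job")
```
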
* [CK_TILE] fmha: Add backward pass support for padded inputs

Introduces support for padded sequence lengths in the backward pass of the variable-length flash attention (fmha_v3_varlen_bwd).
- Updated Python and C++ function signatures to accept optional `cu_seqlens_q_padded` and `cu_seqlens_k_padded` arguments.
- Modified the underlying CUDA kernels and code generation scripts to pass padding information via the new `seqlen_q_ptr` and `seqlen_k_ptr` fields in
     the CK `fmha_bwd_args` struct.
- Modified the underlying kernels and code generation scripts to correctly handle pointers for both padded and unpadded sequence data.
- Added comprehensive gradient verification to the test suite (`test_mha_varlen.py`) to ensure the correctness of the backward pass with various
     padding scenarios.

* [CK_TILE] fmha: Adapt to composable_kernel padding API changes

Refactor the FMHA forward and backward pass to align with the updated padding API in `composable_kernel`.

- Argument Simplification: Removed the manual calculation of `seqlen_q` and `seqlen_k` from `cu_seqlens` in the `mha.cu` interface. The underlying kernels now handle this logic.
- API Alignment: Updated the arguments passed to `aiter::mha_fwd` and `aiter::mha_bwd` to match the new `composable_kernel` API. This involves passing `cu_seqlen` pointers directly.
- Kernel Interface Update: Modified the `codegen.py` scripts for `gfx942` and `gfx950` to reflect the changes in the kernel's function signatures and argument handling for padded and unpadded sequence lengths.

* fix build error in op_tests/cpp/mha/benchmark_mha_*.cpp
* add num_kv_splits_indptr to mla for mtp<=4 case for now

* update

* update new kernel

* infrastructures

* 1st version of split kernel

* Fix issues raised by Lingpeng and fix the issue on batch_size

* update mla

* update mla_stage2

* 1st draft of v1 split program

* add kv_offset

* mla_splitkv_enhance_split_alg_inte

* splitkv debug

* 1st version of reduce kernel

* metadata & kernel finish

* add reduce

* final_lse is optional now.

* update kernel

* bug fix

* bug fix 1

* modify reduce api

* update kernel

* fix max splits

* bug fix 3

* fix s80 early return

* update calculation of partial_indx

* add per split test

* make lse support by ref

* test split

* fix redundant calculation of head offset in reduce kernel

* add custom test

* Add support for 128 head size

Fix how to get head count

fff

* update comments

* 1. Let large work be assigned first.
2. Add tolerance to the tile which is slightly smaller than kv_limit.
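
Assigning the largest work first is the classic longest-processing-time (LPT) greedy heuristic. A minimal host-side sketch of the policy; the names and structure are illustrative, not the actual metadata kernel:

```python
import heapq

def assign_largest_first(costs, num_clusters):
    """Longest-processing-time greedy: hand out workloads in descending
    cost order, always to the currently least-loaded cluster."""
    # (current_load, cluster_id) min-heap
    loads = [(0, c) for c in range(num_clusters)]
    heapq.heapify(loads)
    assignment = {}
    for idx in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, cluster = heapq.heappop(loads)
        assignment[idx] = cluster
        heapq.heappush(loads, (load + costs[idx], cluster))
    return assignment
```
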

* Calculate kv_limit dynamically

* Fix bug about difference in split_kv(bool)

* add test

* fix seed

* Add global tolerance 16 in kv seqlen because main kernel cannot handle small splits (kv_seqlen<4) well.

* Fix warp=1 error

* Add redundant mode to make the metadata output size fixed; add new param: no_redundant. Reduce can support redundant input in reduce_indptr as well.

fix comm

* fp8 setup

* first version of device metadata

aaa

* Add work_ptrs

* Compatibility to CUDA Graph

* Refactor code. Merge 2 iterations of generate work together.

* Make sure that each batch of workload can never be split into more than #cluster tiles.

* Adjust metadata. Get 1% perf gain.

* Parallelize most of metadata kernel

Make get_cost_top() parallelized.

aaa

* add scale

* 1. Use warp-level bitonic sort to sort batch idx based on their cost for reducing #splits. 2. Use CK's warp ops.
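
A warp-level bitonic sort runs a fixed compare-exchange network that maps directly onto warp shuffle ops (the partner index is a lane XOR). A scalar Python sketch of the same network, for illustration only; the actual kernel uses CK's warp ops:

```python
def bitonic_sort(keys):
    """Ascending bitonic sort over a power-of-two-length list, using the
    same fixed compare-exchange network a warp-level version would run."""
    n = len(keys)
    assert n & (n - 1) == 0, "bitonic networks need a power-of-two size"
    a = list(keys)
    k = 2
    while k <= n:  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:  # compare-exchange distance (a lane XOR in a warp)
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```
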

* fp8 function pass

* Fix issues:
1. avg_workload cannot handle any batch!
2. split_kv(bool) is not correct when all the clusters are full.

* fp8 ready

* fix

* persistent ready

* add nv acc test

* rename

* update metashape

* update reduce cu num

* update optest for mla

* fix cu num

* Update metadata and reduce kernels.

* rename kernels

* Add new param kv_granularity to metadata kernel.

* Introduce cal_workload_limit_global_v2

* Support qhead=128 cases.

* Change get_mla_metadata() api. Make some unimportant parameters optional through a dict.

* fix potential problem on calculating tot_qo_tiles

typo

* refactor metadata files

aaa

* update metadata v1_2

* update gqa_128 mla_ps & fix metadata v1_2

* Optimize mla metadata v1.2

* Optimize mla metadata v1.2 Part.2

* Optimize mla metadata v1.2 Part.3

* update qlen <=4

* fix mla qlen1

* Optimize mla metadata v1.2 Part.4

* Make reduce_final_map be optional in mla_reduce_v1

* Slightly increase reduce perf

* Add persistent mode for mla reduce kernel

* add mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* update deepseekv32 sparse mla metadata

* update mla_a16w8_qh16_m16x4_n16x1_coex0_mask1_ps.co

* Adjust code for sparse attn

* Optimize the a16w8 kernel

* Improve metadata v1.1 perf

* Make metadata v1.1 support sparse attn

bug fix

tiny fix

* Remove redundant code in mla_reduce

* futile struggle

* Fix issue after merge. aiter main branch is using torch.library.infer_schema, which doesn't allow dict as a parameter. Thus, change the API for metadata.

* Adjust metadata v1.1 and make this branch ready to be merged to main branch.

* remove invalid co kernel

* Fix issue brought from f794ae4 which disabled hipify by default.

* support qolen>1 for sparse mla

* make code become prettier

* Fix issue in metadata v1.1

* Fix issue in test_mla.py

* Fix lint fails

* Fix sub-test fails in op_test/test_mla.py

* Fix regression in test_mla.py where mtp>1

* Add head_dim=128 support to reduce

* Add nhead=8 for pa and add assert to make sure the input tensors are in
float32.

* fix issue in vllm benchmark for deepseek: remove metadata v0 because it's not compatible with hip graph

* fix lint

* Revert all the change about mi350 gemm.

* add a8w8 and a16w8 kernel in mla mi350

* add A8W8 Non-persistent mode kernel

* Fix issue reported by Copilot

* add mla non-persistent test

* script: update a16w8 kernel

* rm test_mla_persistent_mi350.py and support mi350 in test_mla_persistent.py

* add mla_a16w16_qh16_m16x4_n16x1_coex0_mask1_ps.co

* fix a8w8 num_kv_split=1

* Fix issue in metadata v1.2 on qo_tiles > 1

* fix ut bandwidth

* Use nhead=16 to simulate cases where nhead=16*N with N in range(32, 16*32+1, 16)

aaa

Fix regression in sparse attn from the fix in metadata v1.2 for multi qo tile issue

* Add new api get_mla_metadata_info

* fix lint format issues

* Adjust get_mla_metadata_info_v1's parameters.

* update A16W8 kernel

* update A16W8 kernel2

* update A16W8 for mi300

* fix ut and rename some kernels

* rename mla kernel name for head 128

* remove log

* fix format

* add natively back

* change zeros into empty

* fix with comments

---------

Co-authored-by: Jiming Ruan <Jiming.Ruan@amd.com>
Co-authored-by: zanzhang <zanzhang@amd.com>
Co-authored-by: Fang.Che <Fang.Che@amd.com>
Co-authored-by: minmengdie <memin@amd.com>
* workaround-retry tuning when encounter invalid pointer

* workaround-retry tuning when encounter invalid pointer

* fix lint error

* Update gemm_tuner.py

em timeout
…s when setting up triton (#1325)

* CI: Skip triton in Aiter standard and multigpu tests

* Add retries when building triton

* Add ninja installation
`hipGetDeviceProperties` is called during `torch_fp8` initialization, which triggers full HIP runtime initialization while global variables are still being constructed. There are two issues:

- Several other global variables are involved in the runtime initialization, and the initialization order of globals is not guaranteed, so the runtime may be initialized using uninitialized globals.

- A forked child process needs to initialize its own HIP runtime to get a proper GPU driver kernel context and handles. But because the parent process initialized the runtime globally, the forked child considers the runtime already initialized and uses it directly, even though it is actually invalid.

The fix is to ensure `hipGetDeviceProperties` is only called when actually needed, not during static initialization.

To repro the issue:
1. fork a child process
2. call torch.empty on the child process

It will get a `hipErrorInvalidValue` error.

Co-authored-by: Hui Zhou <huizhou@meta.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
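
The lazy-initialization pattern described above can be sketched in Python; the names are hypothetical and the real fix lives in the C++ extension:

```python
import functools

def _query_device_props():
    # Stand-in for the hipGetDeviceProperties call; returns fake data here.
    return {"gcn_arch_name": "gfx942"}

@functools.lru_cache(maxsize=1)
def device_props():
    # Deferred to first use: nothing runs at import/static-init time, so a
    # process that forks *before* any query initializes its own runtime
    # state instead of inheriting a half-built one from the parent.
    return _query_device_props()
```
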
…nning (#1315)

* update AOT, always pad x_scale when generating input, add UT

* update act_mul_mxfp4_quant, fused_rms_mxfp4_quant

* add LL FP4 configs and AOT files for TP8 shapes

* fix UT bug

* add LL TP2 and TP4 shapes
* Kernel naming: add reusable constexpr repr helper for gemm a16w16

* add missing params to the repr
junhaha666 and others added 22 commits November 23, 2025 22:25
* add ptpc deepseek ep moe tuned config

* add block deepseek ep moe tune config

* using 1stage moe for ptpc deepseek
* fuse routing kernels for small batches

* tune batch=1024
…#1295)

* Apply kernel_repr to attention kernels

Applied make_kernel_repr helper to 4 attention kernel files:
- pa_decode.py (6 kernels)
- pa_prefill.py (2 kernels)
- chunked_pa_prefill.py (1 kernel)
- mla_decode_rope.py (2 kernels)

Each kernel now has config-aware naming with constexpr parameters
included in the kernel metadata name.

Base: amd/satya/kernel_config_to_name

* Apply kernel_repr to attention kernels

* fix indentation error and add kernel_repr to a missed kernel

* Add descriptions to missing API descriptions

* remove unused imports

* fix runtime error

* revert lean atten to main

* lean atten repr and API desc

* formatting fix

* Update aiter/ops/triton/pod_attention.py
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
FP8 MQA optimizations and benchmarking script
…ng (#1366)

* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* Add missing comma

* saved modified tuned fmoe config for testing purposes

* apply black required formatting

* remove fused_moe_stage1_tkw1 and place aiter.fmoe_g1u1_tkw1 under fused_moe_1stage

* remove unnecessary arguments

* apply black formatting

* simplify aiter.fmoe_g1u1_tkw1 call

* add doweight_stage1 column to fused_moe_1stage_dict map and remove elif condition to select run_1stage=True

* add doweight_stage1 to query key

* modify elif to select run_1stage=True for tokens > 16

* apply black formatting

* removing csv and .co files as they will come in separate commit

* remove log logger.info(f"[get_2stage_cfgs] run_1stage")

---------

Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
…1446)

* CI: Move some tests to MI355 due to the network issue of TW cluster

* Modify the GPU_ARCH of sglang tests
* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: call ar naive

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: add ar switch

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Signed-off-by: zhuyuhua-v <yuhzhu@amd.com>
@zhuyuhua-v zhuyuhua-v changed the title from "[Update] update dev/perf branch to main branch commit e0fb38e and cherry-pick mla_fix, mmd/fix/fwd_v3" to "[Update] update dev/perf branch to main branch commit 9a30baf and cherry-pick mla_fix, mmd/fix/fwd_v3" Nov 23, 2025
@zhuyuhua-v zhuyuhua-v changed the title from "[Update] update dev/perf branch to main branch commit 9a30baf and cherry-pick mla_fix, mmd/fix/fwd_v3" to "[Update] update dev/perf branch to main branch commit 9c2df5 and cherry-pick mla_fix, mmd/fix/fwd_v3" Nov 23, 2025
@zhuyuhua-v zhuyuhua-v marked this pull request as ready for review November 24, 2025 09:12
@zhuyuhua-v zhuyuhua-v requested a review from wuhuikx November 24, 2025 09:13
@wuhuikx wuhuikx merged commit 5f4c65e into dev/perf Nov 24, 2025
@wuhuikx wuhuikx deleted the yuhua/dev/perf branch November 24, 2025 09:18