Address PR review comments: confirm unit test coverage#1848
Closed
Conversation
* cktile bpreshuffle && tuning code
* refine code
* refine code
* refine code
* refine
* refine
* fix merge conflict
* fix conflict
* fix CI build fail
* refine code
* align aiter interface
* refine
* add get_padded_M
---------
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>

* mdf_bf16gemm
* update
* update f4gemm
* update
* update
* f4gemm bugs fix
* f4gemm fix2
* update
* update moe 2 stages
* update codegen
* update gemm_a8w8_asm
* update
* update
* update
* update
* update
* update
* update
---------
Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>

* fix merged tuned config error
* update
---------
Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* update triton version check for pa_mqa_logits
* Add some explanation for aot-check
* Support pa_mqa_logits aot load on triton>=3.5
* Support pa_mqa_logits aot load on triton>=3.5
---------
Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>

* update codegen
* update
* update
* update
* fix
---------
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: valarLip <340077269@qq.com>

* change to merge config when used
* fix lint
* fix error in GemmTuner
* fix lint
* fix error when running deepseek
* fix lint error
* revert other format change
* fix gemm_op_a8w8.py
---------
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

* set preshuffle=False default
* fix lint
---------
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>

…1442)
* fix output/lse is nan when kseq=0
* fix gfx950 128_128 fwd_v3
* update the k_seq=0 error in MI300 and MI308
* tune a8w8_blockscale&bpreshuffle for tencent (#1444)
* tune a8w8_blockscale&bpreshuffle for tencent
  Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update a8w8_bpreshuffle_tuned_gemm.csv
  Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv
* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv
  Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* Update a8w8_blockscale_tuned_gemm.csv
* updated a8w8_blockscale_tuned_gemm_ds_v3.csv&a8w8_bpreshuffle_tuned_gemm_dsv3.csv
  Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv
  Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
---------
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
* update the smoke test
* update the smoke test
* fix MI300 and MI308 err
* fix qseq >> kseq error in MI300 and MI308
* fix qseq >> kseq error in MI355
* fix the MI300 error
---------
Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning
* add required keys to fused_moe_1stage_dict
* add kernel descriptors and code object files
* add 32x128 file descriptors and code objects for tuning
* move code objects and kernel descriptors to correct csv
* remove unnecessary import, add quant type argument
* move fused_moe_stage1_tkw1 into fused_moe.py
* remove unnecessary kernel code object files
* add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects
* create kernel descriptors and kernel co files with correct tags

* CI: Set MAX_JOBS=20 when running aiter tests
* Update aiter-test.yaml
Files: Upload experimental pa_ragged kernels and unit test.

Technical Details:
1. Added double buffering for K-cache loading.
2. Used 64 threads to load the contiguous K-cache into LDS, then distributed the data to thread registers to match the MFMA layout.
3. Turned on non-temporal loads for the KV cache.
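The double-buffering pattern in point 1 can be modeled in plain Python; this is only an illustrative sketch of the scheduling idea (all names are hypothetical, not the actual HIP kernel):

```python
def double_buffered_process(tiles):
    """Illustrative double buffering: while 'compute' consumes the tile in
    one buffer, the next tile is staged into the other buffer. On a GPU the
    global->LDS load of tile i+1 would overlap the MFMA work on tile i."""
    if not tiles:
        return 0
    buffers = [tiles[0], None]  # prologue: stage the first tile
    total = 0
    for i in range(len(tiles)):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(tiles):
            buffers[nxt] = tiles[i + 1]  # stage next tile (asynchronous on GPU)
        total += sum(buffers[cur])       # consume the current tile
    return total
```

The point of the pattern is that the load of the next K-cache tile never waits behind the compute on the current one; only the one-tile prologue is exposed latency.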
* fix: MXFP4 mantissa rounding
* fix: mantissa rounding in test_quant_mxfp4
* refactor dynamic_mxfp4_quant
* chore: format
* fix: mxfp4 quantization tests
* chore: format
* fix: mxfp4 quantization test with correct bitwidth and sign
* chore: restore DEBUG_MODE
* chore: align test_quant_mxfp4 with triton kernel
---------
Co-authored-by: lucas-santos-amd <Lucas.Santos@amd.com>

* tune token 16 on MI355
* tuned a8w8 m1~16
* tune fp8 m32-1024
* tune.py bugfix
* rm garbage code
* logits_diff warning
* black op_tests/test_moe_2stage.py
---------
Co-authored-by: zhimding <zhimding@amd.com>

* [fix]: fused_ar_rms acc error
  Signed-off-by: root <root@hjbog-srdc-24.amd.com>
* [fix]: reduce_scatter boundary check error
  Signed-off-by: root <root@hjbog-srdc-24.amd.com>
---------
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>

* update ck
* Support descale for mha_fwd
* Support descale for mha_varlen_fwd
* update ck
* update ck
* update ck
* update CK
* Fix compile error for batch prefill
---------
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>

* add fmha to prebuild
* More PREBUILD_KERNELS Option
---------
Co-authored-by: zufayu <zufayu@amd.com>

* [fix]: fused_ar_rms acc error
  Signed-off-by: root <root@hjbog-srdc-24.amd.com>
* [fix]: reduce_scatter boundary check error
  Signed-off-by: root <root@hjbog-srdc-24.amd.com>
* correct return type of fused_arnorm fake impl
---------
Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: guanbao <gyu@amd.com>

* Select correct `fp8_e4m3` type for `gfx950`
* Avoid shell command substitution in PR title
  Following GitHub Actions security recommendations available at <https://docs.github.com/en/actions/reference/security/secure-use#use-an-intermediate-environment-variable>.
Co-authored-by: perzhang <perzhang@amd.com> Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Motivation: Add a fused mrope kernel for Qwen-VL models in inference mode.

Technical Details: For the norm implementation, we use a warp-level norm method for better performance, yielding an overall ~10x uplift compared with native torch ops.
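The warp-level norm reduction mentioned above is typically a butterfly (XOR) shuffle reduction across lanes. This Python sketch only models the idea on a list of lane values; names and the RMS-norm formula are illustrative assumptions, not the actual kernel:

```python
import math

def warp_reduce_sum(lanes):
    """Butterfly (XOR) reduction: at each step every lane adds the value
    held by its partner lane (lane ^ offset). After log2(n) steps all lanes
    hold the full sum, mimicking __shfl_xor-style cross-lane exchanges."""
    n = len(lanes)            # lane count must be a power of two
    vals = list(lanes)
    offset = n // 2
    while offset:
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals[0]

def warp_rmsnorm(row, weight, eps=1e-6):
    """One 'warp' normalizes one row: reduce the sum of squares across
    lanes, then scale each element by the inverse RMS and the weight."""
    ssq = warp_reduce_sum([x * x for x in row])
    inv_rms = 1.0 / math.sqrt(ssq / len(row) + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]
```

Doing the reduction with register shuffles keeps the whole norm inside one warp, avoiding shared-memory round trips, which is where most of the speedup over separate torch ops comes from.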
* add weight preshuffling for triton fp8 blockscale gemm
* add config interface
* add x_scale shuffle
* import
* add default config for gfx942
* fix get_config return
* fix
* Added tuned configs for gemm a8w8 blockscale preshuffled
* Fixed tuned configs keys
* resolve comments
* resolve comments
* update config
* add kernel, add config, standard return_y_pp flag
* update config
* fix
* update config and UT
---------
Co-authored-by: Farel Lukas <farlukas@amd.com>

* mv fmoe tune to csrc/ck_gemm_moe_2stages_codegen
* rename fmoe tune.py to gemm_moe_tune.py
* update tune readme for more usages
* update splitK info and fix splitK error
* Apply suggestions from code review
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* fix splitK error in cktile tune and clean mp_tuner log
* fix tune hang when get result error
* fix mp_tuner error
---------
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add a8w8 fp8 tune support
* add q_dtype_w to deal with different type and refine config csv file
---------
Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>

* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1
* update
* Apply suggestions from code review
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* update format
* revert set interface change
* use gmem in opus.h to replace ck_tile::buffer_view
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* bug fix
* fix flag
* update
---------
Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Inference engines should now call paged_attention_common with the shuffled KV cache layout; aiter will internally decide between the asm and HIP kernels. HIP is more performant at lower concurrencies (< 128). A unit test has also been updated to cover the new interface. Note that shuffled scales are not supported by the HIP kernel, so calls with an int8 or fp8 KV cache are always redirected to asm.
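The dispatch rule described above can be sketched as follows. This is a minimal illustration, not the actual aiter code: the function and constant names are hypothetical, and only the < 128 concurrency cutoff and the int8/fp8 redirect come from the description.

```python
HIP_CONCURRENCY_CUTOFF = 128  # below this, the HIP kernel is faster

def select_pa_kernel(num_concurrent_seqs, kv_cache_dtype):
    """Pick the paged-attention backend the way the text describes:
    quantized KV caches always go to asm (HIP lacks shuffled-scale
    support), otherwise low concurrency prefers the HIP kernel."""
    if kv_cache_dtype in ("int8", "fp8"):
        return "asm"  # shuffled scales unsupported in HIP
    if num_concurrent_seqs < HIP_CONCURRENCY_CUTOFF:
        return "hip"
    return "asm"
```

Callers never choose a backend explicitly; they call the single paged_attention_common entry point, and a rule of this shape decides internally.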
Copilot changed the title from "[WIP] Implement API for switching between ASM and HIP kernels" to "Address PR review comments: confirm unit test coverage" on Jan 15, 2026.
Responded to the reviewer question about paged attention unit test coverage for the new `paged_attention_common` API.

Analysis Provided

`op_tests/test_pa.py` includes comprehensive tests for the new API via `run_aiter_common()`.

No Code Changes

This PR contains no code modifications, only clarification that existing test coverage validates the paged attention implementation introduced in the original commits.