
Address PR review comments: confirm unit test coverage #1848

Closed
Copilot wants to merge 1204 commits into main from copilot/sub-pr-1821

Conversation

Contributor

Copilot AI commented Jan 15, 2026

Responded to a reviewer question about paged attention unit test coverage for the new paged_attention_common API.

Analysis Provided

  • Confirmed op_tests/test_pa.py includes comprehensive tests for the new API via run_aiter_common() (a sketch of such a golden-output check follows this list)
  • Tests validate correctness against golden outputs for both unquantized and quantized KV cache paths
  • Tests cover per-tensor and per-token quantization with proper scale tensor handling for both HIP and ASM backends
  • Noted tests require ROCm/AMD GPU hardware (MI325/MI355) and run via CI pipeline
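A minimal sketch of such a golden-output check, assuming simplified shapes and dtypes (the real coverage lives in op_tests/test_pa.py; `ref_attention` and `check_against_golden` are illustrative names, not aiter APIs):

```python
import torch

def ref_attention(q, k, v, scale):
    # Dense "golden" reference: plain scaled dot-product attention in fp32.
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v

def check_against_golden(q, k, v, scale, impl_out, atol=2e-2, rtol=2e-2):
    # Compare a kernel's output against the golden reference; the real
    # test does this for unquantized and quantized KV-cache paths alike.
    golden = ref_attention(q.float(), k.float(), v.float(), scale)
    torch.testing.assert_close(impl_out.float(), golden, atol=atol, rtol=rtol)
```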

No Code Changes

This PR contains no code modifications—only clarification that existing test coverage validates the paged attention implementation introduced in the original commits.



solinzby1 and others added 30 commits November 21, 2025 18:06
* cktile bpreshuffle && tuning code

* refine code

* refine code

* refine code

* refine

* refine

* fix merge conflict

* fix conflict

* fix CI build fail

* refine code

* align aiter interface

* refine

* add get_padded_M (a sketch follows this commit block)

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
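A plausible sketch of what a `get_padded_M`-style helper computes, rounding the GEMM M dimension up to a tile multiple (the tile size of 32 is an assumption, not aiter's actual value):

```python
def get_padded_m(m: int, tile_m: int = 32) -> int:
    # Round M up to the next multiple of the kernel's M tile.
    # tile_m=32 is an assumed default, not aiter's actual tile size.
    return ((m + tile_m - 1) // tile_m) * tile_m

assert get_padded_m(100) == 128  # 100 padded up to the next multiple of 32
```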
* mdf_bf16gemm

* update

* update f4gemm

* update

* update

* f4gemm bugs fix

* f4gemm fix2

* update

* update moe 2 stages

* update codegen

* update gemm_a8w8_asm

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>
* fix merged tuned config error

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* update triton version check for pa_mqa_logits (a version-gate sketch follows this commit group)

* Add some explanation for aot-check

* Support pa_mqa_logits aot load on triton>=3.5

* Support pa_mqa_logits aot load on triton>=3.5

---------

Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
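The AOT gate reduces to a version comparison; a minimal sketch assuming the threshold is exactly Triton 3.5 (`supports_aot_load` is an illustrative name, and aiter's actual check may test more):

```python
from packaging.version import Version
import triton

def supports_aot_load() -> bool:
    # pa_mqa_logits ahead-of-time (AOT) loading is only taken on
    # Triton >= 3.5; older versions fall back to JIT compilation.
    return Version(triton.__version__) >= Version("3.5")
```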
* update codegen

* update

* update

* update

* fix

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: valarLip <340077269@qq.com>
* change to merge config when used

* fix lint

* fix error in GemmTuner

* fix lint

* fix error when running deepseek

* fix lint error

* revert other format change

* fix gemm_op_a8w8.py

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* set preshuffle=False default

* fix lint

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
…1442)

* fix output/lse is nan when kseq=0

* fix gfx950 128_128 fwd_v3

* update the k_seq=0 error in MI300 and MI308

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv & a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update the smoke test

* update the smoke test

* fix MI300 and MI308 err

* fix qseq >> kseq error MI300 and MI308

* fix qseq >> kseq error in MI355

* fix the MI300 error

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects

* create kernel descriptors and kernel co files with correct tags
* CI: Set MAX_JOBS=20 when running aiter tests

* Update aiter-test.yaml

Files: Upload experimental pa_ragged kernels and unit test
Technical Details:
1. Added double buffering for K-cache loading (a ping-pong sketch follows this list).
2. Used 64 threads to load the continuous K-cache into LDS and then distributed the data to thread registers to match the MFMA layout.
3. Turned on non-temporal loads for the KV cache.
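Double buffering here means prefetching block i+1 while block i is being consumed. A minimal Python sketch of the ping-pong pattern (in the HIP kernel the load stage targets LDS and the compute stage feeds the MFMA pipeline; `process_blocks`, `load`, and `compute` are illustrative stand-ins):

```python
# Ping-pong (double) buffering: overlap the load of block i+1 with the
# compute on block i. Here load/compute are ordinary Python callables,
# so the "overlap" is only structural, not concurrent.
def process_blocks(blocks, load, compute):
    if not blocks:
        return []
    buf = [load(blocks[0]), None]  # two buffer slots: ping and pong
    out = []
    for i in range(len(blocks)):
        if i + 1 < len(blocks):
            buf[(i + 1) % 2] = load(blocks[i + 1])  # prefetch next block
        out.append(compute(buf[i % 2]))             # consume current block
    return out
```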
* fix: MXFP4 mantissa rounding (a rounding reference sketch follows this commit group)

* fix: mantissa rounding in test_quant_mxfp4

* refactor dynamic_mxfp4_quant

* chore: format

* fix: mxfp4 quantization tests

* chore: format

* fix: mxfp4 quantization test with correct bitwidth and sign

* chore: restore DEBUG_MODE

* chore: align test_quant_mxfp4 with triton kernel

---------

Co-authored-by: lucas-santos-amd <Lucas.Santos@amd.com>
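For context on the rounding fixes: MXFP4 packs 4-bit e2m1 elements (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) under a shared power-of-two scale per block, and correct quantization rounds the 1-bit mantissa to nearest-even. A reference sketch of that rounding, assuming fp32 inputs already divided by the block scale (`quant_e2m1_rne` is an illustrative name, not aiter's dynamic_mxfp4_quant):

```python
import torch

# Non-negative magnitudes representable in FP4 e2m1, and whether each has
# an even (0) mantissa bit; exact ties round toward the even neighbor.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
EVEN = torch.tensor([True, False, True, False, True, False, True, False])

def quant_e2m1_rne(x: torch.Tensor) -> torch.Tensor:
    # In real MXFP4 this runs after dividing by the shared block scale.
    sign = torch.sign(x)
    mag = x.abs().clamp(max=6.0)
    hi = torch.searchsorted(E2M1, mag).clamp(1, len(E2M1) - 1)
    lo = hi - 1
    d_lo, d_hi = mag - E2M1[lo], E2M1[hi] - mag
    pick_hi = (d_hi < d_lo) | ((d_hi == d_lo) & ~EVEN[lo])
    return sign * torch.where(pick_hi, E2M1[hi], E2M1[lo])
```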
* tune token 16 on MI355

* tuned a8w8 m1~16

* tune fp8 m32-1024

* tune.py bugfix

* rm garbage code

* logits_diff warning

* black op_tests/test_moe_2stage.py

---------

Co-authored-by: zhimding <zhimding@amd.com>
* [fix]: fused_ar_rms acc error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: reduce_scatter boundary check error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
* update ck

* Support descale for mha_fwd (descale semantics sketched after this commit group)

* Support descale for mha_varlen_fwd

* update ck

* update ck

* update ck

* update CK

* Fix compile error for batch prefill

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
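On the descale support: FP8 attention operands carry per-tensor dequantization factors, which a fused kernel folds into the softmax logits and the PV product. A sketch of the assumed arithmetic (`descaled_sdpa` is illustrative; it is not the mha_fwd signature):

```python
import torch

def descaled_sdpa(q_fp8, k_fp8, v_fp8, descale_q, descale_k, descale_v):
    # Operands were quantized as x_fp8 = x / descale_x, so dequantization
    # multiplies by descale. A fused kernel would fold descale_q * descale_k
    # into the logits and descale_v into the PV product; here we simply
    # dequantize up front for clarity.
    q = q_fp8.float() * descale_q
    k = k_fp8.float() * descale_k
    v = v_fp8.float() * descale_v
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```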
* add fmha to prebuild

* More PREBUILD_KERNELS Option

---------

Co-authored-by: zufayu <zufayu@amd.com>
* [fix]: fused_ar_rms acc error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: reduce_scatter boundary check error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* correct return type of fused_arnorm fake impl

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: guanbao <gyu@amd.com>
* Select correct `fp8_e4m3` type for `gfx950` (a selection sketch follows this commit group)

* Avoid shell command substitution in PR title

Following GitHub Actions security recommendations available at
<https://docs.github.com/en/actions/reference/security/secure-use#use-an-intermediate-environment-variable>.
Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
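On the fp8_e4m3 fix above: gfx950 implements the OCP float8_e4m3fn encoding, while gfx94x parts use the fnuz variant. A hedged selection sketch (gcnArchName is available on ROCm builds of PyTorch; aiter's actual check may differ):

```python
import torch

def native_fp8_e4m3_dtype() -> torch.dtype:
    # gfx950 implements the OCP e4m3 format; gfx94x parts use the
    # "fnuz" variant (no negative zero, different NaN encoding).
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return torch.float8_e4m3fn if "gfx95" in arch else torch.float8_e4m3fnuz
```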
Motivation
Add an mrope fused kernel for Qwen-VL models in inference mode.

Technical Details
For the norm implementation, we use a warp-level norm method for better performance, an overall ~10x uplift compared with native torch ops.
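For reference, mrope (multimodal RoPE, as in Qwen-VL) applies rotary embeddings with separate temporal/height/width position ids over disjoint slices of the rotary dimensions. A plain-RoPE torch sketch of the building block (the section splitting and the fused warp-level norm are omitted; `apply_rope` is an illustrative name):

```python
import torch

def apply_rope(x, pos, theta=10000.0):
    # x: (..., seq, dim) with even dim; pos: (seq,) position ids.
    # Standard rotary embedding; mrope runs this with three pos-id
    # streams (temporal/height/width) over disjoint slices of dim.
    dim = x.shape[-1]
    inv_freq = theta ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    ang = pos.float()[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```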
k50112113 and others added 20 commits January 9, 2026 13:01
* add weight preshuffling for triton fp8 blockscale gemm (a dequantization reference follows this commit group)

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* update config

* add kernel, add config, standard return_y_pp flag

* update config

* fix

* update config and UT

---------

Co-authored-by: Farel Lukas <farlukas@amd.com>
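A dequantization reference for the blockscale scheme (the 128x128 weight-tile / per-row-block activation layout is assumed here; preshuffling changes only the memory order, not the math, and `blockscale_gemm_ref` is an illustrative name):

```python
import torch

def blockscale_gemm_ref(x_q, x_s, w_q, w_s, blk=128):
    # x_q: (M, K) quantized activations, x_s: (M, K//blk) per-row-block scales
    # w_q: (N, K) quantized weights,     w_s: (N//blk, K//blk) per-tile scales
    x = x_q.float() * x_s.repeat_interleave(blk, dim=1)
    w = w_q.float() * w_s.repeat_interleave(blk, dim=0).repeat_interleave(blk, dim=1)
    return x @ w.t()
```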
* mv fmoe tune to csrc/ck_gemm_moe_2stages_codegen

* rename fmoe tune.py to gemm_moe_tune.py

* update tune readme for more usages

* update splitK info and fix splitK error

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix splitK error in cktile tune and clean mp_tuner log

* fix tune hang when get result error

* fix mp_tuner error

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add a8w8 fp8 tune support

* add q_dtype_w to deal with different type and refine config csv file

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update format

* revert set interface change

* use gmem in opus.h to replace ck_tile::buffer_view

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* bug fix

* fix flag

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Inference engines should now call paged_attention_common with the shuffled KV cache layout; aiter will internally decide between the ASM and HIP kernels. HIP is more performant at lower concurrencies (< 128). A unit test has also been updated to cover the new interface.

Note that shuffled scales are not supported by the HIP kernel, so calls are always redirected to ASM whenever the KV cache is in int8 or fp8 format.
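The dispatch policy described above reduces to two rules; a sketch of exactly those stated rules (`select_pa_backend` is illustrative, not aiter's internal function):

```python
def select_pa_backend(num_concurrent_seqs: int, kv_cache_dtype: str) -> str:
    # Shuffled scales are unsupported in the HIP kernel, so quantized
    # KV caches always take the ASM path.
    if kv_cache_dtype in ("int8", "fp8"):
        return "asm"
    # HIP is faster at lower concurrency (< 128); ASM otherwise.
    return "hip" if num_concurrent_seqs < 128 else "asm"
```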
Copilot AI changed the title from "[WIP] Implement API for switching between ASM and HIP kernels" to "Address PR review comments: confirm unit test coverage" on Jan 15, 2026
Copilot AI requested a review from fsx950223 January 15, 2026 04:17
Base automatically changed from common_hip_asm_pa_inerface to main January 16, 2026 02:53
valarLip closed this Mar 18, 2026