
Address PR review comments: confirm unit test coverage #1848

Closed
Copilot wants to merge 1204 commits into main from copilot/sub-pr-1821

Conversation

Contributor

Copilot AI commented Jan 15, 2026

Responded to a reviewer question about paged attention unit test coverage for the new paged_attention_common API.

Analysis Provided

  • Confirmed op_tests/test_pa.py includes comprehensive tests for the new API via run_aiter_common() (a sketch of such a golden-output check follows this list)
  • Tests validate correctness against golden outputs for both unquantized and quantized KV cache paths
  • Tests cover per-tensor and per-token quantization with proper scale tensor handling for both HIP and ASM backends
  • Noted tests require ROCm/AMD GPU hardware (MI325/MI355) and run via CI pipeline
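A minimal sketch of such a golden-output check, assuming simplified shapes and dtypes (the real coverage lives in op_tests/test_pa.py; `ref_attention` and `check_against_golden` are illustrative names, not aiter APIs):

```python
import torch

def ref_attention(q, k, v, scale):
    # Dense "golden" reference: plain scaled dot-product attention in fp32.
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v

def check_against_golden(q, k, v, scale, impl_out, atol=2e-2, rtol=2e-2):
    # Compare a kernel's output against the golden reference; the real
    # test does this for unquantized and quantized KV-cache paths alike.
    golden = ref_attention(q.float(), k.float(), v.float(), scale)
    torch.testing.assert_close(impl_out.float(), golden, atol=atol, rtol=rtol)
```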

No Code Changes

This PR contains no code modifications—only clarification that existing test coverage validates the paged attention implementation introduced in the original commits.



solinzby1 and others added 30 commits November 21, 2025 18:06
* cktile bpreshuffle && tuning code

* refine code

* refine code

* refine code

* refine

* refine

* fix merge conflict

* fix conflict

* fix CI build fail

* refine code

* align aiter interface

* refine

* add get_padded_M (a sketch follows this commit block)

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
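A plausible sketch of what a `get_padded_M`-style helper computes, rounding the GEMM M dimension up to a tile multiple (the tile size of 32 is an assumption, not aiter's actual value):

```python
def get_padded_m(m: int, tile_m: int = 32) -> int:
    # Round M up to the next multiple of the kernel's M tile.
    # tile_m=32 is an assumed default, not aiter's actual tile size.
    return ((m + tile_m - 1) // tile_m) * tile_m

assert get_padded_m(100) == 128  # 100 padded up to the next multiple of 32
```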
* mdf_bf16gemm

* update

* update f4gemm

* update

* update

* f4gemm bugs fix

* f4gemm fix2

* update

* update moe 2 stages

* update codegen

* update gemm_a8w8_asm

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Ying Zhou <Ying.Zhou2@amd.com>
Co-authored-by: amd-ruitang3 <Rui.Tang2@.com>
* fix merged tuned config error

* update

---------

Co-authored-by: valarLip <340077269@qq.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* update triton version check for pa_mqa_logits (a version-gate sketch follows this commit group)

* Add some explanation for aot-check

* Support pa_mqa_logits aot load on triton>=3.5

* Support pa_mqa_logits aot load on triton>=3.5

---------

Co-authored-by: Feng Shijie <Shijie.Feng@amd.com>
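The AOT gate reduces to a version comparison; a minimal sketch assuming the threshold is exactly Triton 3.5 (`supports_aot_load` is an illustrative name, and aiter's actual check may test more):

```python
from packaging.version import Version
import triton

def supports_aot_load() -> bool:
    # pa_mqa_logits ahead-of-time (AOT) loading is only taken on
    # Triton >= 3.5; older versions fall back to JIT compilation.
    return Version(triton.__version__) >= Version("3.5")
```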
* update codegen

* update

* update

* update

* fix

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: valarLip <340077269@qq.com>
* change to merge config when used

* fix lint

* fix error in GemmTuner

* fix lint

* fix error when running deepseek

* fix lint error

* revert other format change

* fix gemm_op_a8w8.py

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* set preshuffle=False default

* fix lint

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
…1442)

* fix output/lse is nan when kseq=0

* fix gfx950 128_128 fwd_v3

* update the k_seq=0 error in MI300 and MI308

* tune a8w8_blockscale&bpreshuffle for tencent (#1444)

* tune a8w8_blockscale&bpreshuffle for tencent

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_bpreshuffle_tuned_gemm.csv

* update aiter/configs/a8w8_bpreshuffle_tuned_gemm.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* Update a8w8_blockscale_tuned_gemm.csv

* updated a8w8_blockscale_tuned_gemm_ds_v3.csv & a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update aiter/configs/model_configs/a8w8_bpreshuffle_tuned_gemm_dsv3.csv

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>

* update the smoke test

* update the smoke test

* fix MI300 and MI308 err

* fix qseq >> kseq error MI300 and MI308

* fix qseq >> kseq error in MI355

* fix the MI300 error

---------

Signed-off-by: LJ-underdog <Jun.Lin@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
Co-authored-by: Linjun-AMD <Jun.Lin@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* redirect asm_moe_tkw1 call to fused_moe in order to force kernel tuning

* add required keys to fused_moe_1stage_dict

* add kernel descriptors and code object files

* add 32x128 file descriptors and code objects for tuning

* move code objects and kernel descriptors to correct csv

* remove unnecessary import, add quant type argument

* move fused_moe_stage1_tkw1 into fused_moe.py

* remove unnecessary kernel code object files

* add kernel descriptor to tuned fmoe config, add kernel descriptors to related csvs and add related code objects

* create kernel descriptors and kernel co files with correct tags
* CI: Set MAX_JOBS=20 when running aiter tests

* Update aiter-test.yaml

Files: Upload experimental pa_ragged kernels and unit test
Technical Details:
1. Added double buffering for K-cache loading (a ping-pong sketch follows this list).
2. Used 64 threads to load the continuous K-cache into LDS and then distributed the data to thread registers to match the MFMA layout.
3. Turned on non-temporal loads for the KV cache.
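Double buffering here means prefetching block i+1 while block i is being consumed. A minimal Python sketch of the ping-pong pattern (in the HIP kernel the load stage targets LDS and the compute stage feeds the MFMA pipeline; `process_blocks`, `load`, and `compute` are illustrative stand-ins):

```python
# Ping-pong (double) buffering: overlap the load of block i+1 with the
# compute on block i. Here load/compute are ordinary Python callables,
# so the "overlap" is only structural, not concurrent.
def process_blocks(blocks, load, compute):
    if not blocks:
        return []
    buf = [load(blocks[0]), None]  # two buffer slots: ping and pong
    out = []
    for i in range(len(blocks)):
        if i + 1 < len(blocks):
            buf[(i + 1) % 2] = load(blocks[i + 1])  # prefetch next block
        out.append(compute(buf[i % 2]))             # consume current block
    return out
```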
* fix: MXFP4 mantissa rounding (a rounding reference sketch follows this commit group)

* fix: mantissa rounding in test_quant_mxfp4

* refactor dynamic_mxfp4_quant

* chore: format

* fix: mxfp4 quantization tests

* chore: format

* fix: mxfp4 quantization test with correct bitwidth and sign

* chore: restore DEBUG_MODE

* chore: align test_quant_mxfp4 with triton kernel

---------

Co-authored-by: lucas-santos-amd <Lucas.Santos@amd.com>
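For context on the rounding fixes: MXFP4 packs 4-bit e2m1 elements (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) under a shared power-of-two scale per block, and correct quantization rounds the 1-bit mantissa to nearest-even. A reference sketch of that rounding, assuming fp32 inputs already divided by the block scale (`quant_e2m1_rne` is an illustrative name, not aiter's dynamic_mxfp4_quant):

```python
import torch

# Non-negative magnitudes representable in FP4 e2m1, and whether each has
# an even (0) mantissa bit; exact ties round toward the even neighbor.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
EVEN = torch.tensor([True, False, True, False, True, False, True, False])

def quant_e2m1_rne(x: torch.Tensor) -> torch.Tensor:
    # In real MXFP4 this runs after dividing by the shared block scale.
    sign = torch.sign(x)
    mag = x.abs().clamp(max=6.0)
    hi = torch.searchsorted(E2M1, mag).clamp(1, len(E2M1) - 1)
    lo = hi - 1
    d_lo, d_hi = mag - E2M1[lo], E2M1[hi] - mag
    pick_hi = (d_hi < d_lo) | ((d_hi == d_lo) & ~EVEN[lo])
    return sign * torch.where(pick_hi, E2M1[hi], E2M1[lo])
```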
* tune token 16 on MI355

* tuned a8w8 m1~16

* tune fp8 m32-1024

* tune.py bugfix

* rm garbage code

* logits_diff warning

* black op_tests/test_moe_2stage.py

---------

Co-authored-by: zhimding <zhimding@amd.com>
* [fix]: fused_ar_rms acc error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: reduce_scatter boundary check error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
* update ck

* Support descale for mha_fwd (descale semantics sketched after this commit group)

* Support descale for mha_varlen_fwd

* update ck

* update ck

* update ck

* update CK

* Fix compile error for batch prefill

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
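On the descale support: FP8 attention operands carry per-tensor dequantization factors, which a fused kernel folds into the softmax logits and the PV product. A sketch of the assumed arithmetic (`descaled_sdpa` is illustrative; it is not the mha_fwd signature):

```python
import torch

def descaled_sdpa(q_fp8, k_fp8, v_fp8, descale_q, descale_k, descale_v):
    # Operands were quantized as x_fp8 = x / descale_x, so dequantization
    # multiplies by descale. A fused kernel would fold descale_q * descale_k
    # into the logits and descale_v into the PV product; here we simply
    # dequantize up front for clarity.
    q = q_fp8.float() * descale_q
    k = k_fp8.float() * descale_k
    v = v_fp8.float() * descale_v
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```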
* add fmha to prebuild

* More PREBUILD_KERNELS Option

---------

Co-authored-by: zufayu <zufayu@amd.com>
* [fix]: fused_ar_rms acc error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* [fix]: reduce_scatter boundary check error

Signed-off-by: root <root@hjbog-srdc-24.amd.com>

* correct return type of fused_arnorm fake impl

---------

Signed-off-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: root <root@hjbog-srdc-24.amd.com>
Co-authored-by: guanbao <gyu@amd.com>
* Select correct `fp8_e4m3` type for `gfx950` (a selection sketch follows this commit group)

* Avoid shell command substitution in PR title

Following GitHub Actions security recommendations available at
<https://docs.github.com/en/actions/reference/security/secure-use#use-an-intermediate-environment-variable>.
Co-authored-by: perzhang <perzhang@amd.com>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
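On the fp8_e4m3 fix above: gfx950 implements the OCP float8_e4m3fn encoding, while gfx94x parts use the fnuz variant. A hedged selection sketch (gcnArchName is available on ROCm builds of PyTorch; aiter's actual check may differ):

```python
import torch

def native_fp8_e4m3_dtype() -> torch.dtype:
    # gfx950 implements the OCP e4m3 format; gfx94x parts use the
    # "fnuz" variant (no negative zero, different NaN encoding).
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return torch.float8_e4m3fn if "gfx95" in arch else torch.float8_e4m3fnuz
```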
Motivation
Add an mrope fused kernel for Qwen-VL models in inference mode.

Technical Details
For the norm implementation, we use a warp-level norm method for better performance, an overall ~10x uplift compared with native torch ops.
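For reference, mrope (multimodal RoPE, as in Qwen-VL) applies rotary embeddings with separate temporal/height/width position ids over disjoint slices of the rotary dimensions. A plain-RoPE torch sketch of the building block (the section splitting and the fused warp-level norm are omitted; `apply_rope` is an illustrative name):

```python
import torch

def apply_rope(x, pos, theta=10000.0):
    # x: (..., seq, dim) with even dim; pos: (seq,) position ids.
    # Standard rotary embedding; mrope runs this with three pos-id
    # streams (temporal/height/width) over disjoint slices of dim.
    dim = x.shape[-1]
    inv_freq = theta ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    ang = pos.float()[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```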
k50112113 and others added 20 commits January 9, 2026 13:01
* add weight preshuffling for triton fp8 blockscale gemm (a dequantization reference follows this commit group)

* add config interface

* add x_scale shuffle

* import

* add default config for gfx942

* fix get_config return

* fix

* Added tuned configs for gemm a8w8 blockscale preshuffled

* Fixed tuned configs keys

* resolve comments

* resolve comments

* update config

* add kernel, add config, standard return_y_pp flag

* update config

* fix

* update config and UT

---------

Co-authored-by: Farel Lukas <farlukas@amd.com>
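A dequantization reference for the blockscale scheme (the 128x128 weight-tile / per-row-block activation layout is assumed here; preshuffling changes only the memory order, not the math, and `blockscale_gemm_ref` is an illustrative name):

```python
import torch

def blockscale_gemm_ref(x_q, x_s, w_q, w_s, blk=128):
    # x_q: (M, K) quantized activations, x_s: (M, K//blk) per-row-block scales
    # w_q: (N, K) quantized weights,     w_s: (N//blk, K//blk) per-tile scales
    x = x_q.float() * x_s.repeat_interleave(blk, dim=1)
    w = w_q.float() * w_s.repeat_interleave(blk, dim=0).repeat_interleave(blk, dim=1)
    return x @ w.t()
```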
* mv fmoe tune to csrc/ck_gemm_moe_2stages_codegen

* rename fmoe tune.py to gemm_moe_tune.py

* update tune readme for more usages

* update splitK info and fix splitK error

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix splitK error in cktile tune and clean mp_tuner log

* fix tune hang when get result error

* fix mp_tuner error

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add a8w8 fp8 tune support

* add q_dtype_w to deal with different type and refine config csv file

---------

Co-authored-by: solin <bingzhou@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
* fix fuse_qk_rope_concat_and_cache_mla in rocm-6.4.1

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update format

* revert set interface change

* use gmem in opus.h to replace ck_tile::buffer_view

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* bug fix

* fix flag

* update

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Inference engines should now call paged_attention_common with the shuffled KV cache layout; aiter will internally decide between the ASM and HIP kernels. HIP is more performant at lower concurrencies (< 128). A unit test has also been updated to cover the new interface.

Note that shuffled scales are not supported by the HIP kernel, so calls are always redirected to ASM whenever the KV cache is in int8 or fp8 format.
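The dispatch policy described above reduces to two rules; a sketch of exactly those stated rules (`select_pa_backend` is illustrative, not aiter's internal function):

```python
def select_pa_backend(num_concurrent_seqs: int, kv_cache_dtype: str) -> str:
    # Shuffled scales are unsupported in the HIP kernel, so quantized
    # KV caches always take the ASM path.
    if kv_cache_dtype in ("int8", "fp8"):
        return "asm"
    # HIP is faster at lower concurrency (< 128); ASM otherwise.
    return "hip" if num_concurrent_seqs < 128 else "asm"
```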
Copilot AI changed the title from "[WIP] Implement API for switching between ASM and HIP kernels" to "Address PR review comments: confirm unit test coverage" on Jan 15, 2026
Copilot AI requested a review from fsx950223 January 15, 2026 04:17
Base automatically changed from common_hip_asm_pa_inerface to main January 16, 2026 02:53
valarLip closed this Mar 18, 2026