Jacchang/pa ragged experimental by Jacob0226 · Pull Request #1479 · ROCm/aiter

Jacob0226 · 2025-11-24T10:21:16Z

Motivation

The decode kernel _paged_attention_kernel in aiter/csrc/cpp_itfs/pa/pa_kernels.cuh takes a growing share of total runtime as the batch size scales from 8 to 512, rising from 4.8% to 11%.
This PR currently supports only head_size = 128 and kv_dtype = bf16.

Technical Details

This work introduces the following changes:

Added double buffering for K-cache loading.
Used 64 threads to load the contiguous K-cache into LDS, then distributed the data to thread registers to match the MFMA layout.
Turned on non-temporal loads for the KV cache.
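
The double-buffering idea in the first item can be sketched on the host side. This is only an illustrative Python model, not the HIP kernel from this PR: the names load_tile and qk_scores_double_buffered are hypothetical, and the "load" here is a plain slice, whereas on the GPU it would be an asynchronous global-memory-to-LDS copy that overlaps with the MFMA work.

```python
# Illustrative host-side model of double buffering (NOT the HIP kernel).
# All function and variable names here are hypothetical.


def load_tile(k_cache, tile_idx, tile_size):
    """Stand-in for the global-memory -> LDS copy of one K tile."""
    start = tile_idx * tile_size
    return k_cache[start:start + tile_size]


def qk_scores_double_buffered(q, k_cache, tile_size):
    """Compute dot(q, k) for every key while keeping two tile buffers in use."""
    num_tiles = (len(k_cache) + tile_size - 1) // tile_size
    scores = []
    if num_tiles == 0:
        return scores

    buffers = [None, None]
    buffers[0] = load_tile(k_cache, 0, tile_size)  # prologue: fill buffer 0

    for t in range(num_tiles):
        cur = t % 2
        nxt = 1 - cur
        # Issue the load for tile t+1 before consuming tile t. On a GPU this
        # load would be in flight while the math below executes.
        if t + 1 < num_tiles:
            buffers[nxt] = load_tile(k_cache, t + 1, tile_size)
        for k in buffers[cur]:
            scores.append(sum(qi * ki for qi, ki in zip(q, k)))
    return scores


q = [1.0, 2.0]
k_cache = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, 0.0]]
print(qk_scores_double_buffered(q, k_cache, tile_size=2))
# prints [1.0, 2.0, 3.0, 6.0, 3.0]
```

The point of the two buffers is that the compute loop never reads the buffer that is currently being filled, so the next tile's load latency can be hidden behind the current tile's math.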

Test Plan

Unit test: aiter/op_tests/test_pa_ragged_experimental.py
This script differs from the original test_pa_ragged.py. The new test reproduces the same KV-cache layout used in SGLang, making it more realistic. It also performs numerical verification against the original kernel (referred to as GOLDEN) to ensure correctness. The new implementation is labeled EXPERIMENTAL.
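
A GOLDEN-versus-EXPERIMENTAL check of this kind could look roughly like the sketch below. It is not taken from test_pa_ragged_experimental.py: the helper name compare_outputs and the tolerance values are assumptions, chosen loosely because bf16 carries only about 8 bits of mantissa.

```python
import math


def compare_outputs(golden, experimental, rtol=1e-2, atol=1e-2):
    """Return (ok, max_abs_err) for two flat lists of floats.

    Hypothetical helper; the tolerances are assumed, not the test's values.
    """
    if len(golden) != len(experimental):
        raise ValueError("output shapes differ")
    ok = True
    max_err = 0.0
    for g, e in zip(golden, experimental):
        err = abs(g - e)
        if math.isnan(err) or err > atol + rtol * abs(g):
            ok = False  # NaN or out-of-tolerance element
        elif err > max_err:
            max_err = err
    return ok, max_err


golden = [0.5, -1.25, 2.0]          # stand-in for the original kernel's output
experimental = [0.5005, -1.251, 2.0]  # stand-in for the new kernel's output
ok, max_err = compare_outputs(golden, experimental)
print("match" if ok else "MISMATCH", f"max_abs_err={max_err:.4g}")
```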

# Docker: lmsysorg/sglang:v0.5.5.post3-rocm700-mi35x
export BS=512        # batch size
export CL=2048       # context length (prefill length)
export PageSize=1
export OUT=UT
rocprof-compute profile -n $OUT -- python aiter/op_tests/test_pa_ragged_experimental.py -n $BS -c $CL --warmup 3 --page-size $PageSize
rocprof-compute analyze -p workloads/$OUT/MI355/ --list-stats > tmp.log
# Find the dispatch IDs of kernel paged_attention_ll4mi_QKV_mfma16_kernel in tmp.log.
# The second-to-last matched dispatch ID is GOLDEN.
# The last matched dispatch ID is EXPERIMENTAL.

Accuracy test:
sglang/benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 200 --num-shots 8

Test Result

Node: MI355X (smci355-ccs-aus-n10-09)

Unit test:
GOLDEN: 100 us
EXPERIMENTAL: 88 us
Speedup: 1.13x
Accuracy test: GSM8K accuracy does not drop with this PR.
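
For reference, the speedup follows directly from the two unit-test timings above. The short sketch below only redoes that arithmetic; the variable names are illustrative. With the timings as reported (which are presumably rounded), the ratio comes out near 1.136.

```python
golden_us = 100.0        # GOLDEN kernel time from the unit test
experimental_us = 88.0   # EXPERIMENTAL kernel time from the unit test

speedup = golden_us / experimental_us
time_saved_pct = (1.0 - experimental_us / golden_us) * 100.0

print(f"speedup: {speedup:.3f}x")
print(f"kernel time reduced by {time_saved_pct:.0f}%")
```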

Submission Checklist

[V] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

fsx950223 · 2025-11-25T03:31:32Z

Format code.

aiter/.githooks/pre-commit

Line 75 in 1f5b378

echo "black $file"

csrc/cpp_itfs/pa/pa_kernels.cuh

Jacob0226 · 2025-11-26T06:22:41Z

@fsx950223 Ready for re-review

Jacob0226 added 3 commits November 21, 2025 00:41

Upload experimental pa_ragged kernels and unit test

d8f094f

Merge branch 'main' of https://github.com/ROCm/aiter into jacchang/pa-ragged-experimental

20e7cab

Set requirements for using EXPERIMENTAL kernel. Update comment in the kernel.

8730a2b

valarLip requested a review from fsx950223 November 25, 2025 02:05

Format the files using black

e44d5c3

fsx950223 reviewed Nov 25, 2025

csrc/cpp_itfs/pa/pa_kernels.cuh (outdated, resolved)

Jacob0226 added 2 commits November 25, 2025 03:26

Revert NT_KV_LOAD to false

c0e968d

Merge branch 'main' of https://github.com/ROCm/aiter into jacchang/pa-ragged-experimental

d4a3426

fsx950223 approved these changes Nov 26, 2025

Merge branch 'main' into jacchang/pa-ragged-experimental

0f8d4df

Jacob0226 merged commit 8e4d703 into ROCm:main Nov 26, 2025
15 checks passed

Jacob0226 merged 7 commits into ROCm:main from Jacob0226:jacchang/pa-ragged-experimental