Jacchang/pa ragged experimental #1479

Merged
Jacob0226 merged 7 commits into ROCm:main from Jacob0226:jacchang/pa-ragged-experimental
Nov 26, 2025

@Jacob0226 Jacob0226 commented Nov 24, 2025

Motivation

The decode kernel _paged_attention_kernel in aiter/csrc/cpp_itfs/pa/pa_kernels.cuh takes a growing share of total runtime as the batch size scales from 8 to 512, rising from 4.8% to 11%.
This PR currently supports only head_size = 128 and kv_dtype = bf16.

Technical Details

This work introduces the following changes:

  1. Added double buffering for K-cache loading.
  2. Used 64 threads to load the contiguous K-cache into LDS, then distributed the data to thread registers to match the MFMA layout.
  3. Enabled non-temporal loads for the KV cache.
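The double-buffering idea in item 1 can be illustrated with a small CPU-side model. This is only a sketch of the scheduling pattern, not the kernel code from this PR: the function names, the tile size, and the use of plain Python lists are all assumptions made for illustration. On the GPU the prefetch would be issued asynchronously so that it overlaps with the compute on the current tile; here the two steps simply run in the same order.

```python
# Illustrative CPU-side model of double buffering (hypothetical names,
# not the HIP kernel from this PR).

def load_tile(k_cache, tile_idx, tile_size):
    """Stand-in for a global-memory load: copy one tile of K rows."""
    start = tile_idx * tile_size
    return [row[:] for row in k_cache[start:start + tile_size]]


def qk_scores_double_buffered(q, k_cache, tile_size):
    """Compute q.k for every cached key, prefetching tile i+1 while using tile i."""
    num_tiles = (len(k_cache) + tile_size - 1) // tile_size
    scores = []
    if num_tiles == 0:
        return scores
    buffers = [None, None]  # two staging buffers, used alternately
    buffers[0] = load_tile(k_cache, 0, tile_size)  # prologue: fill the first buffer
    for i in range(num_tiles):
        cur = i % 2
        nxt = 1 - cur
        if i + 1 < num_tiles:
            # On a GPU this load would overlap with the compute loop below.
            buffers[nxt] = load_tile(k_cache, i + 1, tile_size)
        for k_row in buffers[cur]:
            scores.append(sum(a * b for a, b in zip(q, k_row)))
    return scores


def qk_scores_reference(q, k_cache):
    """Unbuffered reference: one dot product per cached key."""
    return [sum(a * b for a, b in zip(q, k_row)) for k_row in k_cache]


if __name__ == "__main__":
    q = [1.0, 2.0, 0.5, -1.0]
    k_cache = [[float(t + d) for d in range(4)] for t in range(10)]
    assert qk_scores_double_buffered(q, k_cache, tile_size=4) == qk_scores_reference(q, k_cache)
```

The point of the two buffers is that the load for the next tile never writes into the buffer the compute loop is still reading, which is what lets the two overlap on real hardware.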

Test Plan

  • Unit test: aiter/op_tests/test_pa_ragged_experimental.py
    This script differs from the original test_pa_ragged.py. The new test reproduces the same KV-cache layout used in SGLang, making it more realistic. It also performs numerical verification against the original kernel (referred to as GOLDEN) to ensure correctness. The new implementation is labeled EXPERIMENTAL.
# Docker: lmsysorg/sglang:v0.5.5.post3-rocm700-mi35x
export BS=512
export CL=2048 # context length (prefill length)
export PageSize=1
export OUT=UT
rocprof-compute profile -n "$OUT" -- python aiter/op_tests/test_pa_ragged_experimental.py -n "$BS" -c "$CL" --warmup 3 --page-size "$PageSize"
rocprof-compute analyze -p "workloads/$OUT/MI355/" --list-stats > tmp.log
# Find the dispatch IDs of kernel paged_attention_ll4mi_QKV_mfma16_kernel for GOLDEN and EXPERIMENTAL
# The second-to-last matched dispatch ID is GOLDEN
# The last matched dispatch ID is EXPERIMENTAL
  • Accuracy test:
    sglang/benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 200 --num-shots 8
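The GOLDEN versus EXPERIMENTAL check described for the unit test can be sketched in plain Python. This is a minimal illustration, not the test script from this PR: it assumes a CSR-style ragged layout with page size 1, and the names kv_indptr, kv_indices, gather_kv, and max_golden_vs_experimental_diff are hypothetical. A two-pass softmax stands in for the GOLDEN kernel and a streaming (online-softmax) version stands in for EXPERIMENTAL; the tiny head size and sequence lengths are chosen only to keep the example short.

```python
import math
import random


def gather_kv(pool, kv_indptr, kv_indices, seq):
    """Collect one sequence's rows from the shared pool (page size 1)."""
    start, end = kv_indptr[seq], kv_indptr[seq + 1]
    return [pool[kv_indices[i]] for i in range(start, end)]


def attention_golden(q, keys, values):
    """Reference single-query attention with a two-pass softmax."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def attention_experimental(q, keys, values):
    """Second implementation of the same math using online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, denom = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        logit = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, logit)
        correction = math.exp(running_max - new_max)  # 0.0 on the first step
        w = math.exp(logit - new_max)
        denom = denom * correction + w
        acc = [a * correction + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


def max_golden_vs_experimental_diff(seed=0, head_size=8, seq_lens=(3, 5, 2), num_slots=16):
    """Build a small ragged cache and return the largest absolute output difference."""
    rng = random.Random(seed)
    k_pool = [[rng.uniform(-1, 1) for _ in range(head_size)] for _ in range(num_slots)]
    v_pool = [[rng.uniform(-1, 1) for _ in range(head_size)] for _ in range(num_slots)]
    kv_indptr = [0]
    for n in seq_lens:
        kv_indptr.append(kv_indptr[-1] + n)
    # Each sequence's tokens live in scattered, non-contiguous pool slots.
    kv_indices = rng.sample(range(num_slots), kv_indptr[-1])
    worst = 0.0
    for seq in range(len(seq_lens)):
        q = [rng.uniform(-1, 1) for _ in range(head_size)]
        keys = gather_kv(k_pool, kv_indptr, kv_indices, seq)
        values = gather_kv(v_pool, kv_indptr, kv_indices, seq)
        golden = attention_golden(q, keys, values)
        experimental = attention_experimental(q, keys, values)
        worst = max(worst, max(abs(g - e) for g, e in zip(golden, experimental)))
    return worst


if __name__ == "__main__":
    assert max_golden_vs_experimental_diff() < 1e-9
```

Comparing against a tolerance rather than for exact equality matters here because the two implementations accumulate in a different order, so their floating-point results can differ in the last bits even when both are correct.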

Test Result

Node: MI355X (smci355-ccs-aus-n10-09)

  • Unit test:
    GOLDEN: 100 us
    EXPERIMENTAL: 88 us
    Speedup: 1.13x
  • Accuracy test: GSM8K accuracy does not drop with this PR.

Submission Checklist

@valarLip valarLip requested a review from fsx950223 November 25, 2025 02:05
@fsx950223
Contributor

Format code.

echo "black $file"

@Jacob0226
Contributor Author

@fsx950223 Ready for re-review

@Jacob0226 Jacob0226 merged commit 8e4d703 into ROCm:main Nov 26, 2025
15 checks passed
nsusanto pushed a commit that referenced this pull request Dec 4, 2025
Files: Upload experimental pa_ragged kernels and unit test
Technical Details:
1. Added double buffering for K-cache loading.
2. Used 64 threads to load the continuous K-cache into LDS and then distributed the data to thread registers to match the MFMA layout.
3. Turn on non-temporal loads for KV cache.
zhuyuhua-v pushed a commit that referenced this pull request Dec 17, 2025
valarLip pushed a commit that referenced this pull request Mar 18, 2026