Merged
Format code. Line 75 in 1f5b378
fsx950223 reviewed on Nov 25, 2025
@fsx950223 Ready for re-review
fsx950223 approved these changes on Nov 26, 2025
nsusanto pushed a commit that referenced this pull request on Dec 4, 2025
Files: Upload experimental pa_ragged kernels and unit test
Technical Details:
1. Added double buffering for K-cache loading.
2. Used 64 threads to load the continuous K-cache into LDS and then distributed the data to thread registers to match the MFMA layout.
3. Turned on non-temporal loads for KV cache.
zhuyuhua-v pushed a commit that referenced this pull request on Dec 17, 2025
valarLip pushed a commit that referenced this pull request on Mar 18, 2026
Motivation
The decode kernel _paged_attention_kernel in aiter/csrc/cpp_itfs/pa/pa_kernels.cuh takes a growing share of total runtime as the batch size scales from 8 to 512, rising from 4.8% to 11%.
This PR currently supports only head_size = 128 and kv_dtype = bf16.
Technical Details
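This work introduces the following changes:
1. Added double buffering for K-cache loading.
2. Used 64 threads to load the continuous K-cache into LDS and then distributed the data to thread registers to match the MFMA layout.
3. Turned on non-temporal loads for KV cache.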
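The double buffering used for K-cache loading can be pictured as a ping-pong between two staging buffers: the load of tile t+1 is issued before tile t is consumed. The sketch below is only a schematic of that control flow in plain Python, not the kernel's code. The names load_tile and qk_scores_double_buffered, the tile size, and the list-of-lists K-cache are all hypothetical, and Python cannot reproduce the actual overlap of memory traffic with compute that the HIP kernel relies on.

```python
# Schematic of double buffering (ping-pong) over K-cache tiles.
# Assumption: this is a plain-Python stand-in for the control flow only.
# The real kernel is HIP C++ and overlaps the load with compute in hardware.

def load_tile(k_cache, tile_idx, tile_size):
    """Hypothetical helper: copy one tile of K rows into a staging buffer."""
    start = tile_idx * tile_size
    return [row[:] for row in k_cache[start:start + tile_size]]


def qk_scores_double_buffered(q, k_cache, tile_size):
    """Compute dot(q, k) for every K row, walking tiles through two buffers."""
    num_tiles = (len(k_cache) + tile_size - 1) // tile_size
    scores = []
    if num_tiles == 0:
        return scores

    buffers = [None, None]
    buffers[0] = load_tile(k_cache, 0, tile_size)  # prologue: fill buffer 0

    for t in range(num_tiles):
        cur, nxt = t % 2, (t + 1) % 2
        if t + 1 < num_tiles:
            # Issue the load for the next tile before consuming the current one.
            buffers[nxt] = load_tile(k_cache, t + 1, tile_size)
        for k_row in buffers[cur]:
            scores.append(sum(a * b for a, b in zip(q, k_row)))
    return scores


q = [1.0, 2.0]
k_cache = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
print(qk_scores_double_buffered(q, k_cache, tile_size=2))
```

The result is the same as a single pass over all K rows; the two-buffer structure changes only when each load is issued relative to the compute that uses it.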
Test Plan
aiter/op_tests/test_pa_ragged_experimental.py
This script differs from the original test_pa_ragged.py. The new test reproduces the same KV-cache layout used in SGLang, making it more realistic. It also performs numerical verification against the original kernel (referred to as GOLDEN) to ensure correctness. The new implementation is labeled EXPERIMENTAL.
sglang/benchmark/gsm8k/bench_sglang.py --num-questions 200 --parallel 200 --num-shots 8
Test Result
Node: MI355X (smci355-ccs-aus-n10-09)
GOLDEN: 100 us
EXPERIMENTAL: 88 us.
Speedup: 1.13x.
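The numerical verification against GOLDEN mentioned in the Test Plan amounts to an element-wise comparison of the two kernels' outputs under a tolerance. The following is a minimal sketch of such a check, not the code in test_pa_ragged_experimental.py. The function names and the rtol/atol values are assumptions chosen for illustration (bf16 keeps roughly three significant decimal digits), and the sample values are made up.

```python
import math

# Assumption: outputs are flattened to plain lists of floats for this sketch.
# The tolerances below are illustrative, not the ones used by the actual test.

def max_abs_diff(golden, experimental):
    """Largest element-wise absolute difference between the two outputs."""
    return max((abs(g - e) for g, e in zip(golden, experimental)), default=0.0)


def outputs_match(golden, experimental, rtol=1e-2, atol=1e-2):
    """True if every EXPERIMENTAL element is close to its GOLDEN counterpart."""
    if len(golden) != len(experimental):
        return False
    return all(
        math.isclose(e, g, rel_tol=rtol, abs_tol=atol)
        for g, e in zip(golden, experimental)
    )


golden = [0.10, -1.25, 3.00]          # made-up reference values
experimental = [0.101, -1.249, 3.02]  # made-up values within tolerance
print(outputs_match(golden, experimental), max_abs_diff(golden, experimental))
```

Reporting the maximum absolute difference alongside the pass/fail result makes it easier to judge how close a failing run is to the tolerance.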
Submission Checklist