
[FMHA] Support Vectorized KV Cache Layout and vLLM/SGLang block table in Batch Prefill kernel#1754

Merged
valarLip merged 34 commits into main from batch_prefill_page_size_16_rebase
Jan 13, 2026
Conversation

@Jeff-Huang
Contributor

Motivation

Introduces support for a vectorized KV cache memory layout (e.g., [num_blocks, num_kv_heads, head_size/8, block_size, 8]) to improve memory access efficiency, and adds support for different types of block tables, such as those used by vLLM and SGLang.
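The index mapping behind this layout can be sketched in NumPy (a CPU stand-in; the kernel operates on GPU tensors). The 4D source layout and the helper below are illustrative; the pack width x=8 follows the PR description:

```python
import numpy as np

# Illustrative: repack a standard paged KV cache
# [num_blocks, block_size, num_kv_heads, head_size] into the vectorized
# layout [num_blocks, num_kv_heads, head_size // 8, block_size, 8].
def to_vectorized_kv(cache: np.ndarray, x: int = 8) -> np.ndarray:
    num_blocks, block_size, num_kv_heads, head_size = cache.shape
    assert head_size % x == 0, "head_size must be divisible by the pack width"
    out = cache.transpose(0, 2, 3, 1)  # [blocks, heads, head, block]
    out = out.reshape(num_blocks, num_kv_heads, head_size // x, x, block_size)
    # Swap the pack and block dims so 8 consecutive head-dim elements
    # end up contiguous in memory.
    return np.ascontiguousarray(out.swapaxes(-1, -2))

cache = np.arange(4 * 16 * 2 * 64, dtype=np.float32).reshape(4, 16, 2, 64)
v = to_vectorized_kv(cache)
print(v.shape)  # (4, 2, 8, 16, 8)
# Element (block b, slot s, head h, dim d) lands at v[b, h, d // 8, s, d % 8].
```

With this packing, a thread reading one (head, slot) pair fetches 8 consecutive head-dim elements in a single vectorized load.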

Technical Details

Key changes:

  • KV Cache Layout Optimization and Adjustment:

    • The KV cache memory layout has been adjusted to support vectorized read patterns (vectorized KV layout).
    • Support for multiple layout formats has been implemented, such as [num_blocks, num_kv_heads, head_size/8, block_size, 8] and [num_blocks, num_kv_heads, block_size/8, head_size, 8].
  • vLLM Block Table Integration:

    • Added support for vLLM block table integration ([num_batch, max_blocks_per_seq]).
    • Added support for SGLang block table integration ([num_blocks]).
    • Support for page size 1024
  • Kernel Interface Updates:

    • Added new parameters for the block table and KV cache layout.
  • Structure and Traits Updates:

    • Adapted to changes in the fmha_fwd_batch_prefill_traits structure.
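The two block-table formats above can be sketched side by side: a vLLM-style 2D table is dense and padded per row, while an SGLang-style table is a flat 1D array with per-sequence offsets. The -1 padding value and the companion `indptr` array below are illustrative assumptions, not the kernel's actual convention:

```python
import numpy as np

# Illustrative: flatten a vLLM-style table [num_batch, max_blocks_per_seq]
# into an SGLang-style 1D table [num_blocks] plus per-sequence offsets.
def vllm_to_sglang(block_table_2d, num_blocks_per_seq):
    indptr = np.zeros(len(num_blocks_per_seq) + 1, dtype=np.int64)
    indptr[1:] = np.cumsum(num_blocks_per_seq)
    flat = np.concatenate([
        row[:n] for row, n in zip(block_table_2d, num_blocks_per_seq)
    ])
    return flat, indptr

# Two sequences using 3 and 2 KV blocks respectively, padded with -1.
table = np.array([[7, 2, 5, -1],
                  [9, 1, -1, -1]])
flat, indptr = vllm_to_sglang(table, [3, 2])
print(flat)    # [7 2 5 9 1]
print(indptr)  # [0 3 5]
```

Sequence i's blocks then live in `flat[indptr[i]:indptr[i + 1]]`, which is why the 1D format needs no padding.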

Test Plan

Test Result

Submission Checklist

ltqin and others added 12 commits December 30, 2025 09:09
…/8, block_size, 8], [num_blocks, num_kv_heads, block_size/8, head_size, 8]
…ayout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for 5D vectorized KV layout (swizzled x=8) in host bindings and python wrappers.
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.
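The trait selection described above can be sketched as rank-based dispatch. The two trait names come from the PR text; the dispatch helper and the `LINEAR` fallback for the no-table case are hypothetical:

```python
import numpy as np

# Illustrative: pick the block-table trait from the rank of the table
# argument, mirroring the codegen selection described in the PR.
def select_block_table_trait(block_table):
    if block_table is None:
        return "LINEAR"                  # assumed fallback: no paging
    if block_table.ndim == 2:            # [num_batch, max_blocks_per_seq]
        return "VLLM_BLOCK_TABLE_2D"
    if block_table.ndim == 1:            # [num_blocks]
        return "SGLANG_PAGE_TABLE_1D"
    raise ValueError("block table must be 1D or 2D")

print(select_block_table_trait(np.zeros((2, 4), dtype=np.int32)))
print(select_block_table_trait(np.zeros(8, dtype=np.int32)))
```

Selecting the trait from the argument's shape means callers never pass an explicit layout flag for the table itself.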
@Jeff-Huang Jeff-Huang requested a review from a team December 30, 2025 05:36
Comment thread aiter/ops/mha.py
…ensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating as linear layout with page_size=1.
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.
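A rough Python sketch of how the three KV ranks might be normalized to a common (page_size, strides) view; the shape conventions are inferred from the commit notes and the real logic lives in the C++ stride calculations:

```python
import numpy as np

# Illustrative: map the 3D / 4D / 5D KV shapes described above to an
# effective page size and element strides. Only the last dimension is
# required to be contiguous, per the relaxed check.
def kv_layout_info(kv: np.ndarray):
    if kv.ndim == 3:                      # [N, H, D] linear layout
        page_size = 1
    elif kv.ndim == 4:                    # [num_blocks, page_size, H, D]
        page_size = kv.shape[1]
    elif kv.ndim == 5:                    # [num_blocks, H, D/8, page_size, 8]
        page_size = kv.shape[3]
    else:
        raise ValueError(f"unsupported KV rank {kv.ndim}")
    assert kv.strides[-1] == kv.itemsize, "last dim must be contiguous"
    strides = tuple(s // kv.itemsize for s in kv.strides)  # in elements
    return page_size, strides

print(kv_layout_info(np.zeros((10, 4, 64), dtype=np.float16)))
# (1, (256, 64, 1))
```

Treating the 3D case as page_size=1 lets one kernel path handle both linear and paged caches without a separate code path.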
@valarLip valarLip merged commit 93903b1 into main Jan 13, 2026
19 checks passed
@valarLip valarLip deleted the batch_prefill_page_size_16_rebase branch January 13, 2026 12:02
zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026
… in Batch Prefill kernel (#1754)

* add page size 16 to test and op

* add num_total_pages to kernel parameter

* add is_sglang parameter

* chang is_sglang to is_sglang_layout

* kv last page size=16 pass

* pass kv_last_page_lens to kernel

* add parameters check before calling kernel

* change kv layout to [page_num, page_size, nhead, hdim]

* adopt the changes of struct fmha_fwd_batch_prefill_traits

* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8], [num_blocks, num_kv_heads, block_size/8, head_size, 8]

* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for 5D vectorized KV layout (swizzled x=8) in host bindings and python wrappers.
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.

* update CK

* update ck

* adopt api changes from fmha_batch_prefill_traits

* add support for linear kv cache layout

* update api

* Refactor the test code by gathering the different test functions into one

* update ck

* update ck

* Add profile measurements for batch prefill function

* update ck

* fix style

* fix style

* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating as linear layout with page_size=1.
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.

* update ck

---------

Co-authored-by: ltqin <letaoqin@amd.com>
yzhou103 pushed a commit that referenced this pull request Jan 28, 2026
valarLip pushed a commit that referenced this pull request Mar 18, 2026
valarLip pushed a commit that referenced this pull request Mar 18, 2026