
[FMHA] Support Vectorized KV Cache Layout and vLLM/SGLang block table in Batch Prefill kernel#1754

Merged
valarLip merged 34 commits into main from batch_prefill_page_size_16_rebase
Jan 13, 2026
Conversation

@Jeff-Huang
Contributor

Motivation

Introduces support for a vectorized KV cache memory layout (e.g., [num_blocks, num_kv_heads, head_size/8, block_size, 8]) to improve memory access efficiency, and adds support for different types of block tables, such as those used by vLLM and SGLang.
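The index mapping behind this layout can be sketched in NumPy (a CPU stand-in; the kernel operates on GPU tensors). The 4D source layout and the helper below are illustrative; the pack width x=8 follows the PR description:

```python
import numpy as np

# Illustrative: repack a standard paged KV cache
# [num_blocks, block_size, num_kv_heads, head_size] into the vectorized
# layout [num_blocks, num_kv_heads, head_size // 8, block_size, 8].
def to_vectorized_kv(cache: np.ndarray, x: int = 8) -> np.ndarray:
    num_blocks, block_size, num_kv_heads, head_size = cache.shape
    assert head_size % x == 0, "head_size must be divisible by the pack width"
    out = cache.transpose(0, 2, 3, 1)  # [blocks, heads, head, block]
    out = out.reshape(num_blocks, num_kv_heads, head_size // x, x, block_size)
    # Swap the pack and block dims so 8 consecutive head-dim elements
    # end up contiguous in memory.
    return np.ascontiguousarray(out.swapaxes(-1, -2))

cache = np.arange(4 * 16 * 2 * 64, dtype=np.float32).reshape(4, 16, 2, 64)
v = to_vectorized_kv(cache)
print(v.shape)  # (4, 2, 8, 16, 8)
# Element (block b, slot s, head h, dim d) lands at v[b, h, d // 8, s, d % 8].
```

With this packing, a thread reading one (head, slot) pair fetches 8 consecutive head-dim elements in a single vectorized load.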

Technical Details

Key changes:

  • KV Cache Layout Optimization and Adjustment:

    • The KV cache memory layout has been adjusted to support vectorized read patterns (vectorized KV layout).
    • Support for multiple layout formats has been implemented, such as [num_blocks, num_kv_heads, head_size/8, block_size, 8] and [num_blocks, num_kv_heads, block_size/8, head_size, 8].
  • vLLM Block Table Integration:

    • Added support for vLLM block table integration ([num_batch, max_blocks_per_seq]).
    • Added support for SGLang block table integration ([num_blocks]).
    • Support for page size 1024
  • Kernel Interface Updates:

    • Added new parameters for the block table and KV cache layout.
  • Structure and Traits Updates:

    • Adapted to changes in the fmha_fwd_batch_prefill_traits structure.
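The two block-table formats above can be sketched side by side: a vLLM-style 2D table is dense and padded per row, while an SGLang-style table is a flat 1D array with per-sequence offsets. The -1 padding value and the companion `indptr` array below are illustrative assumptions, not the kernel's actual convention:

```python
import numpy as np

# Illustrative: flatten a vLLM-style table [num_batch, max_blocks_per_seq]
# into an SGLang-style 1D table [num_blocks] plus per-sequence offsets.
def vllm_to_sglang(block_table_2d, num_blocks_per_seq):
    indptr = np.zeros(len(num_blocks_per_seq) + 1, dtype=np.int64)
    indptr[1:] = np.cumsum(num_blocks_per_seq)
    flat = np.concatenate([
        row[:n] for row, n in zip(block_table_2d, num_blocks_per_seq)
    ])
    return flat, indptr

# Two sequences using 3 and 2 KV blocks respectively, padded with -1.
table = np.array([[7, 2, 5, -1],
                  [9, 1, -1, -1]])
flat, indptr = vllm_to_sglang(table, [3, 2])
print(flat)    # [7 2 5 9 1]
print(indptr)  # [0 3 5]
```

Sequence i's blocks then live in `flat[indptr[i]:indptr[i + 1]]`, which is why the 1D format needs no padding.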

Test Plan

Test Result

Submission Checklist

ltqin and others added 12 commits December 30, 2025 09:09
…/8, block_size, 8], [num_blocks, num_kv_heads, block_size/8, head_size, 8]
…ayout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for 5D vectorized KV layout (swizzled x=8) in host bindings and python wrappers.
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.
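The trait selection described above can be sketched as rank-based dispatch. The two trait names come from the PR text; the dispatch helper and the `LINEAR` fallback for the no-table case are hypothetical:

```python
import numpy as np

# Illustrative: pick the block-table trait from the rank of the table
# argument, mirroring the codegen selection described in the PR.
def select_block_table_trait(block_table):
    if block_table is None:
        return "LINEAR"                  # assumed fallback: no paging
    if block_table.ndim == 2:            # [num_batch, max_blocks_per_seq]
        return "VLLM_BLOCK_TABLE_2D"
    if block_table.ndim == 1:            # [num_blocks]
        return "SGLANG_PAGE_TABLE_1D"
    raise ValueError("block table must be 1D or 2D")

print(select_block_table_trait(np.zeros((2, 4), dtype=np.int32)))
print(select_block_table_trait(np.zeros(8, dtype=np.int32)))
```

Selecting the trait from the argument's shape means callers never pass an explicit layout flag for the table itself.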
@Jeff-Huang Jeff-Huang requested a review from a team December 30, 2025 05:36
Comment thread aiter/ops/mha.py
…ensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating as linear layout with page_size=1.
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.
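A rough Python sketch of how the three KV ranks might be normalized to a common (page_size, strides) view; the shape conventions are inferred from the commit notes and the real logic lives in the C++ stride calculations:

```python
import numpy as np

# Illustrative: map the 3D / 4D / 5D KV shapes described above to an
# effective page size and element strides. Only the last dimension is
# required to be contiguous, per the relaxed check.
def kv_layout_info(kv: np.ndarray):
    if kv.ndim == 3:                      # [N, H, D] linear layout
        page_size = 1
    elif kv.ndim == 4:                    # [num_blocks, page_size, H, D]
        page_size = kv.shape[1]
    elif kv.ndim == 5:                    # [num_blocks, H, D/8, page_size, 8]
        page_size = kv.shape[3]
    else:
        raise ValueError(f"unsupported KV rank {kv.ndim}")
    assert kv.strides[-1] == kv.itemsize, "last dim must be contiguous"
    strides = tuple(s // kv.itemsize for s in kv.strides)  # in elements
    return page_size, strides

print(kv_layout_info(np.zeros((10, 4, 64), dtype=np.float16)))
# (1, (256, 64, 1))
```

Treating the 3D case as page_size=1 lets one kernel path handle both linear and paged caches without a separate code path.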
@valarLip valarLip merged commit 93903b1 into main Jan 13, 2026
19 checks passed
@valarLip valarLip deleted the batch_prefill_page_size_16_rebase branch January 13, 2026 12:02
zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026
… in Batch Prefill kernel (#1754)

* add page size 16 to test and op

* add num_total_pages to kernel parameter

* add is_sglang parameter

* chang is_sglang to is_sglang_layout

* kv last page size=16 pass

* pass kv_last_page_lens to kernel

* add parameters check before calling kernel

* change kv layout to [page_num, page_size, nhead, hdim]

* adopt the changes of struct fmha_fwd_batch_prefill_traits

* change kv cache memory layout to [num_blocks, num_kv_heads, head_size/8, block_size, 8], [num_blocks, num_kv_heads, block_size/8, head_size, 8]

* [FMHA] Integrate vLLM block table support and enforce vectorized KV layout

Updated `mha_batch_prefill` API and tests to support vLLM-style block tables alongside SGLang-style page tables, while enforcing the new hardware-optimized 5D vectorized KV cache layout.

**Key Changes:**
*   **API**: Added `block_table` and `seqlen_k` arguments to python/C++ interfaces.
*   **Layout Enforcement**: Added strict checks for 5D vectorized KV layout (swizzled x=8) in host bindings and python wrappers.
*   **CodeGen**: Automatically select `VLLM_BLOCK_TABLE_2D` or `SGLANG_PAGE_TABLE_1D` trait based on input arguments.
*   **Tests**: Added `test_batch_prefill_vllm` to verify block table correctness and updated existing tests to use the vectorized layout.

* update CK

* update ck

* adopt api changes from fmha_batch_prefill_traits

* add support for linear kv cache layout

* update api

* Refactor the test code by gathering the different test functions into one

* update ck

* update ck

* Add profile measurements for batch prefill function

* update ck

* fix style

* fix style

* [FMHA] Support 3D linear layout (page_size=1) and non-contiguous KV tensors in batch prefill

- Enable 3D [N, H, D] K/V tensors for batch prefill, treating as linear layout with page_size=1.
- Relax contiguity checks to only require the last dimension to be contiguous.
- Update C++ stride calculations for 3D, 4D, and 5D layouts.
- Add tests for 3D layout and non-contiguous KV cache.

* update ck

---------

Co-authored-by: ltqin <letaoqin@amd.com>
yzhou103 pushed a commit that referenced this pull request Jan 28, 2026
valarLip pushed a commit that referenced this pull request Mar 18, 2026
valarLip pushed a commit that referenced this pull request Mar 18, 2026