
feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel#2792

Open
elvischenv wants to merge 2 commits into flashinfer-ai:main from elvischenv:elvischenv/support-rope-fusion-token-padding

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Mar 16, 2026

📌 Description

vLLM is using seqlen=0 padding tokens for running a full cudagraph: https://github.com/vllm-project/vllm/blob/95c0f928cdeeaa21c4906e73cee6a156e1b3b995/vllm/v1/worker/gpu/model_runner.py#L652-L654

Update the following functions:

  • get_batch_indices_positions_kernel: initialize batch_indices/positions to -1/0 so that padding tokens can be recognized
  • rope_quantize_fp8_append_paged_kv_cache: skip those padding tokens
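As a rough illustration, the intended indexing semantics can be sketched in plain Python (a hedged sketch with illustrative argument names; the real implementation is a Triton kernel in flashinfer/triton/page.py):

```python
# Hedged pure-Python sketch of the padded batch_indices/positions
# semantics; the real implementation is a Triton kernel, and the
# argument names here are illustrative.
def batch_indices_positions(append_indptr, seq_lens, nnz):
    # Initialize everything to the padding sentinels (batch_indices = -1,
    # positions = 0) so trailing padding slots stay recognizable.
    batch_indices = [-1] * nnz
    positions = [0] * nnz
    for b, seq_len in enumerate(seq_lens):
        start, end = append_indptr[b], append_indptr[b + 1]
        num_append = end - start
        for i in range(num_append):
            batch_indices[start + i] = b
            # Absolute position of this appended token inside sequence b.
            positions[start + i] = seq_len - num_append + i
    return batch_indices, positions

# Decode batch of 2 requests padded to nnz=4 tokens for a full cudagraph:
# slots 2 and 3 keep the sentinels and will be skipped by the kernel.
print(batch_indices_positions([0, 1, 2], [5, 3], 4))
# → ([0, 1, -1, -1], [4, 2, 0, 0])
```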

Testing:
pytest -v -s tests/attention/test_rope.py::test_rope_quantize_fp8_append_paged_kv_cache_padding

======= 4 passed in 1.66s =======

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Paged KV cache append now reliably skips padding entries, preventing unintended overwrite during RoPE/quantization processing.
    • Deterministic initialization for internal index/position buffers to avoid uninitialized memory affecting padding behavior.
  • Tests

    • Added parameterized test to verify padding entries are ignored and prefilled KV cache data remains unchanged across layouts and modes.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the KV cache update mechanism to gracefully handle padding tokens, which is crucial for enabling full CUDA graph functionality in systems like vLLM. By explicitly marking and skipping padding tokens during KV cache writes, the changes prevent data corruption and ensure the integrity of the cache while maintaining performance benefits of fixed-size batches.

Highlights

  • Padding Token Support: Introduced support for padding tokens within the rope+quant+kv cache update fusion kernel to align with vLLM's full CUDA graph requirements.
  • Kernel Updates: The get_batch_indices_positions_kernel now explicitly marks padding tokens with batch_indices = -1 and positions = 0. The rope_quantize_fp8_append_paged_kv_cache kernel has been updated to skip processing these marked padding tokens.
  • Test Coverage: A new test case, test_rope_quantize_fp8_append_paged_kv_cache_padding, was added to ensure that padding tokens do not corrupt the KV cache, simulating a decode batch with padded requests.


Changelog
  • flashinfer/page.py
    • Updated get_batch_indices_positions to pass the nnz argument to the Triton kernel.
  • flashinfer/triton/page.py
    • Modified get_batch_indices_positions_kernel to accept nnz and to fill padding entries with batch_indices=-1 and positions=0.
  • include/flashinfer/pos_enc.cuh
    • Added a conditional check in RopeQuantizeAppendPagedKVCacheKernel to return early if batch_indices is less than 0, effectively skipping padding tokens.
  • tests/attention/test_rope.py
    • Added test_rope_quantize_fp8_append_paged_kv_cache_padding to validate that padding tokens do not corrupt the KV cache.
Activity
  • The author has indicated that pre-commit checks have been installed and run, and tests have been added or updated as needed, with all tests passing.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b24e1d3-1670-432c-b064-50a5213c04aa

📥 Commits

Reviewing files that changed from the base of the PR and between 63197ac and 832ac30.

📒 Files selected for processing (3)
  • flashinfer/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/page.py
  • tests/attention/test_rope.py

📝 Walkthrough

Walkthrough

Adds deterministic initialization for batch_indices/positions and kernel-level guards that skip RoPE, quantization, and paged KV cache append work for tokens marked with batch_indices = -1, preventing padding tokens from being processed and corrupting the KV cache.

Changes

Tensor Initialization — flashinfer/page.py
  get_batch_indices_positions now fills batch_indices with -1 for padding markers and initializes positions with 0 when not provided (or resets batch_indices when reusing buffers).
Kernel-Level Padding Guards — include/flashinfer/pos_enc.cuh
  Wrapped the token-specific work in RopeQuantizeAppendPagedKVCacheKernel with if (batch_indices[idx] >= 0) to skip RoPE cos/sin loads, quant stores, and paged KV cache appends for padding tokens.
Test Coverage — tests/attention/test_rope.py
  Added the parameterized test test_rope_quantize_fp8_append_paged_kv_cache_padding to assert batch_indices padding markers and to verify byte-identical preservation of the KV cache for padded entries across attention modes, layouts, and page sizes.
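The verification pattern the new test follows can be modeled in a few lines (a simplified NumPy model with a flat, non-paged cache and illustrative names — not the actual test or cache layout):

```python
import numpy as np

def append_skipping_padding(kv_cache, batch_indices, positions, values):
    """Write values into the cache, leaving padding slots untouched."""
    for i, b in enumerate(batch_indices):
        if b < 0:  # padding sentinel from get_batch_indices_positions
            continue
        kv_cache[b, positions[i]] = values[i]

# Pre-fill a fake cache, run an append with one real token and one
# padding token, then check the untouched region is byte-identical.
cache = np.arange(12, dtype=np.int8).reshape(2, 6)
before = cache.copy()
append_skipping_padding(cache, [0, -1], [2, 0], [99, 123])
assert cache[0, 2] == 99              # real token was written
assert (cache[1] == before[1]).all()  # padding did not clobber row 1
```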

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host (CPU)
    participant Kernel as RopeQuantizeAppendPagedKVCacheKernel (GPU)
    participant KV as Paged KV Cache
    Host->>Host: prepare inputs\n(batch_indices, positions)
    Host->>Kernel: launch kernel(inputs)
    Kernel->>Kernel: idx := thread idx
    alt batch_indices[idx] >= 0
        Kernel->>Kernel: compute page location\napply RoPE, quantize
        Kernel->>KV: append/store K/V/Q to paged cache
    else batch_indices[idx] < 0
        Kernel->>Kernel: skip all RoPE/quantize/cache ops
    end
    Kernel-->>Host: kernel completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • yzh119
  • kahyunnam
  • bkryu
  • nvmbreughe
  • nvpohanh

Poem

🐰
Hop-hop, I mark the line with -1,
Padding tucked away where no kernels run,
RoPE and bytes stay snug and tight,
Cache keeps calm through day and night,
A little hop for code well-done.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the title accurately summarizes the main change (adding support for padding tokens with seqlen=0 in the rope+quant+kv cache fusion kernel), matching the core functionality described in the changeset.
  • Description check — ✅ Passed: the PR description includes all key required sections: a clear description of changes, related context (the vLLM use case), completed pre-commit checks, and passing test results demonstrating functionality.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for padding tokens in the rope+quant+kv cache update fused kernel, which is useful for cudagraphs. The approach involves modifying get_batch_indices_positions_kernel to mark padding tokens and updating RopeQuantizeAppendPagedKVCacheKernel to skip them. A new test case is added to validate this padding logic. While the implementation changes seem correct, I've identified issues in the new test case where token positions are calculated incorrectly. This could cause the test to pass while not properly verifying the intended behavior, potentially masking bugs. I've provided suggestions to correct the test logic.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/attention/test_rope.py (1)

1390-1590: Add enable_pdl coverage to this new padding regression test.

Lines 1392-1589 only exercise the default path. Please parameterize enable_pdl and pass it into the fused call so padding behavior is validated under the programmatic dependent launch mode too.

Proposed test update
```diff
 @pytest.mark.parametrize("kv_layout", ["NHD", "HND"])
 @pytest.mark.parametrize("page_size", [16])
+@pytest.mark.parametrize("enable_pdl", [True, False])
 def test_rope_quantize_fp8_append_paged_kv_cache_padding(
 @@
     kv_layout,
     page_size,
+    enable_pdl,
 ):
 @@
     flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache(
 @@
         quant_scale_kv=1.0,
         is_neox=False,
+        enable_pdl=enable_pdl,
     )
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In include/flashinfer/pos_enc.cuh:
- Around lines 862-865: Replace the early return on the batch_indices check so that all threads reach the PDL epilogue: remove the "if (batch_indices[idx] < 0) return;" and instead wrap the work body that follows (currently lines 867-1030) in "if (batch_indices[idx] >= 0) { ... }". Keep the final epilogue (including the griddepcontrol.launch_dependents instruction) outside that guard so it executes unconditionally for every thread in the block; reference the same batch_indices and idx variables used in the kernel.

---

Nitpick comments:
In tests/attention/test_rope.py:
- Around lines 1390-1590: The test test_rope_quantize_fp8_append_paged_kv_cache_padding currently only runs the default path; add a pytest parameterization for enable_pdl (e.g., @pytest.mark.parametrize("enable_pdl", [False, True]) and add enable_pdl to the test signature) and pass enable_pdl into the fused call flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache(enable_pdl=enable_pdl, ...) so the padding behavior is validated under programmatic dependent launch mode as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6761f8e1-64cf-4451-8fe3-a2914f272b95

📥 Commits

Reviewing files that changed from the base of the PR and between b418bc3 and 476de9c.

📒 Files selected for processing (4)
  • flashinfer/page.py
  • flashinfer/triton/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py

Comment on lines +862 to +863
// skip padding tokens with batch_indices < 0
if (batch_indices[idx] >= 0) {
Contributor Author


The main change is just this line; the rest of the diff is indentation-only reformatting.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-868: Make padding-sentinel check type-safe across PagedKVIdType instantiations.

Line 863 uses batch_indices[idx] >= 0, which is only safe when PagedKVIdType is signed. If it is ever unsigned, the padding sentinel -1 becomes the maximum value and this branch incorrectly passes, leading to invalid indptr indexing.

🔧 Proposed fix
```diff
-    // skip padding tokens with batch_indices < 0
-    if (batch_indices[idx] >= 0) {
+    constexpr PagedKVIdType kPaddingSentinel = static_cast<PagedKVIdType>(-1);
+    const PagedKVIdType batch_idx = batch_indices[idx];
+    if (batch_idx != kPaddingSentinel) {
       // Compute page location for this token
       uint32_t page_iter, entry_idx;
       paged_kv_like.page_size.divmod(
-          paged_kv_like.indptr[batch_indices[idx]] * paged_kv_like.page_size + positions[idx],
+          paged_kv_like.indptr[batch_idx] * paged_kv_like.page_size + positions[idx],
           page_iter, entry_idx);
```
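The hazard is easy to reproduce on the host: reinterpreting the bits of the -1 sentinel as an unsigned 32-bit integer yields the type's maximum value, so a >= 0 guard always passes. A NumPy illustration of the bit pattern (not FlashInfer code):

```python
import numpy as np

# The -1 padding sentinel as a signed 32-bit index.
sentinel = np.array(-1, dtype=np.int32)

# Reinterpret the same bits as unsigned: -1 becomes 2**32 - 1.
as_unsigned = sentinel.view(np.uint32)
assert int(as_unsigned) == 2**32 - 1

# For an unsigned type, a ">= 0" padding check is always true, so the
# sentinel would be treated as a (huge, out-of-range) batch index.
assert as_unsigned >= 0
```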

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a25b44d8-52fc-4514-8eeb-44210f1c0877

📥 Commits

Reviewing files that changed from the base of the PR and between 476de9c and 54c5fff.

📒 Files selected for processing (1)
  • include/flashinfer/pos_enc.cuh

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch 2 times, most recently from f0f6c80 to 63197ac Compare March 16, 2026 07:29
@elvischenv elvischenv changed the title feat: Support padding token for rope+quant+kv cache update fusion kernel feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel Mar 16, 2026
@elvischenv elvischenv changed the title feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel Mar 16, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-863: Consider adding a compile-time assertion to document the signed type requirement for the padding check.

The code assumes PagedKVIdType is signed to make the check batch_indices[idx] >= 0 work correctly for detecting padding tokens (marked with -1). While the type is hardcoded to int32_t in the JIT binding at csrc/rope.cu (lines 574, 605) and in Python via torch.int(), a static assertion would document this requirement and guard against accidental misuse:

```cpp
static_assert(std::is_signed<PagedKVIdType>::value,
              "PagedKVIdType must be signed for padding token detection (batch_indices < 0)");
```

This defensive check aligns with the kernel's assumptions and improves long-term maintainability without adding runtime overhead.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de20e129-16a7-4945-80ef-553ab0f8df70

📥 Commits

Reviewing files that changed from the base of the PR and between 54c5fff and 63197ac.

📒 Files selected for processing (3)
  • flashinfer/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/page.py
  • tests/attention/test_rope.py

@elvischenv
Contributor Author

Hi @yzh119, could you help review this? We need this fix for integrating this kernel into vLLM. Thanks!

@elvischenv
Contributor Author

cc @kahyunnam for visibility.

@bkryu bkryu added the run-ci label Mar 20, 2026
@bkryu
Collaborator

bkryu commented Mar 20, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46584451 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #46584451: 14/20 passed

@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46776615 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46776615: 12/20 passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants