
feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel#2792

Open
elvischenv wants to merge 2 commits into flashinfer-ai:main from elvischenv:elvischenv/support-rope-fusion-token-padding

Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Mar 16, 2026

📌 Description

vLLM is using seqlen=0 padding tokens for running a full cudagraph: https://github.com/vllm-project/vllm/blob/95c0f928cdeeaa21c4906e73cee6a156e1b3b995/vllm/v1/worker/gpu/model_runner.py#L652-L654

Update the following functions:

  • get_batch_indices_positions_kernel: initialize batch_indices/positions to -1/0 so that padding tokens can be recognized
  • rope_quantize_fp8_append_paged_kv_cache: skip those padding tokens
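As a rough illustration, the intended indexing semantics can be sketched in plain Python (a hedged sketch with illustrative argument names; the real implementation is a Triton kernel in flashinfer/triton/page.py):

```python
# Hedged pure-Python sketch of the padded batch_indices/positions
# semantics; the real implementation is a Triton kernel, and the
# argument names here are illustrative.
def batch_indices_positions(append_indptr, seq_lens, nnz):
    # Initialize everything to the padding sentinels (batch_indices = -1,
    # positions = 0) so trailing padding slots stay recognizable.
    batch_indices = [-1] * nnz
    positions = [0] * nnz
    for b, seq_len in enumerate(seq_lens):
        start, end = append_indptr[b], append_indptr[b + 1]
        num_append = end - start
        for i in range(num_append):
            batch_indices[start + i] = b
            # Absolute position of this appended token inside sequence b.
            positions[start + i] = seq_len - num_append + i
    return batch_indices, positions

# Decode batch of 2 requests padded to nnz=4 tokens for a full cudagraph:
# slots 2 and 3 keep the sentinels and will be skipped by the kernel.
print(batch_indices_positions([0, 1, 2], [5, 3], 4))
# → ([0, 1, -1, -1], [4, 2, 0, 0])
```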

Testing:
pytest -v -s tests/attention/test_rope.py::test_rope_quantize_fp8_append_paged_kv_cache_padding

======= 4 passed in 1.66s =======

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Bug Fixes

    • Paged KV cache append now reliably skips padding entries, preventing unintended overwrite during RoPE/quantization processing.
    • Deterministic initialization for internal index/position buffers to avoid uninitialized memory affecting padding behavior.
  • Tests

    • Added parameterized test to verify padding entries are ignored and prefilled KV cache data remains unchanged across layouts and modes.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the KV cache update mechanism to gracefully handle padding tokens, which is crucial for enabling full CUDA graph functionality in systems like vLLM. By explicitly marking and skipping padding tokens during KV cache writes, the changes prevent data corruption and ensure the integrity of the cache while maintaining performance benefits of fixed-size batches.

Highlights

  • Padding Token Support: Introduced support for padding tokens within the rope+quant+kv cache update fusion kernel to align with vLLM's full CUDA graph requirements.
  • Kernel Updates: The get_batch_indices_positions_kernel now explicitly marks padding tokens with batch_indices = -1 and positions = 0. The rope_quantize_fp8_append_paged_kv_cache kernel has been updated to skip processing these marked padding tokens.
  • Test Coverage: A new test case, test_rope_quantize_fp8_append_paged_kv_cache_padding, was added to ensure that padding tokens do not corrupt the KV cache, simulating a decode batch with padded requests.


Changelog
  • flashinfer/page.py
    • Updated get_batch_indices_positions to pass the nnz argument to the Triton kernel.
  • flashinfer/triton/page.py
    • Modified get_batch_indices_positions_kernel to accept nnz and to fill padding entries with batch_indices=-1 and positions=0.
  • include/flashinfer/pos_enc.cuh
    • Added a conditional check in RopeQuantizeAppendPagedKVCacheKernel to return early if batch_indices is less than 0, effectively skipping padding tokens.
  • tests/attention/test_rope.py
    • Added test_rope_quantize_fp8_append_paged_kv_cache_padding to validate that padding tokens do not corrupt the KV cache.
Activity
  • The author has indicated that pre-commit checks have been installed and run, and tests have been added or updated as needed, with all tests passing.

@coderabbitai
Contributor

coderabbitai bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b24e1d3-1670-432c-b064-50a5213c04aa

📥 Commits

Reviewing files that changed from the base of the PR and between 63197ac and 832ac30.

📒 Files selected for processing (3)
  • flashinfer/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/page.py
  • tests/attention/test_rope.py

📝 Walkthrough

Walkthrough

Adds deterministic initialization for batch_indices/positions and kernel-level guards that skip RoPE, quantization, and paged KV cache append work for tokens marked with batch_indices = -1, preventing padding tokens from being processed and corrupting the KV cache.

Changes

Tensor Initialization — flashinfer/page.py
  get_batch_indices_positions now fills batch_indices with -1 for padding markers and initializes positions with 0 when not provided (or resets batch_indices when reusing buffers).
Kernel-Level Padding Guards — include/flashinfer/pos_enc.cuh
  Wrapped the token-specific work in RopeQuantizeAppendPagedKVCacheKernel with if (batch_indices[idx] >= 0) to skip RoPE cos/sin loads, quant stores, and paged KV cache appends for padding tokens.
Test Coverage — tests/attention/test_rope.py
  Added the parameterized test test_rope_quantize_fp8_append_paged_kv_cache_padding to assert batch_indices padding markers and to verify byte-identical preservation of the KV cache for padded entries across attention modes, layouts, and page sizes.
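The verification pattern the new test follows can be modeled in a few lines (a simplified NumPy model with a flat, non-paged cache and illustrative names — not the actual test or cache layout):

```python
import numpy as np

def append_skipping_padding(kv_cache, batch_indices, positions, values):
    """Write values into the cache, leaving padding slots untouched."""
    for i, b in enumerate(batch_indices):
        if b < 0:  # padding sentinel from get_batch_indices_positions
            continue
        kv_cache[b, positions[i]] = values[i]

# Pre-fill a fake cache, run an append with one real token and one
# padding token, then check the untouched region is byte-identical.
cache = np.arange(12, dtype=np.int8).reshape(2, 6)
before = cache.copy()
append_skipping_padding(cache, [0, -1], [2, 0], [99, 123])
assert cache[0, 2] == 99              # real token was written
assert (cache[1] == before[1]).all()  # padding did not clobber row 1
```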

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host (CPU)
    participant Kernel as RopeQuantizeAppendPagedKVCacheKernel (GPU)
    participant KV as Paged KV Cache
    Host->>Host: prepare inputs\n(batch_indices, positions)
    Host->>Kernel: launch kernel(inputs)
    Kernel->>Kernel: idx := thread idx
    alt batch_indices[idx] >= 0
        Kernel->>Kernel: compute page location\napply RoPE, quantize
        Kernel->>KV: append/store K/V/Q to paged cache
    else batch_indices[idx] < 0
        Kernel->>Kernel: skip all RoPE/quantize/cache ops
    end
    Kernel-->>Host: kernel completes

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • yzh119
  • kahyunnam
  • bkryu
  • nvmbreughe
  • nvpohanh

Poem

🐰
Hop-hop, I mark the line with -1,
Padding tucked away where no kernels run,
RoPE and bytes stay snug and tight,
Cache keeps calm through day and night,
A little hop for code well-done.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the title accurately summarizes the main change (adding support for padding tokens with seqlen=0 in the rope+quant+kv cache fusion kernel), matching the core functionality described in the changeset.
  • Description check — ✅ Passed: the PR description includes all key required sections: a clear description of changes, related context (the vLLM use case), completed pre-commit checks, and passing test results demonstrating functionality.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for padding tokens in the rope+quant+kv cache update fused kernel, which is useful for cudagraphs. The approach involves modifying get_batch_indices_positions_kernel to mark padding tokens and updating RopeQuantizeAppendPagedKVCacheKernel to skip them. A new test case is added to validate this padding logic. While the implementation changes seem correct, I've identified issues in the new test case where token positions are calculated incorrectly. This could cause the test to pass while not properly verifying the intended behavior, potentially masking bugs. I've provided suggestions to correct the test logic.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/attention/test_rope.py (1)

1390-1590: Add enable_pdl coverage to this new padding regression test.

Lines 1392-1589 only exercise the default path. Please parameterize enable_pdl and pass it into the fused call so padding behavior is validated under the programmatic dependent launch mode too.

Proposed test update
```diff
 @pytest.mark.parametrize("kv_layout", ["NHD", "HND"])
 @pytest.mark.parametrize("page_size", [16])
+@pytest.mark.parametrize("enable_pdl", [True, False])
 def test_rope_quantize_fp8_append_paged_kv_cache_padding(
 @@
     kv_layout,
     page_size,
+    enable_pdl,
 ):
 @@
     flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache(
 @@
         quant_scale_kv=1.0,
         is_neox=False,
+        enable_pdl=enable_pdl,
     )
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In include/flashinfer/pos_enc.cuh:
- Around lines 862-865: Replace the early return on the batch_indices check so that all threads reach the PDL epilogue: remove the "if (batch_indices[idx] < 0) return;" and instead wrap the work body that follows (currently lines 867-1030) in "if (batch_indices[idx] >= 0) { ... }". Keep the final epilogue (including the griddepcontrol.launch_dependents instruction) outside that guard so it executes unconditionally for every thread in the block; reference the same batch_indices and idx variables used in the kernel.

---

Nitpick comments:
In tests/attention/test_rope.py:
- Around lines 1390-1590: The test test_rope_quantize_fp8_append_paged_kv_cache_padding currently only runs the default path; add a pytest parameterization for enable_pdl (e.g., @pytest.mark.parametrize("enable_pdl", [False, True]) and add enable_pdl to the test signature) and pass enable_pdl into the fused call flashinfer.rope.rope_quantize_fp8_append_paged_kv_cache(enable_pdl=enable_pdl, ...) so the padding behavior is validated under programmatic dependent launch mode as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6761f8e1-64cf-4451-8fe3-a2914f272b95

📥 Commits

Reviewing files that changed from the base of the PR and between b418bc3 and 476de9c.

📒 Files selected for processing (4)
  • flashinfer/page.py
  • flashinfer/triton/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py

Comment on lines +862 to +863
// skip padding tokens with batch_indices < 0
if (batch_indices[idx] >= 0) {
Contributor Author


The main change is just this line; the rest of the diff is indentation-only reformatting.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-868: Make padding-sentinel check type-safe across PagedKVIdType instantiations.

Line 863 uses batch_indices[idx] >= 0, which is only safe when PagedKVIdType is signed. If it is ever unsigned, the padding sentinel -1 becomes the maximum value and this branch incorrectly passes, leading to invalid indptr indexing.

🔧 Proposed fix
```diff
-    // skip padding tokens with batch_indices < 0
-    if (batch_indices[idx] >= 0) {
+    constexpr PagedKVIdType kPaddingSentinel = static_cast<PagedKVIdType>(-1);
+    const PagedKVIdType batch_idx = batch_indices[idx];
+    if (batch_idx != kPaddingSentinel) {
       // Compute page location for this token
       uint32_t page_iter, entry_idx;
       paged_kv_like.page_size.divmod(
-          paged_kv_like.indptr[batch_indices[idx]] * paged_kv_like.page_size + positions[idx],
+          paged_kv_like.indptr[batch_idx] * paged_kv_like.page_size + positions[idx],
           page_iter, entry_idx);
```
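The hazard is easy to reproduce on the host: reinterpreting the bits of the -1 sentinel as an unsigned 32-bit integer yields the type's maximum value, so a >= 0 guard always passes. A NumPy illustration of the bit pattern (not FlashInfer code):

```python
import numpy as np

# The -1 padding sentinel as a signed 32-bit index.
sentinel = np.array(-1, dtype=np.int32)

# Reinterpret the same bits as unsigned: -1 becomes 2**32 - 1.
as_unsigned = sentinel.view(np.uint32)
assert int(as_unsigned) == 2**32 - 1

# For an unsigned type, a ">= 0" padding check is always true, so the
# sentinel would be treated as a (huge, out-of-range) batch index.
assert as_unsigned >= 0
```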

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a25b44d8-52fc-4514-8eeb-44210f1c0877

📥 Commits

Reviewing files that changed from the base of the PR and between 476de9c and 54c5fff.

📒 Files selected for processing (1)
  • include/flashinfer/pos_enc.cuh

@elvischenv elvischenv force-pushed the elvischenv/support-rope-fusion-token-padding branch 2 times, most recently from f0f6c80 to 63197ac Compare March 16, 2026 07:29
@elvischenv elvischenv changed the title feat: Support padding token for rope+quant+kv cache update fusion kernel feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel Mar 16, 2026
@elvischenv elvischenv changed the title feat: Support 0 seqlen padding tokens for rope+quant+kv cache update fusion kernel feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel Mar 16, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)

862-863: Consider adding a compile-time assertion to document the signed type requirement for the padding check.

The code assumes PagedKVIdType is signed to make the check batch_indices[idx] >= 0 work correctly for detecting padding tokens (marked with -1). While the type is hardcoded to int32_t in the JIT binding at csrc/rope.cu (lines 574, 605) and in Python via torch.int(), a static assertion would document this requirement and guard against accidental misuse:

```cpp
static_assert(std::is_signed<PagedKVIdType>::value,
              "PagedKVIdType must be signed for padding token detection (batch_indices < 0)");
```

This defensive check aligns with the kernel's assumptions and improves long-term maintainability without adding runtime overhead.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: de20e129-16a7-4945-80ef-553ab0f8df70

📥 Commits

Reviewing files that changed from the base of the PR and between 54c5fff and 63197ac.

📒 Files selected for processing (3)
  • flashinfer/page.py
  • include/flashinfer/pos_enc.cuh
  • tests/attention/test_rope.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • flashinfer/page.py
  • tests/attention/test_rope.py

@elvischenv
Contributor Author

Hi @yzh119, could you help review this? We need this fix for integrating this kernel into vLLM. Thanks!

@elvischenv
Contributor Author

cc @kahyunnam for visibility.

@bkryu bkryu added the run-ci label Mar 20, 2026
@bkryu
Collaborator

bkryu commented Mar 20, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46584451 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #46584451: 14/20 passed

@nvpohanh
Contributor

/bot run

@flashinfer-bot
Collaborator

GitLab MR !438 has been created, and the CI pipeline #46776615 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #46776615: 12/20 passed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants