
Add fused RoPE + paged KV cache append op for the non-quant path #24678 #2785

Open
baonudesifeizhai wants to merge 6 commits into flashinfer-ai:main from baonudesifeizhai:addkernel

Conversation

@baonudesifeizhai
Contributor

@baonudesifeizhai baonudesifeizhai commented Mar 14, 2026

📌 Description

🔍 Related Issues

#24678
vllm-project/vllm#37041

Add rope_append_paged_kv_cache, a fused non-quant RoPE + paged KV cache append op.

This fills the missing fused rope + cache path for serving decode integration, and can still write to FP8 KV cache through kv_scale.

Motivation

#24678
We were missing the fused non-quant rope + cache op needed by the vLLM integration path.
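For reviewers less familiar with the rotary step being fused here, a minimal pure-Python reference of interleaved RoPE (the `is_neox=False` convention) applied to one head vector. This is a sketch of the common convention only; the `theta` default and pairing are illustrative, not necessarily this kernel's exact math:

```python
import math

def apply_rope_interleaved(x, pos, theta=10000.0):
    # Rotate adjacent pairs (x[2i], x[2i+1]) by angle pos * theta^(-2i/d).
    # Illustrative reference, not FlashInfer's kernel implementation.
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        freq = theta ** (-2.0 * i / d)
        c, s = math.cos(pos * freq), math.sin(pos * freq)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out
```

A fused op applies this rotation to Q and K and writes the rotated K (plus V) straight into the paged cache, avoiding a separate cache-append kernel launch on the decode path.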

Validation

  • Verified in vLLM integration
  • Checked end-to-end serving correctness on decode requests

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

pytest -q tests/attention/test_rope.py -k "test_generalized_rope_quantize_append_kv_cache or test_rope_quantize_fp8_append_paged_kv_cache_decode" — 6528 passed, 23722 deselected in 359.06s (0:05:59)

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features
    • Fused paged KV-cache append with RoPE supporting GQA/MHA/MLA, configurable layouts, page sizes, FP8/FP16/BF16 quantization, interleave and PDL options, returning updated Q outputs.
  • API
    • Exposed a high-level public primitive to invoke the paged KV append path from Python and re-exported at package top-level.
  • Tests
    • Added comprehensive parameterized tests validating outputs and paged-cache integrity across layouts, dtypes, page sizes and quantized paths.

@coderabbitai
Contributor

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

Walkthrough

Adds a fused operation rope_append_paged_kv_cache across CUDA headers/kernels, host dispatch, FFI bindings, Python APIs, and tests to apply RoPE to Q/K and append K/V into a paged KV cache (supports FP16/BF16/FP8, layouts, MLA/GQA/MHA variants).

Changes

Cohort / File(s) Summary
CUDA headers / kernel declarations
include/flashinfer/pos_enc.cuh
Added RopeAppendPagedKVCacheParams, templated RopeAppendPagedKVCacheKernel, and host RopeAppendPagedKVCache entry to support RoPE + paged K/V append with interleave/vectorization and kv_scale.
CUDA implementation / dispatch
csrc/rope.cu
Added FFI-callable rope_append_paged_kv_cache: tensor validation, runtime param extraction, dtype/layout/architecture dispatch (FP16/BF16/FP8, MLA vs GQA/MHA), and invocation of the templated host/kernel API.
FFI binding
csrc/flashinfer_rope_binding.cu
Registered and exported rope_append_paged_kv_cache for FFI (TVM_FFI_DLL_EXPORT_TYPED_FUNC) and added it to the export list.
Python API & package export
flashinfer/rope.py, flashinfer/__init__.py
Added public API rope_append_paged_kv_cache, internal _rope_append_paged_kv_cache and fake-op stub, input normalization/validation, layout-code resolution, and package-level re-export.
Tests
tests/attention/test_rope.py
Added parameterized test test_rope_append_paged_kv_cache_decode (duplicate block present) validating outputs and paged-KV updates across layouts, dtypes, FP8 paths, page sizes, and MLA/GQA variants.
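For context on the layout resolution mentioned above, the two supported paged-cache layouts imply different 4D cache shapes. A small illustrative helper (not a FlashInfer API) following the NHD/HND convention used in the tests:

```python
def expected_cache_shape(kv_layout, num_pages, page_size, num_kv_heads, head_dim):
    # NHD stores (page, entry, head, dim); HND stores (page, head, entry, dim).
    # Illustrative only; names and axis order assume the common convention.
    if kv_layout == "NHD":
        return (num_pages, page_size, num_kv_heads, head_dim)
    if kv_layout == "HND":
        return (num_pages, num_kv_heads, page_size, head_dim)
    raise ValueError(f"unsupported kv_layout: {kv_layout}")
```

Parameterizing tests over both layouts catches stride bugs that a single-layout test would miss.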

Sequence Diagram(s)

sequenceDiagram
    participant Py as Python API
    participant FFI as FFI Binding
    participant Host as CUDA Host/Dispatch
    participant Kernel as CUDA Kernel
    participant KV as Paged KV Cache (device)

    Py->>FFI: call rope_append_paged_kv_cache(args)
    FFI->>Host: forward TensorViews & params
    Host->>Host: validate shapes/dtypes, resolve layout & templates
    Host->>Kernel: launch RopeAppendPagedKVCacheKernel(stream, params...)
    Kernel->>KV: read existing pages, write appended K/V pages
    Kernel-->>Host: kernel complete
    Host-->>FFI: return q_rope_out, q_nope_out
    FFI-->>Py: return outputs
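The page arithmetic behind the "write appended K/V pages" step can be modeled in a few lines of Python, assuming the usual indptr/indices paging scheme (illustrative names, not FlashInfer's internal code):

```python
def locate_kv_slot(kv_indptr, kv_indices, page_size, batch_idx, token_offset):
    # Map the token_offset-th cached token of request batch_idx to a
    # (physical_page_id, in_page_entry) pair in the paged KV pool.
    page_start = kv_indptr[batch_idx]          # first logical page of this request
    logical_page = token_offset // page_size   # which of its pages holds the token
    entry = token_offset % page_size           # slot within that page
    return kv_indices[page_start + logical_page], entry
```

`kv_last_page_len` then tells the kernel how many entries of a request's final page are already occupied, i.e. where the append starts.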

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • yzh119
  • kahyunnam
  • nvmbreughe
  • cyx-6
  • jiahanc
  • bkryu
  • IwakuraRein
  • jimmyzho

Poem

🐰 I twirl the ropes and hop with glee,
Pages fill with K and V for me.
GPUs hum, pages bloom anew,
Append and rotate — code hops through.
Hop, hop, cache, and revel too.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — the title directly and clearly summarizes the main change: adding a fused RoPE + paged KV cache append operation for the non-quantized path.
  • Description check ✅ — the description includes motivation, validation details, and related issues, and confirms pre-commit checks are complete; however, the main description section is minimal and the test pass/fail status is incomplete.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a crucial optimization for large language model serving by adding a fused CUDA operation for applying Rotary Positional Embeddings (RoPE) and appending to a paged Key-Value (KV) cache. This new functionality streamlines the decode integration path for non-quantized inputs, enhancing performance and completing the set of necessary fused operations for efficient inference, while retaining compatibility with FP8 KV caches.

Highlights

  • New Fused Operation: Introduced rope_append_paged_kv_cache, a new fused operation that combines Rotary Positional Embeddings (RoPE) application with paged Key-Value (KV) cache appending.
  • Non-Quantized Path Support: This operation specifically targets the non-quantized path, addressing a missing fused rope + cache functionality required for serving decode integration.
  • FP8 KV Cache Compatibility: The new fused op can still write to FP8 KV cache by utilizing a kv_scale parameter, maintaining flexibility in cache data types.
  • vLLM Integration: The feature fills a critical gap for vLLM integration, ensuring a complete set of fused operations for efficient model serving.


Changelog
  • csrc/flashinfer_rope_binding.cu
    • Declared and exported the new rope_append_paged_kv_cache function for TVM FFI binding.
  • csrc/rope.cu
    • Implemented the rope_append_paged_kv_cache host function, including input validation and dispatch logic.
    • Added the RopeAppendPagedKVCacheKernel CUDA kernel for the fused RoPE and paged KV cache append logic.
  • flashinfer/__init__.py
    • Imported the new rope_append_paged_kv_cache function to make it accessible in the Python API.
  • flashinfer/rope.py
    • Added Python bindings (_rope_append_paged_kv_cache) for the new custom CUDA operation.
    • Implemented a user-facing Python function rope_append_paged_kv_cache with argument handling and type checks.
    • Included a fake operation for _rope_append_paged_kv_cache to support testing and development workflows.
  • include/flashinfer/pos_enc.cuh
    • Defined the RopeAppendPagedKVCacheParams struct to encapsulate parameters for the new fused operation.
    • Declared the RopeAppendPagedKVCacheKernel and RopeAppendPagedKVCache template functions.
Activity
  • The pull request was opened by baonudesifeizhai with the intent to add a fused RoPE + paged KV cache append op for non-quantized paths.
  • The author validated the changes through vLLM integration and confirmed end-to-end serving correctness on decode requests.
  • Tests related to test_generalized_rope_quantize_append_kv_cache and test_rope_quantize_fp8_append_paged_kv_cache_decode were run and passed.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new fused RoPE and paged KV cache append operation for the non-quantized path, which is a valuable addition for serving decode integration. The implementation follows the existing patterns in the codebase. However, I've identified critical out-of-bounds memory access issues in the new CUDA kernel RopeAppendPagedKVCacheKernel. The processing for k_nope, v, and q_nope does not correctly handle partial vector reads/writes at chunk boundaries, which can lead to memory corruption or crashes. Please address these issues to ensure kernel correctness and stability.

Comment on lines +1183 to +1194
vec_t<float, vec_size> k_nope_vec;
k_nope_vec.cast_load(k_nope_in_ptr + tx * vec_size);
if constexpr (kNeedsScale) {
#pragma unroll
for (uint32_t i = 0; i < vec_size; ++i) {
k_nope_vec[i] = k_nope_vec[i] * kv_scale;
}
}

CacheType* k_ptr = paged_kv.get_k_ptr(page_iter, k_head_idx, entry_idx,
rope_dim + elem_offset + tx * vec_size);
k_nope_vec.cast_store(k_ptr);

critical

Similar to the q_nope section, the vectorized load at line 1184 and subsequent store at line 1194 can cause out-of-bounds memory access on partial chunks when no_rope_dim is not a multiple of rope_dim. Please add boundary checks to prevent this.

uint32_t v_elem_offset = j * rope_chunk_size;
if (v_elem_offset + tx * vec_size < head_dim_total) {
vec_t<float, vec_size> v_vec;
v_vec.cast_load(v_in_ptr + v_elem_offset + tx * vec_size);

critical

The vectorized load for v_in can read out of bounds. The check at line 1205 (if (v_elem_offset + tx * vec_size < head_dim_total)) is not sufficient to prevent a vectorized read from crossing the boundary of head_dim_total. This needs to be a masked load for threads processing the boundary to avoid memory corruption.

Comment on lines +1233 to +1235
vec_t<float, vec_size> q_nope_vec;
q_nope_vec.cast_load(q_nope_in_ptr + tx * vec_size);
q_nope_vec.cast_store(q_nope_out_ptr + tx * vec_size);

critical

The vectorized load and store for q_nope can lead to out-of-bounds memory access if no_rope_dim is not a multiple of rope_chunk_size (rope_dim). This happens when processing the last, partial chunk, as the vectorized operations do not account for boundary conditions. This can cause memory corruption.

You should add boundary checks for both the load and store operations. A similar pattern can be found in RopeQuantizeKernel which uses scale_store_partial_chunk for this purpose. The k_nope and v processing sections in this kernel have a similar issue.
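The guarded access this comment asks for can be modeled scalar-style in Python. This is a sketch of the boundary condition only; `scale_store_partial_chunk`'s actual signature in `pos_enc.cuh` may differ:

```python
def store_partial_chunk(src, dst, lane_offset, vec_size, chunk_valid, scale=1.0):
    # Each "lane" handles vec_size consecutive elements starting at
    # lane_offset; only indices below chunk_valid hold real data, so the
    # boundary lanes must mask their loads/stores instead of running
    # full-width and touching memory past the slice.
    for i in range(vec_size):
        idx = lane_offset + i
        if idx < chunk_valid:
            dst[idx] = src[idx] * scale
```

With a split like rope_dim=128 and no_rope_dim=64, lanes of the last no-RoPE chunk whose offset falls at or past `chunk_valid` write nothing, instead of corrupting adjacent memory.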


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/rope.py`:
- Around line 1809-1825: The function rope_append_paged_kv_cache currently only
checks 4D and dtype but must validate that the cache layout and all metadata
lengths match expected dimensions before calling the kernel: verify kv_layout is
a valid TensorLayout key (use TensorLayout[kv_layout]), ensure paged_kv_cache
shapes match the expected axes for the chosen kv_layout/page_size/num_kv_heads
(e.g. confirm num_kv_heads matches the corresponding axis in k_cache and v_cache
and that page_size is consistent with the paging axis), and check that
batch_indices, positions, kv_indices, kv_indptr and kv_last_page_len have
lengths and value ranges consistent with batch size, sequence length and number
of pages (no index exceeds axis sizes); if any check fails raise a descriptive
ValueError. Use the existing symbols paged_kv_cache, k_cache, v_cache,
kv_layout, page_size, num_kv_heads, batch_indices, positions, kv_indices,
kv_indptr and kv_last_page_len to locate and implement these validations in
rope_append_paged_kv_cache before dispatching to the kernel.

In `@include/flashinfer/pos_enc.cuh`:
- Around line 1175-1194: The no-RoPE branch loads/stores full vec_size
unconditionally (k_nope_vec.cast_load / cast_store) which overruns when rope_dim
< bdx; guard the load/store by computing the remaining elements for this chunk
(using rope_dim, elem_offset, tx and vec_size) and if the chunk is partial call
the existing scale_store_partial_chunk(...) helper (as RopeQuantizeKernel does)
to safely load/scale/store only the valid lanes, otherwise perform the full
cast_load/cast_store path; update the k_nope_in pointer use (k_nope_in +
get_elem_offset_impl(...)) and the destination from paged_kv.get_k_ptr(...)
accordingly so partial writes never exceed the slice.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2a5d426d-063d-4a6a-b144-2e5ff38bc61c

📥 Commits

Reviewing files that changed from the base of the PR and between 46fd825 and 9ca74cf.

📒 Files selected for processing (5)
  • csrc/flashinfer_rope_binding.cu
  • csrc/rope.cu
  • flashinfer/__init__.py
  • flashinfer/rope.py
  • include/flashinfer/pos_enc.cuh

Comment on lines +1809 to +1825
if len(paged_kv_cache) != 2:
raise ValueError("paged_kv_cache must be a tuple of (k_cache, v_cache)")
k_cache, v_cache = paged_kv_cache
if k_cache.ndim != 4 or v_cache.ndim != 4:
raise ValueError("rope_append_paged_kv_cache expects 4D GQA/MHA cache tensors")
if k_cache.dtype != v_cache.dtype:
raise ValueError("k_cache and v_cache must have the same dtype")

from .utils import TensorLayout

kv_layout_code = TensorLayout[kv_layout].value
batch_indices = batch_indices.int()
positions = positions.int()
kv_indices = kv_indices.int()
kv_indptr = kv_indptr.int()
kv_last_page_len = kv_last_page_len.int()


⚠️ Potential issue | 🟠 Major

Validate cache layout and metadata sizes before dispatch.

This path only checks that the caches are 4D and same dtype. A wrong kv_layout/page_size/num_kv_heads combination, or short pos_ids/batch_indices/positions, still reaches the kernel and will misaddress page writes. Fail fast here with explicit axis and length checks.

🛡️ Suggested validation
     if len(paged_kv_cache) != 2:
         raise ValueError("paged_kv_cache must be a tuple of (k_cache, v_cache)")
     k_cache, v_cache = paged_kv_cache
     if k_cache.ndim != 4 or v_cache.ndim != 4:
         raise ValueError("rope_append_paged_kv_cache expects 4D GQA/MHA cache tensors")
     if k_cache.dtype != v_cache.dtype:
         raise ValueError("k_cache and v_cache must have the same dtype")
+    head_dim = q_rope.shape[-1] + q_nope.shape[-1]
+    if kv_layout == "NHD":
+        expected_tail = (page_size, num_kv_heads, head_dim)
+    elif kv_layout == "HND":
+        expected_tail = (num_kv_heads, page_size, head_dim)
+    else:
+        raise ValueError(f"unsupported kv_layout: {kv_layout}")
+    if k_cache.shape[0] != v_cache.shape[0]:
+        raise ValueError("k_cache and v_cache must have the same number of pages")
+    if tuple(k_cache.shape[1:]) != expected_tail or tuple(v_cache.shape[1:]) != expected_tail:
+        raise ValueError(
+            f"cache shape/layout mismatch: expected (*, {expected_tail[0]}, "
+            f"{expected_tail[1]}, {expected_tail[2]}) for kv_layout={kv_layout}"
+        )
+    if pos_ids.numel() != nnz or batch_indices.numel() != nnz or positions.numel() != nnz:
+        raise ValueError("pos_ids, batch_indices, and positions must all have length nnz")
+    if kv_indptr.numel() != kv_last_page_len.numel() + 1:
+        raise ValueError("kv_indptr must have length batch_size + 1")
 
     from .utils import TensorLayout

Comment on lines +1175 to +1194
} else if (by < k_nope_end) {
uint32_t k_head_idx = (by - k_rope_end) / no_rope_chunks;
uint32_t nope_chunk_idx = (by - k_rope_end) % no_rope_chunks;
uint32_t elem_offset = nope_chunk_idx * rope_chunk_size;

DType* k_nope_in_ptr = k_nope_in + get_elem_offset_impl(idx, k_head_idx, elem_offset,
k_nope_in_stride, k_nope_in_stride_h);

vec_t<float, vec_size> k_nope_vec;
k_nope_vec.cast_load(k_nope_in_ptr + tx * vec_size);
if constexpr (kNeedsScale) {
#pragma unroll
for (uint32_t i = 0; i < vec_size; ++i) {
k_nope_vec[i] = k_nope_vec[i] * kv_scale;
}
}

CacheType* k_ptr = paged_kv.get_k_ptr(page_iter, k_head_idx, entry_idx,
rope_dim + elem_offset + tx * vec_size);
k_nope_vec.cast_store(k_ptr);

⚠️ Potential issue | 🔴 Critical

Guard partial no-RoPE chunks before vector load/store.

bdx is sized from rope_dim, but these K/Q no-RoPE branches always load/store a full vec_size. With common splits like rope_dim=128, no_rope_dim=64, the extra threads in the no-RoPE section read and write past the end of the slice. Reuse scale_store_partial_chunk(...) here the same way RopeQuantizeKernel already handles no-RoPE tails.

🛠️ Tail-safe fix
-      vec_t<float, vec_size> k_nope_vec;
-      k_nope_vec.cast_load(k_nope_in_ptr + tx * vec_size);
-      if constexpr (kNeedsScale) {
-#pragma unroll
-        for (uint32_t i = 0; i < vec_size; ++i) {
-          k_nope_vec[i] = k_nope_vec[i] * kv_scale;
-        }
-      }
-
-      CacheType* k_ptr = paged_kv.get_k_ptr(page_iter, k_head_idx, entry_idx,
-                                            rope_dim + elem_offset + tx * vec_size);
-      k_nope_vec.cast_store(k_ptr);
+      const uint32_t chunk_valid =
+          (elem_offset < no_rope_dim) ? min(rope_chunk_size, no_rope_dim - elem_offset) : 0u;
+      const uint32_t lane_elem_offset = tx * vec_size;
+      CacheType* k_ptr =
+          paged_kv.get_k_ptr(page_iter, k_head_idx, entry_idx, rope_dim + elem_offset);
+      scale_store_partial_chunk<DType, CacheType, vec_size>(
+          k_nope_in_ptr, k_ptr, lane_elem_offset, chunk_valid, kNeedsScale ? kv_scale : 1.f);
@@
-      vec_t<float, vec_size> q_nope_vec;
-      q_nope_vec.cast_load(q_nope_in_ptr + tx * vec_size);
-      q_nope_vec.cast_store(q_nope_out_ptr + tx * vec_size);
+      const uint32_t chunk_valid =
+          (elem_offset < no_rope_dim) ? min(rope_chunk_size, no_rope_dim - elem_offset) : 0u;
+      const uint32_t lane_elem_offset = tx * vec_size;
+      scale_store_partial_chunk<DType, DType, vec_size>(
+          q_nope_in_ptr, q_nope_out_ptr, lane_elem_offset, chunk_valid, 1.f);

Also applies to: 1220-1235



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/attention/test_rope.py (2)

1546-1565: Exercise a non-unit kv_scale in the FP8 branch.

Both append calls hardcode kv_scale=1.0, so use_fp8_cache=True only validates plain cast-to-FP8. That misses the cache-side scaling behavior this API is adding.

💡 Minimal coverage change
     kv_cache_dtype = torch.float8_e4m3fn if use_fp8_cache else input_dtype
+    kv_scale = 0.5 if use_fp8_cache else 1.0
 ...
     flashinfer.rope.rope_append_paged_kv_cache(
         q_rope_existing,
         k_rope_existing,
@@
-        kv_scale=1.0,
+        kv_scale=kv_scale,
         is_neox=False,
         enable_pdl=enable_pdl,
     )
@@
     q_rope_out_new, q_nope_out_new = flashinfer.rope.rope_append_paged_kv_cache(
         q_rope_new,
         k_rope_new,
@@
-        kv_scale=1.0,
+        kv_scale=kv_scale,
         is_neox=False,
         enable_pdl=enable_pdl,
     )
@@
-    k_ref_tokens_new = k_ref_new.to(kv_cache_dtype)
-    v_ref_tokens_new = v_new.to(kv_cache_dtype)
+    k_ref_tokens_new = (
+        (k_ref_new * kv_scale).to(kv_cache_dtype)
+        if use_fp8_cache
+        else k_ref_new.to(kv_cache_dtype)
+    )
+    v_ref_tokens_new = (
+        (v_new * kv_scale).to(kv_cache_dtype)
+        if use_fp8_cache
+        else v_new.to(kv_cache_dtype)
+    )

Also applies to: 1631-1650, 1678-1686
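Why a non-unit kv_scale is worth covering can be seen from a toy model of the cache-side write (pure Python; the clamp stands in for the real float8_e4m3fn cast, whose rounding is not modeled here):

```python
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def write_fp8_with_scale(values, kv_scale):
    # Scale before the narrow cast: values that would saturate FP8 can be
    # brought into range by kv_scale < 1 (and rescaled at read time).
    out = []
    for v in values:
        scaled = v * kv_scale
        out.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, scaled)))
    return out
```

With kv_scale=1.0 this path degenerates to a plain cast, which is exactly what the review says the current test only validates.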


1546-1565: Assert the bootstrap append as well.

The first rope_append_paged_kv_cache call seeds the cache, but the test never checks its returned Q tensors or the written K/V values. The later “unchanged” assertions only prove the second append did not overwrite whatever step 1 produced, so an empty-cache append bug would still pass here.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/attention/test_rope.py`:
- Around line 1393-1420: The test test_rope_append_paged_kv_cache_decode
currently allocates CUDA FP8 caches when use_fp8_cache is True without skipping
unsupported GPUs; update the top of that test (inside the function) to check GPU
capability using flashinfer.utils.get_compute_capability() together with
flashinfer.utils.is_sm90a_supported() and
flashinfer.utils.is_sm100a_supported(), and call pytest.skip(...) when
use_fp8_cache is True and the compute capability is not SM90A/SM100A (or
otherwise unsupported), so the FP8 allocation is only attempted on supported
architectures; ensure the check runs before any FP8-specific dtype or tensor
allocations.

---

Nitpick comments:
In `@tests/attention/test_rope.py`:
- Around line 1546-1565: The test calls to
flashinfer.rope.rope_append_paged_kv_cache always pass kv_scale=1.0, so the FP8
branch never exercises non-trivial cache-side scaling; update the test calls
(the instances invoking rope_append_paged_kv_cache with parameters like
q_rope_existing, k_rope_existing, v_existing, rope_ref.cos_sin_cache,
pos_ids_existing, (k_cache, v_cache), kv_page_indices_existing,
kv_page_indptr_existing, kv_last_page_len_existing, batch_indices_existing,
positions_existing, page_size, kv_layout, is_neox, enable_pdl) to include a
non-unit kv_scale (e.g. 0.5 or 2.0) whenever use_fp8_cache / FP8 path is being
tested so the code path that applies cache-side scaling is executed and
validated; apply the same change for the other occurrences of the same call
pattern mentioned in the comment.
- Around line 1546-1565: The first rope_append_paged_kv_cache call is used to
seed the cache but its outputs and the K/V writes are never asserted; capture
its returned Q (and any other returned tensors) from the initial call to
flashinfer.rope.rope_append_paged_kv_cache (the call that passes
q_rope_existing, k_rope_existing, q_nope_existing, k_nope_existing, v_existing,
rope_ref.cos_sin_cache, pos_ids_existing, (k_cache, v_cache),
kv_page_indices_existing, kv_page_indptr_existing, kv_last_page_len_existing,
batch_indices_existing, positions_existing, page_size, kv_layout, kv_scale=1.0,
is_neox=False, enable_pdl) and add explicit assertions that those returned Q
tensors equal the expected Q (e.g., copies of q_rope_existing transformed as in
the test) and that the underlying k_cache and v_cache (or the persisted pages
referenced by kv_page_indices_existing/kv_page_indptr_existing) contain the
expected K and V values written by the first call, so the subsequent “unchanged”
checks prove the second append didn’t just preserve an empty initial state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d903e76-4ce3-4054-8575-bb92e6aa73f3

📥 Commits

Reviewing files that changed from the base of the PR and between 9ca74cf and 28bbd30.

📒 Files selected for processing (1)
  • tests/attention/test_rope.py

@kahyunnam
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !425 has been created, and the CI pipeline #46402900 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[CANCELED] Pipeline #46402900: canceled

@kahyunnam
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !425 has been created, and the CI pipeline #46541900 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[SUCCESS] Pipeline #46541900: 14/20 passed
