Added the cudnn backend Ragged KV Cache wrapper by Anerudhan · Pull Request #2352 · flashinfer-ai/flashinfer

Anerudhan · 2026-01-14T07:53:24Z

📌 Description

Added the cudnn backend Ragged KV Cache wrapper
Fixed test_cudnn_prefill.py so it no longer initializes tensors with torch.ones (that was introduced by accident earlier)

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

[x] I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
[x] All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

New Features
- Added explicit cuDNN-native backend support and enhanced batched prefill to handle variable-length sequences and indptr-based offsets for cuDNN paths.
Tests
- Expanded and hardened cuDNN tests: broader parameter sets, randomized initializations, larger workspace, and a wrapper-based test flow for more reliable validation.
Documentation
- Updated README and benchmark backend lists to include cuDNN-native and reflect backend support changes.


coderabbitai · 2026-01-14T07:53:35Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Extends BatchPrefillWithPagedKVCacheWrapper.plan/run to accept sequence length and indptr parameters, buffer indptrs, and add a cuDNN-specific execution path; tests and benchmarks updated to exercise cudnn-native and wrapper-based ragged/paged KV prefill flows.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Prefill Wrapper Extension `flashinfer/prefill.py` | Extend `BatchPrefillWithPagedKVCacheWrapper.plan(...)` with `seq_lens`, `seq_lens_q`, `max_token_per_sequence`, `max_sequence_kv`, `v_indptr`, `o_indptr`; add internal `_v_indptr_buf`/`_o_indptr_buf` and sequence attributes; add cudnn-specific branch in `run()` that reshapes seq tensors and calls `cudnn_batch_prefill_with_kv_cache`. |
| Ragged Wrapper & Tests `tests/attention/test_cudnn_prefill_deepseek.py`, `flashinfer/...` (new class) | Introduce/align usage of `BatchPrefillWithRaggedKVCacheWrapper` (new public wrapper) in deepseek test: expanded parameterization, skip invalid head combos, increased workspace, refactor from direct cudnn call to `wrapper.plan(...)` and `wrapper.run(...)`; adjust indptr construction and remove direct batch_offsets usage. |
| Unit Test Init Update `tests/attention/test_cudnn_prefill.py` | Change `q` and `kv_cache` initializations from `torch.ones` to `torch.randn` while preserving shape/dtype and existing reshaping. |
| Benchmarks & Backend Mapping `benchmarks/README.md`, `benchmarks/routines/attention.py`, `benchmarks/routines/flashinfer_benchmark_utils.py` | Add and document `cudnn-native` backend alongside existing `cudnn` (wrapper API); update backend support matrices and add cudnn-native execution branches (with FP8 usage gated/skipped where unsupported). |

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant Wrapper as BatchPrefillWithPagedKVCacheWrapper
    participant Run as run()
    participant CuDNN as cudnn_batch_prefill_with_kv_cache
    participant Module as other_batch_prefill_backend

    Test->>Wrapper: plan(..., seq_lens, seq_lens_q, v_indptr, o_indptr, max_...)
    Wrapper->>Wrapper: store seq attrs and indptr buffers

    Test->>Run: run(q, k_cache, v_cache)
    alt backend == "cudnn"
        Run->>Run: reshape/expand seq tensors if 1D -> 4D
        Run->>CuDNN: cudnn_batch_prefill_with_kv_cache(seq_lens_q, seq_lens_kv, max_token_per_sequence, max_sequence_kv, qo_indptr, kv_indptr, v_indptr, o_indptr, ...)
        CuDNN-->>Run: outputs (and lse if requested)
    else
        Run->>Module: call existing batch_prefill path
        Module-->>Run: outputs
    end
    Run-->>Test: return outputs

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Added an initial implementation of Q and KV Cache in fp8 and to use t… #2035: Modifies cudnn prefill/run flow and related parameter passing, overlapping with sequence/indptr forwarding changes.
Rebase FP8 SM100 Cutlass FMHA Attention to main (original PR#1238) #2047: Touches the same prefill wrapper methods to propagate FP8 scales through plan/run paths.
refactor: update fa3 codebase and fix hopper unittest [part 1] #2111: Changes paged/ragged KV parameter handling and kernel dispatch that intersect with this wrapper’s plan/run updates.

Suggested reviewers

yzh119
cyx-6
nvmbreughe
kahyunnam
jimmyzho

Poem

🐰 Soft paws tap and buffers hop,

seqs and indptrs line up on top,
cuDNN hums, wrappers steer the run,
outputs hop out, one by one—
a tiny rabbit celebrates the fun.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main objective: adding cuDNN backend Ragged KV Cache wrapper support, matching the primary change across multiple affected files. |
| Description check | ✅ Passed | Description covers main changes and mentions pre-commit checks and tests, but lacks detail on motivation, related issues, or comprehensive impact summary beyond brief statements. |


gemini-code-assist · 2026-01-14T07:53:44Z

Summary of Changes

Hello @Anerudhan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new cuDNN backend into the Ragged KV Cache prefill functionality, aiming to leverage NVIDIA's optimized deep neural network primitives for improved efficiency. The core changes involve extending the BatchPrefillWithRaggedKVCacheWrapper to support cuDNN-specific parameters and dispatching to the cuDNN implementation when selected. Additionally, the associated test suite has been updated to validate the new wrapper and includes a minor correction for more robust tensor initialization.

Highlights

cuDNN Backend Integration: Introduced a new cuDNN backend for the Ragged KV Cache prefill operation, enhancing performance and compatibility for attention mechanisms.
API Extension for cuDNN: The plan method of BatchPrefillWithRaggedKVCacheWrapper now accepts several new parameters (seq_lens, seq_lens_q, max_token_per_sequence, max_sequence_kv, v_indptr, o_indptr) specifically required for the cuDNN backend.
Test Refactoring and Fixes: Updated test_cudnn_prefill_deepseek.py to utilize the new BatchPrefillWithRaggedKVCacheWrapper and expanded test coverage. A minor fix in test_cudnn_prefill.py changed tensor initialization from torch.ones to torch.randn for improved test robustness.


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request adds a cudnn backend for the Ragged KV Cache wrapper and updates tests. The changes are a good addition, but I've identified a few issues in the implementation. These include a critical bug that could cause a runtime error due to an undefined variable, a hardcoded value that should be dynamic, and a minor docstring formatting issue. I've also pointed out an API inconsistency that could be confusing for users. My review includes specific suggestions to address these points.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/attention/test_cudnn_prefill_deepseek.py (1)
44-78: Add GPU architecture check for cuDNN backend support.

The indptr calculations are correct—they create element-level offsets (cumsum × head_dim × num_heads) which match cuDNN's batch_offsets_* expectations. However, the test is missing a GPU architecture validation check. According to coding guidelines, test implementations should use flashinfer.utils functions to skip tests on unsupported GPU architectures. Similar to test_cudnn_prefill.py, this test should check compute capability before running, as cuDNN prefill is backend-specific and may not be supported on all GPU architectures.

Add at the beginning of the test function:
from flashinfer.utils import get_compute_capability
And after the seed/device setup:
major, _ = get_compute_capability(torch.device(device))
if major < 10:  # or your minimum supported compute capability
    pytest.skip(f"cuDNN prefill not supported on compute capability {major}")
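For reference, the element-level offsets mentioned above (cumsum × head_dim × num_heads) can be sketched in plain Python. This is an illustration only: `token_indptr` and `element_indptr` are hypothetical helpers, not FlashInfer functions, and the sketch assumes a packed layout in which every token occupies `num_heads * head_dim` contiguous elements.

```python
from itertools import accumulate


def token_indptr(seq_lens):
    """Token-count indptr: [0, len0, len0 + len1, ...]."""
    return [0, *accumulate(seq_lens)]


def element_indptr(seq_lens, num_heads, head_dim):
    """Element-level indptr (hypothetical helper).

    Assumes each token occupies num_heads * head_dim contiguous elements,
    so every token offset is scaled by that stride.
    """
    stride = num_heads * head_dim
    return [offset * stride for offset in token_indptr(seq_lens)]


seq_lens = [3, 5, 2]
print(token_indptr(seq_lens))            # [0, 3, 8, 10]
print(element_indptr(seq_lens, 8, 128))  # [0, 3072, 8192, 10240]
```

The token-count form is what the reference wrapper in the test is described as using; the scaled form is the one the cuDNN path is described as expecting.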

🤖 Fix all issues with AI agents

In `@flashinfer/prefill.py`:
- Around line 2701-2714: The docstring currently runs "disable_split_kv" and
"seq_lens:" together causing bad rendering; edit the relevant function/class
docstring to insert a blank line (newline) before the parameter block so
"seq_lens:" starts on its own paragraph. Locate the docstring containing
"disable_split_kv" and "seq_lens" and ensure there is an empty line between the
prose and the parameter list (preserving existing indentation and parameter
descriptions like seq_lens_q, max_token_per_sequence, max_sequence_kv, v_indptr,
o_indptr).
- Around line 3087-3111: The cuDNN call assumes required size parameters but may
receive None; before calling cudnn_batch_prefill_with_kv_cache check that
self._max_token_per_sequence and self._max_sequence_kv are not None (and
optionally >0) and raise a clear ValueError (or AssertionError) referencing the
function call context if they are missing; update the caller (the method
invoking cudnn_batch_prefill_with_kv_cache in prefill.py) to perform this
validation early (and include the attribute names _max_token_per_sequence and
_max_sequence_kv in the error message) so the backend is never invoked with
invalid parameters.
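
A minimal sketch of the early validation requested above, assuming the two sizes are plain integers or None. `check_cudnn_sizes` is a hypothetical helper written for illustration; it is not part of `flashinfer/prefill.py`, and the exact error wording is an assumption.

```python
def check_cudnn_sizes(max_token_per_sequence, max_sequence_kv):
    """Fail fast if a size required by the cuDNN path was never planned.

    Hypothetical helper: the two arguments mirror the wrapper attributes
    _max_token_per_sequence and _max_sequence_kv named in the comment above.
    """
    sizes = {
        "_max_token_per_sequence": max_token_per_sequence,
        "_max_sequence_kv": max_sequence_kv,
    }
    missing = [name for name, value in sizes.items() if value is None]
    if missing:
        raise ValueError(
            "cuDNN backend requires plan() to set: " + ", ".join(missing)
        )
    non_positive = [name for name, value in sizes.items() if value <= 0]
    if non_positive:
        raise ValueError(
            "cuDNN backend requires positive values for: " + ", ".join(non_positive)
        )


check_cudnn_sizes(256, 512)  # valid sizes pass silently
```

Running the check before the backend call keeps the failure in Python, with the attribute names in the message, instead of surfacing inside the cuDNN call.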

🧹 Nitpick comments (2)

tests/attention/test_cudnn_prefill_deepseek.py (2)
1-21: Missing GPU architecture skip check per coding guidelines.

The test should use flashinfer.utils functions to skip tests on unsupported GPU architectures. The cuDNN backend may not be supported on all GPUs.
Proposed fix
 import pytest
 import torch
 
 import flashinfer
+from flashinfer.utils import get_compute_capability
 
 
 @pytest.mark.parametrize("batch_size", [1, 4])
 @pytest.mark.parametrize("s_qo", [32, 64, 87, 256])
 @pytest.mark.parametrize("s_kv", [32, 87, 512])
 @pytest.mark.parametrize("num_kv_heads", [1, 4])
 @pytest.mark.parametrize("num_qo_heads", [1, 8])
 @pytest.mark.parametrize("causal", [True, False])
 def test_cudnn_prefill_deepseek(
     batch_size, s_qo, s_kv, num_kv_heads, num_qo_heads, causal
 ):
     if s_qo > s_kv:
         pytest.skip("s_qo > s_kv, skipping test as causal")
 
     if num_qo_heads < num_kv_heads:
         pytest.skip("num_qo_heads < num_kv_heads, skipping test")
+
+    device = "cuda:0"
+    major, _ = get_compute_capability(torch.device(device))
+    if major < 8:
+        pytest.skip(f"cuDNN backend requires compute capability >= 8.0, got {major}")
80-87: Consider removing commented-out code.

The commented-out batch_offsets_stats block appears to be dead code. If it's no longer needed, consider removing it to keep the test clean.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0277fd and 6fcf574dde224351d8fa45f7e58d82af4bbff1c0.

📒 Files selected for processing (3)

flashinfer/prefill.py
tests/attention/test_cudnn_prefill.py
tests/attention/test_cudnn_prefill_deepseek.py

🧰 Additional context used

📓 Path-based instructions (2)

flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

flashinfer/prefill.py

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test implementations should use flashinfer.utils functions (get_compute_capability, is_sm90a_supported, is_sm100a_supported, etc.) to skip tests on unsupported GPU architectures
For testing with mpirun on multi-GPU systems, use the pattern: mpirun -np <num_gpus> pytest tests/path/to/test.py::test_function
Avoid OOM (out-of-memory) errors in tests by using appropriate problem sizes - tests/conftest.py provides auto-skipping for OOM tests as a safety net but should not be relied upon

Files:

tests/attention/test_cudnn_prefill.py
tests/attention/test_cudnn_prefill_deepseek.py

🧬 Code graph analysis (2)

flashinfer/prefill.py (1)

flashinfer/cudnn/prefill.py (1)

cudnn_batch_prefill_with_kv_cache (555-724)

tests/attention/test_cudnn_prefill.py (1)

flashinfer/logits_processor/types.py (1)

size (132-136)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Deploy Docs

🔇 Additional comments (9)

tests/attention/test_cudnn_prefill_deepseek.py (2)

109-133: LGTM!

The cuDNN wrapper setup and plan call correctly use the new API parameters for ragged KV cache support.

150-173: LGTM!

The reference wrapper comparison is correctly set up with standard token-count indptrs and appropriate tolerances for bfloat16 comparison.

flashinfer/prefill.py (5)

2601-2606: LGTM!

New optional parameters are correctly typed and positioned for cuDNN backend support while maintaining backward compatibility.

2803-2812: LGTM!

The indptr buffer assignments correctly handle optional parameters with sensible fallbacks to existing buffers.

2824-2827: LGTM!

Instance attributes are correctly stored for use in the cuDNN execution path.

2860-2863: LGTM!

The conditional correctly skips batch prefill module initialization for the cuDNN backend.

2870-2896: LGTM!

The plan info generation correctly skips the cuDNN backend.

tests/attention/test_cudnn_prefill.py (2)

47-49: LGTM!

Using torch.randn instead of torch.ones provides better test coverage with more realistic input distributions while maintaining reproducibility through the seed.

63-73: LGTM!

Using torch.randn for KV cache initialization improves test robustness by avoiding the edge case of all-ones values.
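
To illustrate why all-ones inputs are a weak test, here is a small pure-Python sketch of a single attention row. It is not the test code: `attention_row` is a hypothetical helper, and fixed non-constant values stand in for `torch.randn`. Because softmax weights sum to 1, a weighted average of constant values is that constant no matter which scores are used, so a bug that scrambles the scores goes unnoticed.

```python
import math


def attention_row(scores, values):
    """Softmax-weighted average of scalar values for one query row."""
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


scores = [0.0, 1.0, 2.0]
wrong_scores = list(reversed(scores))  # stand-in for an indexing bug

ones = [1.0, 1.0, 1.0]
varied = [1.0, 2.0, 3.0]  # stand-in for randomly initialized values

# With all-ones values both orderings give 1.0, so the bug is invisible.
ones_hide_bug = math.isclose(
    attention_row(scores, ones), attention_row(wrong_scores, ones)
)
# With non-constant values the two orderings give different outputs.
varied_expose_bug = not math.isclose(
    attention_row(scores, varied), attention_row(wrong_scores, varied)
)
print(ones_hide_bug, varied_expose_bug)  # True True
```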


bkryu

Generally looking good, but left some minor comments.

Additionally, I created a branch on my end and added this commit to add microbenchmarking support for

cudnn --> cuDNN via wrapper API
cudnn-native --> cudnn_batch_prefill_with_kv_cache

Do you mind copy-pasting or cherrypicking the changes from the linked commit?

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

bkryu · 2026-01-16T06:04:22Z

/bot run

flashinfer-bot · 2026-01-16T06:05:35Z

GitLab MR !244 has been created, and the CI pipeline #41861590 is currently running. I'll report back once the pipeline job completes.

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

benchmarks/routines/attention.py (1)
1681-1704: Respect --no_cuda_graph for the cudnn-native path.

is_cuda_graph_compatible is hardcoded to True, ignoring the CLI flag and diverging from other backends. Pass the computed flag instead.
🔧 Use the computed flag
-                is_cuda_graph_compatible=True,
+                is_cuda_graph_compatible=is_cuda_graph_compatible,

🤖 Fix all issues with AI agents

In `@benchmarks/README.md`:
- Line 19: The new sub-bullet "Also supports computationally similar
`cudnn_batch_prefill_with_kv_cache` (cudnn-native) and
`trtllm_ragged_attention_deepseek`" has inconsistent indentation causing MD007
failures; edit benchmarks/README.md to match the indentation/level used by the
other "Also supports" entries (make this line align with the other sibling
bullets under that list so it uses the same number of spaces or tab characters
as the other "Also supports" lines).

In `@benchmarks/routines/attention.py`:
- Around line 1399-1408: The decode routine's cudnn-native filter lacks the
CUDNN_AVAILABLE guard and may select "cudnn-native" when cuDNN isn't present;
update the block that inspects backends, q_dtype, kv_dtype and
remove_cudnn_native to first check the CUDNN_AVAILABLE flag (same guard used in
prefill), skipping/removing "cudnn-native" immediately if CUDNN_AVAILABLE is
false before evaluating FP8 dtype constraints (refer to variables/backends list,
q_dtype, kv_dtype, remove_cudnn_native and the "cudnn-native" string).

In `@flashinfer/prefill.py`:
- Around line 2601-2606: In BatchPrefillWithRaggedKVCacheWrapper.plan(),
seq_lens_q is left as None despite the docstring saying it should default to
seq_lens; this causes later calls to self._seq_lens_q.dim() to crash. Fix by
assigning seq_lens_q = seq_lens when seq_lens_q is None (same fallback used in
BatchPrefillWithPagedKVCacheWrapper), and ensure the method stores the resolved
value to self._seq_lens_q before any use; reference the plan() method and
variables seq_lens_q and seq_lens in the BatchPrefillWithRaggedKVCacheWrapper
class.

In `@tests/attention/test_cudnn_prefill_deepseek.py`:
- Line 107: The test hardcodes a 512MB workspace (workspace_buffer) which can
OOM on smaller GPUs; replace the fixed size with a safe cap based on the
device's total memory by querying
torch.cuda.get_device_properties(device).total_memory and computing a
workspace_size_bytes = min(512*1024*1024, int(total_mem * 0.1)) (or another safe
fraction like 0.05), then allocate workspace_buffer =
torch.empty(workspace_size_bytes, dtype=torch.int8, device=device) so the buffer
scales to the GPU and reduces OOM risk.
- Around line 7-20: Before allocating tensors in test_cudnn_prefill_deepseek,
add gates to skip the test if no CUDA device or cuDNN is available and if the
GPU's compute capability or free memory is insufficient for the 512MB workspace;
specifically check torch.cuda.is_available(),
torch.backends.cudnn.is_available(), and torch.cuda.get_device_capability() (or
device major/minor) and optionally
torch.cuda.get_device_properties().total_memory/free memory to skip when the
device lacks required SM capability or memory; place these checks at the top of
test_cudnn_prefill_deepseek (before any use of s_qo, s_kv, num_qo_heads,
num_kv_heads or tensor allocations) so the test is skipped early on unsupported
hardware.
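
The workspace cap suggested above is simple arithmetic, sketched here without a GPU. `capped_workspace_bytes` is a hypothetical helper; in the test, the total memory would come from `torch.cuda.get_device_properties(device).total_memory` as the comment describes, and the 10% fraction is the value proposed there rather than a tuned number.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def capped_workspace_bytes(total_mem_bytes, cap_bytes=512 * MIB, fraction=0.1):
    """Workspace size: the fixed cap, or a fraction of device memory if smaller."""
    return min(cap_bytes, int(total_mem_bytes * fraction))


print(capped_workspace_bytes(80 * GIB))  # 536870912 (the 512 MiB cap applies)
print(capped_workspace_bytes(4 * GIB))   # 429496729 (10% of 4 GiB is smaller)
```

On large devices the result matches the current hardcoded 512 MB, so only small GPUs see a reduced buffer.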

♻️ Duplicate comments (1)

flashinfer/prefill.py (1)

3079-3085: Potential None deref / UnboundLocalError in cuDNN run path.

self._seq_lens_q.dim() is called before a None check and batch_size can be undefined when _seq_lens_q is missing or not 1D. This remains a crash risk.
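
One way to remove both failure modes is to resolve the fallback and the batch size in a single place before either is used. The sketch below uses plain lists as stand-ins for 1D tensors; `resolve_seq_lens` is a hypothetical helper, not the wrapper's actual code.

```python
def resolve_seq_lens(seq_lens, seq_lens_q=None):
    """Return (seq_lens_q, seq_lens_kv, batch_size) with the fallback applied.

    Plain lists stand in for 1D tensors; this helper is hypothetical.
    """
    if seq_lens is None:
        raise ValueError("seq_lens is required for the cuDNN backend")
    if seq_lens_q is None:
        # Documented default: query lengths fall back to seq_lens.
        seq_lens_q = seq_lens
    if len(seq_lens_q) != len(seq_lens):
        raise ValueError("seq_lens_q and seq_lens must have the same batch size")
    # batch_size is always bound here, whichever branch ran above.
    return seq_lens_q, seq_lens, len(seq_lens_q)


q_lens, kv_lens, batch_size = resolve_seq_lens([32, 87, 64])
print(q_lens, kv_lens, batch_size)  # [32, 87, 64] [32, 87, 64] 3
```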

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2cdb5f62c59b5467fdcce1f4dbfb5cca263c587a and f6ca31b.

📒 Files selected for processing (6)

benchmarks/README.md
benchmarks/routines/attention.py
benchmarks/routines/flashinfer_benchmark_utils.py
flashinfer/prefill.py
tests/attention/test_cudnn_prefill.py
tests/attention/test_cudnn_prefill_deepseek.py

🚧 Files skipped from review as they are similar to previous changes (1)

tests/attention/test_cudnn_prefill.py

🧰 Additional context used

📓 Path-based instructions (2)

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Test implementations should use flashinfer.utils functions (get_compute_capability, is_sm90a_supported, is_sm100a_supported, etc.) to skip tests on unsupported GPU architectures
For testing with mpirun on multi-GPU systems, use the pattern: mpirun -np <num_gpus> pytest tests/path/to/test.py::test_function
Avoid OOM (out-of-memory) errors in tests by using appropriate problem sizes - tests/conftest.py provides auto-skipping for OOM tests as a safety net but should not be relied upon

Files:

tests/attention/test_cudnn_prefill_deepseek.py

flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

flashinfer/prefill.py

🧠 Learnings (3)

📚 Learning: 2025-12-30T09:34:39.900Z

Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to tests/**/*.py : Test implementations should use `flashinfer.utils` functions (`get_compute_capability`, `is_sm90a_supported`, `is_sm100a_supported`, etc.) to skip tests on unsupported GPU architectures

Applied to files:

benchmarks/routines/flashinfer_benchmark_utils.py

📚 Learning: 2025-12-30T09:34:39.900Z

Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

benchmarks/routines/flashinfer_benchmark_utils.py

📚 Learning: 2025-12-30T09:34:39.900Z

Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Use `FLASHINFER_CUDA_ARCH_LIST` environment variable to specify target GPU architectures (e.g., '8.0 9.0a') and `FLASHINFER_NVCC_THREADS` to control parallel compilation threads

Applied to files:

benchmarks/README.md

🧬 Code graph analysis (2)

benchmarks/routines/attention.py (1)

flashinfer/prefill.py (7)

BatchPrefillWithRaggedKVCacheWrapper (2337-3201)

run (2019-2031)

run (2034-2046)

run (2049-2288)

run (2935-2945)

run (2948-2958)

run (2961-3170)

flashinfer/prefill.py (1)

flashinfer/cudnn/prefill.py (1)

cudnn_batch_prefill_with_kv_cache (555-724)

🪛 markdownlint-cli2 (0.18.1)

benchmarks/README.md

19-19: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Deploy Docs

🔇 Additional comments (5)

benchmarks/routines/attention.py (1)

1623-1650: cudnn wrapper planning integration looks solid.

flashinfer/prefill.py (2)

2481-2484: Doc updates for cuDNN parameters are clear.

Also applies to: 2701-2713

2803-2812: Buffering v/o indptrs and skipping module planning for cuDNN looks good.

Also applies to: 2860-2871

benchmarks/routines/flashinfer_benchmark_utils.py (1)

223-231: Backend support matrix update aligns with cudnn-native integration.

benchmarks/README.md (1)

283-285: cudnn-native legend/table updates read well.

Also applies to: 293-294, 319-320


coderabbitai · 2026-01-16T06:10:28Z

        - Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` and  `trtllm_batch_context_with_kv_cache`.
    - `BatchPrefillWithRaggedKVCacheWrapper` - Prefill attention with ragged KV cache.
-        - Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` and  `trtllm_ragged_attention_deepseek`.
+        - Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` (cudnn-native) and  `trtllm_ragged_attention_deepseek`.


⚠️ Potential issue | 🟡 Minor

Fix unordered list indentation to satisfy MD007.

Markdownlint flags inconsistent indentation on this new sub-bullet; align it with the other “Also supports” entries.

🧹 Markdownlint fix

-        - Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` (cudnn-native) and `trtllm_ragged_attention_deepseek`.
+    - Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` (cudnn-native) and `trtllm_ragged_attention_deepseek`.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` (cudnn-native) and `trtllm_ragged_attention_deepseek`.

- Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` (cudnn-native) and `trtllm_ragged_attention_deepseek`.

🧰 Tools

🪛 markdownlint-cli2 (0.18.1)

19-19: Unordered list indentation
Expected: 4; Actual: 8

(MD007, ul-indent)

🤖 Prompt for AI Agents

In `@benchmarks/README.md` at line 19, The new sub-bullet "Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` (cudnn-native) and `trtllm_ragged_attention_deepseek`" has inconsistent indentation causing MD007 failures; edit benchmarks/README.md to match the indentation/level used by the other "Also supports" entries (make this line align with the other sibling bullets under that list so it uses the same number of spaces or tab characters as the other "Also supports" lines).

coderabbitai · 2026-01-16T06:10:28Z

+    if "cudnn-native" in backends:
+        remove_cudnn_native = False
+        if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2] or kv_dtype in [
+            torch.float8_e4m3fn,
+            torch.float8_e5m2,
+        ]:
+            print("[INFO] CUDNN-native backend does not support FP8. Skipping.")
+            remove_cudnn_native = True
+        if remove_cudnn_native:
+            backends.remove("cudnn-native")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and imports
head -50 benchmarks/routines/attention.py

Repository: flashinfer-ai/flashinfer

Length of output: 1213

🏁 Script executed:

# Search for CUDNN_AVAILABLE or similar constants in the file
rg "CUDNN_AVAILABLE|cudnn_available|is_cuda" benchmarks/routines/attention.py

Repository: flashinfer-ai/flashinfer

Length of output: 1207

🏁 Script executed:

# Search for CUDNN_AVAILABLE across the entire codebase
rg "CUDNN_AVAILABLE" --type py

Repository: flashinfer-ai/flashinfer

Length of output: 1417

🏁 Script executed:

# Check for torch.backends.cudnn availability checks
rg "torch\.backends\.cudnn|CUDNN|cudnn" benchmarks/routines/attention.py | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 1323

🏁 Script executed:

# Let's look at the complete context around lines 1399-1408
sed -n '1380,1420p' benchmarks/routines/attention.py

Repository: flashinfer-ai/flashinfer

Length of output: 1633

🏁 Script executed:

# Get exact line numbers where CUDNN_AVAILABLE check appears in relation to cudnn-native
rg -n "cudnn-native|CUDNN_AVAILABLE" benchmarks/routines/attention.py | grep -A 10 -B 5 "cudnn-native"

Repository: flashinfer-ai/flashinfer

Length of output: 829

🏁 Script executed:

# Get the complete cudnn and cudnn-native backend sections with line numbers
sed -n '1370,1450p' benchmarks/routines/attention.py | cat -n | sed 's/^[[:space:]]*//' | sed 's/^/1370+/'

Repository: flashinfer-ai/flashinfer

Length of output: 3843

🏁 Script executed:

# Check context around the first cudnn-native block (lines 784-798)
sed -n '770,810p' benchmarks/routines/attention.py

Repository: flashinfer-ai/flashinfer

Length of output: 1801

🏁 Script executed:

# Check context around the second cudnn-native block (lines 1399-1408)
sed -n '1390,1415p' benchmarks/routines/attention.py

Repository: flashinfer-ai/flashinfer

Length of output: 1030

Add cuDNN availability guard to cudnn-native backend filter.

The decode routine (lines 1399-1408) checks for FP8 dtype compatibility but lacks the CUDNN_AVAILABLE guard that exists in the prefill routine. If cuDNN is not available, the backend will fail at runtime. Add the availability check before evaluating dtype constraints.

Suggested fix

     if "cudnn-native" in backends:
         remove_cudnn_native = False
+        if not CUDNN_AVAILABLE or CUDNN_BACKEND_VERSION < 91701:
+            print("[INFO] cuDNN not available. Skipping cudnn-native backend.")
+            remove_cudnn_native = True
-        if q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2] or kv_dtype in [
+        elif q_dtype in [torch.float8_e4m3fn, torch.float8_e5m2] or kv_dtype in [
             torch.float8_e4m3fn,
             torch.float8_e5m2,
         ]:
             print("[INFO] CUDNN-native backend does not support FP8. Skipping.")
             remove_cudnn_native = True

🤖 Prompt for AI Agents

In `@benchmarks/routines/attention.py` around lines 1399 - 1408, The decode routine's cudnn-native filter lacks the CUDNN_AVAILABLE guard and may select "cudnn-native" when cuDNN isn't present; update the block that inspects backends, q_dtype, kv_dtype and remove_cudnn_native to first check the CUDNN_AVAILABLE flag (same guard used in prefill), skipping/removing "cudnn-native" immediately if CUDNN_AVAILABLE is false before evaluating FP8 dtype constraints (refer to variables/backends list, q_dtype, kv_dtype, remove_cudnn_native and the "cudnn-native" string).
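
The guard ordering described above (availability first, then FP8 constraints) can be sketched as a standalone function. This is an illustration under stated assumptions: `filter_cudnn_native` is a hypothetical helper, dtypes are represented as strings instead of `torch` dtypes so the sketch runs without PyTorch, and the version check from the suggested fix is left out.

```python
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}


def filter_cudnn_native(backends, q_dtype, kv_dtype, cudnn_available):
    """Return the backend list with "cudnn-native" removed when unusable.

    Hypothetical helper: availability is checked before dtype constraints,
    mirroring the guard order requested for the benchmark routine.
    """
    if "cudnn-native" not in backends:
        return list(backends)
    if not cudnn_available:
        reason = "cuDNN not available"
    elif q_dtype in FP8_DTYPES or kv_dtype in FP8_DTYPES:
        reason = "cudnn-native does not support FP8"
    else:
        return list(backends)
    print(f"[INFO] {reason}. Skipping cudnn-native backend.")
    return [name for name in backends if name != "cudnn-native"]


backends = ["fa2", "cudnn", "cudnn-native"]
print(filter_cudnn_native(backends, "bfloat16", "bfloat16", cudnn_available=True))
# ['fa2', 'cudnn', 'cudnn-native']
print(filter_cudnn_native(backends, "bfloat16", "float8_e4m3fn", cudnn_available=True))
# ['fa2', 'cudnn']
```

Returning a new list avoids mutating the caller's `backends` while it may still be iterated.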

coderabbitai · 2026-01-16T06:10:28Z

+        seq_lens: Optional[torch.Tensor] = None,
+        seq_lens_q: Optional[torch.Tensor] = None,
+        max_token_per_sequence: Optional[int] = None,
+        max_sequence_kv: Optional[int] = None,
+        v_indptr: Optional[torch.Tensor] = None,
+        o_indptr: Optional[torch.Tensor] = None,


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

git ls-files flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 89

🏁 Script executed:

# Read lines around 2601-2606 and surrounding context
sed -n '2590,2620p' flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 1480

🏁 Script executed:

# Read lines around 2824-2828 and surrounding context
sed -n '2810,2840p' flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 1268

🏁 Script executed:

# Search for where seq_lens_q is first used in function definitions with docstrings
rg -A 20 "seq_lens_q" flashinfer/prefill.py | head -100

Repository: flashinfer-ai/flashinfer

Length of output: 3474

🏁 Script executed:

# Let's look at the function containing lines 2601-2606
sed -n '2570,2640p' flashinfer/prefill.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 3802

🏁 Script executed:

#!/bin/bash
# Find all occurrences of seq_lens_q assignment and read context
rg -B 5 -A 5 "self\._seq_lens_q\s*=" flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 2286

🏁 Script executed:

# Search for seq_lens_q in docstrings or comments
rg "seq_lens_q" flashinfer/prefill.py -B 2 -A 2 | head -150

Repository: flashinfer-ai/flashinfer

Length of output: 5119

🏁 Script executed:

#!/bin/bash
# Get line count first
total_lines=$(wc -l < flashinfer/prefill.py)
echo "Total lines: $total_lines"
# Read the section around 2601-2606
echo "=== Lines 2590-2620 ==="
sed -n '2590,2620p' flashinfer/prefill.py
echo ""
echo "=== Lines 2810-2840 ==="
sed -n '2810,2840p' flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 2748

🏁 Script executed:

#!/bin/bash
# Find the method definitions around the problem areas
echo "=== Around line 2600 (first occurrence) ==="
sed -n '2565,2580p' flashinfer/prefill.py
echo ""
echo "=== Around line 2800 (second occurrence) ==="
sed -n '2790,2810p' flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 1607

🏁 Script executed:

#!/bin/bash
# Find the seq_lens_q documentation
echo "=== Docstring for seq_lens_q around line 2671 ==="
sed -n '2665,2680p' flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 1194

🏁 Script executed:

#!/bin/bash
# Search for class definitions in the file
rg "^class " flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 154

🏁 Script executed:

#!/bin/bash
# Count occurrences of the problematic pattern
echo "=== Pattern 1: self._seq_lens_q = seq_lens_q (without fallback) ==="
rg -n "self\._seq_lens_q = seq_lens_q$" flashinfer/prefill.py
echo ""
echo "=== Pattern 2: self._seq_lens_q = seq_lens_q if (with fallback) ==="
rg -n "self\._seq_lens_q = seq_lens_q if" flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 330

🏁 Script executed:

#!/bin/bash
# Find the class definitions and their line numbers
echo "=== Class definitions with line numbers ==="
rg -n "^class " flashinfer/prefill.py
echo ""
echo "=== Location of line 1988 assignment ==="
sed -n '1980,1995p' flashinfer/prefill.py | cat -n
echo ""
echo "=== Location of line 2824 assignment ==="
sed -n '2820,2830p' flashinfer/prefill.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 1522

🏁 Script executed:

#!/bin/bash
# Search for seq_lens_q in docstrings specifically mentioning the default
rg -B 2 -A 2 "seq_lens_q.*will be set to" flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 50

🏁 Script executed:

#!/bin/bash
# Read the full docstring section describing seq_lens_q
sed -n '2620,2720p' flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 7077

🏁 Script executed:

#!/bin/bash
# Find which method line 1988 belongs to
echo "=== Method containing line 1988 ==="
sed -n '1900,1990p' flashinfer/prefill.py | grep -E "(def |class )" | tail -1
echo ""
echo "=== Method containing line 2824 ==="
sed -n '2550,2824p' flashinfer/prefill.py | grep -E "(def |class )" | tail -1
echo ""
echo "=== Verify line 2824 is in plan method ==="
sed -n '2570,2580p' flashinfer/prefill.py

Repository: flashinfer-ai/flashinfer

Length of output: 446

Set seq_lens_q fallback in BatchPrefillWithRaggedKVCacheWrapper.plan() to match documented API contract.

The docstring states that seq_lens_q defaults to seq_lens when not provided, but the assignment at line 2824 leaves it as None. This causes crashes on the cuDNN backend when the code later calls self._seq_lens_q.dim(). The BatchPrefillWithPagedKVCacheWrapper class already implements the correct fallback pattern; apply the same fix here.

Suggested fix

-        self._seq_lens_q = seq_lens_q
         self._seq_lens_kv = seq_lens
+        self._seq_lens_q = seq_lens_q if seq_lens_q is not None else seq_lens

🤖 Prompt for AI Agents

In `@flashinfer/prefill.py` around lines 2601 - 2606, In BatchPrefillWithRaggedKVCacheWrapper.plan(), seq_lens_q is left as None despite the docstring saying it should default to seq_lens; this causes later calls to self._seq_lens_q.dim() to crash. Fix by assigning seq_lens_q = seq_lens when seq_lens_q is None (same fallback used in BatchPrefillWithPagedKVCacheWrapper), and ensure the method stores the resolved value to self._seq_lens_q before any use; reference the plan() method and variables seq_lens_q and seq_lens in the BatchPrefillWithRaggedKVCacheWrapper class.
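The fallback described in this comment can be shown with a small stand-in. The sketch below is an assumption-laden illustration, not FlashInfer's implementation: `RaggedPlanSketch` is a hypothetical class, and plain Python sequences take the place of `torch.Tensor` arguments so that it runs without PyTorch.

```python
from typing import Optional, Sequence


class RaggedPlanSketch:
    """Hypothetical stand-in for the wrapper's plan() bookkeeping."""

    def __init__(self) -> None:
        self._seq_lens_q: Optional[Sequence[int]] = None
        self._seq_lens_kv: Optional[Sequence[int]] = None

    def plan(
        self,
        seq_lens: Optional[Sequence[int]] = None,
        seq_lens_q: Optional[Sequence[int]] = None,
    ) -> None:
        # Resolve the documented default before storing, so later code can
        # rely on _seq_lens_q being set whenever seq_lens was given.
        self._seq_lens_q = seq_lens_q if seq_lens_q is not None else seq_lens
        self._seq_lens_kv = seq_lens
```

The explicit `is not None` test matters for real tensors: a truthiness check such as `seq_lens_q or seq_lens` would raise for a tensor with more than one element.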

coderabbitai · 2026-01-16T06:10:28Z

 @pytest.mark.parametrize("batch_size", [1, 4])
-@pytest.mark.parametrize("s_qo", [32, 64, 87])
-@pytest.mark.parametrize("s_kv", [32, 64, 87])
-@pytest.mark.parametrize("num_kv_heads", [1])
-@pytest.mark.parametrize("num_qo_heads", [1, 16])
+@pytest.mark.parametrize("s_qo", [32, 64, 87, 256])
+@pytest.mark.parametrize("s_kv", [32, 87, 512])
+@pytest.mark.parametrize("num_kv_heads", [1, 4])
+@pytest.mark.parametrize("num_qo_heads", [1, 8])
 @pytest.mark.parametrize("causal", [True, False])
 def test_cudnn_prefill_deepseek(
    batch_size, s_qo, s_kv, num_kv_heads, num_qo_heads, causal
 ):
    if s_qo > s_kv:
        pytest.skip("s_qo > s_kv, skipping test as causal")

+    if num_qo_heads < num_kv_heads:
+        pytest.skip("num_qo_heads < num_kv_heads, skipping test")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the test file and read it
cat -n tests/attention/test_cudnn_prefill_deepseek.py

Repository: flashinfer-ai/flashinfer

Length of output: 5877

🏁 Script executed:

# Check what's available in flashinfer.utils
find . -name "utils.py" -o -name "utils" -type d | grep flashinfer | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 301

🏁 Script executed:

# Look for get_compute_capability and related functions in flashinfer
rg "get_compute_capability|is_sm90a_supported|is_sm100a_supported" --type py

Repository: flashinfer-ai/flashinfer

Length of output: 17993

🏁 Script executed:

cat -n tests/attention/test_cudnn_prefill.py | head -50

Repository: flashinfer-ai/flashinfer

Length of output: 1740

🏁 Script executed:

# Check flashinfer.utils for cuDNN-specific checks
rg "cudnn|get_compute_capability" flashinfer/utils.py -A 3 -B 1

Repository: flashinfer-ai/flashinfer

Length of output: 2283

🏁 Script executed:

# Check if there's documentation or comments about cuDNN GPU requirements
rg "cuDNN|cudnn.*sm|cudnn.*arch|cudnn.*compute" --type py -i -B 2 -A 2 | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 4924

🏁 Script executed:

# Check conftest to see if there are global skip mechanisms for cuDNN
cat -n tests/conftest.py 2>/dev/null | head -100

Repository: flashinfer-ai/flashinfer

Length of output: 4344

🏁 Script executed:

# Continue reading conftest
tail -n +100 tests/conftest.py | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 3055

🏁 Script executed:

# Look for OOM skip mechanisms mentioned in the coding guidelines
rg "OOM|out.of.memory|OutOfMemory" --type py tests/

Repository: flashinfer-ai/flashinfer

Length of output: 451

🏁 Script executed:

# Check if there are any SM requirements documented in flashinfer for cuDNN prefill
rg "cudnn.*prefill|prefill.*cudnn" --type py -B 3 -A 3

Repository: flashinfer-ai/flashinfer

Length of output: 6803

🏁 Script executed:

# Check what compute capabilities are actually needed by looking at BatchPrefillWithRaggedKVCacheWrapper
rg "BatchPrefillWithRaggedKVCacheWrapper" --type py -l

Repository: flashinfer-ai/flashinfer

Length of output: 836

🏁 Script executed:

# Check the actual prefill.py to see if there are SM requirements
grep -n "cudnn\|backend" flashinfer/prefill.py | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 1608

🏁 Script executed:

# Read the full test_cudnn_prefill.py to see what GPU checks it has
cat -n tests/attention/test_cudnn_prefill.py | grep -A 30 "def test_cudnn_prefill"

Repository: flashinfer-ai/flashinfer

Length of output: 1982

🏁 Script executed:

# Check flashinfer/cudnn/__init__.py to understand cuDNN availability
cat -n flashinfer/cudnn/__init__.py

Repository: flashinfer-ai/flashinfer

Length of output: 189

🏁 Script executed:

# Check if there's a CUDNN_AVAILABLE flag or similar
rg "CUDNN_AVAILABLE" --type py

Repository: flashinfer-ai/flashinfer

Length of output: 1417

Add GPU capability and cuDNN availability checks to gate test execution.

This test uses the cuDNN backend explicitly but lacks guards for GPU architecture support and cuDNN availability. As per coding guidelines, tests must skip on unsupported hardware. The 512MB workspace allocation can also cause OOM on smaller GPUs. Add checks before tensor allocations.

🧪 Suggested skip guards

 import pytest
 import torch

 import flashinfer
+from flashinfer.utils import get_compute_capability
+
+try:
+    import cudnn  # type: ignore
+    CUDNN_AVAILABLE = True
+except (ImportError, OSError):
+    CUDNN_AVAILABLE = False

 @pytest.mark.parametrize("batch_size", [1, 4])
@@ -26,6 +34,14 @@ def test_cudnn_prefill_deepseek(
     batch_size, s_qo, s_kv, num_kv_heads, num_qo_heads, causal
 ):
     if s_qo > s_kv:
         pytest.skip("s_qo > s_kv, skipping test as causal")
     if num_qo_heads < num_kv_heads:
         pytest.skip("num_qo_heads < num_kv_heads, skipping test")
+
+    if not CUDNN_AVAILABLE:
+        pytest.skip("cuDNN not available")
+    major, _ = get_compute_capability(torch.device("cuda:0"))
+    if major < 8:
+        pytest.skip("cuDNN prefill requires SM80+")

🤖 Prompt for AI Agents

In `@tests/attention/test_cudnn_prefill_deepseek.py` around lines 7 - 20, Before allocating tensors in test_cudnn_prefill_deepseek, add gates to skip the test if no CUDA device or cuDNN is available and if the GPU's compute capability or free memory is insufficient for the 512MB workspace; specifically check torch.cuda.is_available(), torch.backends.cudnn.is_available(), and torch.cuda.get_device_capability() (or device major/minor) and optionally torch.cuda.get_device_properties().total_memory/free memory to skip when the device lacks required SM capability or memory; place these checks at the top of test_cudnn_prefill_deepseek (before any use of s_qo, s_kv, num_qo_heads, num_kv_heads or tensor allocations) so the test is skipped early on unsupported hardware.
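The gating logic above can be separated from the hardware queries so that it is easy to check on its own. The following is a rough sketch under stated assumptions: `skip_reason` and `cudnn_python_available` are hypothetical helpers, the SM80 minimum is taken from the suggested guards and not verified against cuDNN's requirements, and the caller is expected to supply the CUDA and compute-capability values.

```python
import importlib.util
from typing import Optional


def cudnn_python_available() -> bool:
    """Return True if a top-level `cudnn` module can be located.

    find_spec only locates the module; it does not import it, so this does
    not catch load-time failures such as a missing shared library.
    """
    return importlib.util.find_spec("cudnn") is not None


def skip_reason(
    cuda_available: bool,
    cudnn_available: bool,
    cc_major: int,
    min_cc_major: int = 8,  # assumption: SM80+ as in the suggested guard
) -> Optional[str]:
    """Hypothetical helper: return a skip message, or None to run the test."""
    if not cuda_available:
        return "CUDA device not available"
    if not cudnn_available:
        return "cuDNN not available"
    if cc_major < min_cc_major:
        return f"cuDNN prefill requires SM{min_cc_major}0+"
    return None
```

In a test body, the returned string would be passed to `pytest.skip(...)` before any tensors are allocated, which keeps the expensive setup out of runs that cannot succeed.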

coderabbitai · 2026-01-16T06:10:28Z

    scale = float(1.0 / (head_dim_qk**0.5))

-    workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.int8, device=device)
+    workspace_buffer = torch.empty(512 * 1024 * 1024, dtype=torch.int8, device=device)


⚠️ Potential issue | 🟡 Minor

Right-size the workspace buffer to reduce OOM risk.

Hardcoding a 512MB workspace can exhaust memory on smaller GPUs. Consider capping it relative to device memory. As per coding guidelines, avoid OOM-prone test sizes.

💡 Safer workspace sizing

-    workspace_buffer = torch.empty(512 * 1024 * 1024, dtype=torch.int8, device=device)
+    total_mem = torch.cuda.get_device_properties(device).total_memory
+    workspace_bytes = min(512 * 1024 * 1024, total_mem // 8)
+    workspace_buffer = torch.empty(workspace_bytes, dtype=torch.int8, device=device)


🤖 Prompt for AI Agents

In `@tests/attention/test_cudnn_prefill_deepseek.py` at line 107, The test hardcodes a 512MB workspace (workspace_buffer) which can OOM on smaller GPUs; replace the fixed size with a safe cap based on the device's total memory by querying torch.cuda.get_device_properties(device).total_memory and computing a workspace_size_bytes = min(512*1024*1024, int(total_mem * 0.1)) (or another safe fraction like 0.05), then allocate workspace_buffer = torch.empty(workspace_size_bytes, dtype=torch.int8, device=device) so the buffer scales to the GPU and reduces OOM risk.
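To see what the proposed cap does across device sizes, the arithmetic can be run without a GPU. This is only a sketch: `capped_workspace_bytes` is a hypothetical helper, and the divisor of 8 mirrors the `total_mem // 8` in the suggested change rather than a measured requirement.

```python
MIB = 1024 * 1024


def capped_workspace_bytes(
    total_mem_bytes: int,
    cap_bytes: int = 512 * MIB,
    divisor: int = 8,
) -> int:
    """Hypothetical helper: workspace size capped by a share of device memory."""
    if total_mem_bytes <= 0 or divisor <= 0:
        raise ValueError("total_mem_bytes and divisor must be positive")
    return min(cap_bytes, total_mem_bytes // divisor)


for gib in (2, 4, 80):
    size = capped_workspace_bytes(gib * 1024 * MIB)
    print(f"{gib} GiB device -> {size // MIB} MiB workspace")
```

With these defaults a 2 GiB device gets a 256 MiB workspace, while devices with 4 GiB or more keep the full 512 MiB, so the cap only changes behavior on small GPUs.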

bkryu

Thanks @Anerudhan, LGTM. Unit test failures are unrelated

flashinfer-bot · 2026-01-17T10:37:00Z

[FAILED] Pipeline #41861590: 14/20 passed

Anerudhan requested review from bkryu, cyx-6, jimmyzho, nvmbreughe and yzh119 as code owners January 14, 2026 07:53

gemini-code-assist Bot reviewed Jan 14, 2026

coderabbitai Bot reviewed Jan 14, 2026

bkryu reviewed Jan 16, 2026

Anerudhan and others added 4 commits January 15, 2026 21:56

Added the deepseek sizes to the Ragged KV Cache wrapper

1670099

Apply suggestions from code review

06ca655

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update the docstring

d0b6b55

Add cudnn via wrapper to benchmark

f6ca31b

Anerudhan force-pushed the feat/cudnn_deepseek_sizes branch from 2cdb5f6 to f6ca31b Compare January 16, 2026 05:57

Anerudhan requested review from jiahanc and kahyunnam as code owners January 16, 2026 05:57

coderabbitai Bot reviewed Jan 16, 2026

bkryu approved these changes Jan 16, 2026

yzh119 merged commit 820cea1 into flashinfer-ai:main Jan 18, 2026
19 checks passed

wenscarl mentioned this pull request Jan 22, 2026

[Feature] Integrate new flashinfer optimizations for DeepSeekV3 sgl-project/sglang#14453

Open

hjjq mentioned this pull request Feb 3, 2026

[Tracking issue]: Integrate flashinfer optimizations (DeepSeek) vllm-project/vllm#33733

Open

9 tasks

coderabbitai Bot mentioned this pull request Feb 9, 2026

pick fa2 for BatchDecodeWithPagedKVCacheWrapper auto backend #2530

Merged

2 tasks

coderabbitai Bot mentioned this pull request Feb 23, 2026

allow cudnn to be chosen for prefill #2622

Open

5 tasks

This was referenced Mar 19, 2026

[fix] bugfix 1419: Add batch size shape validation in decode and prefill run() APIs #2801

Merged

feat: Add Option For Fixed cta_tile_q #2830

Open

coderabbitai Bot mentioned this pull request Apr 10, 2026

trtllm non causal support #3020

Merged

5 tasks

coderabbitai Bot mentioned this pull request Apr 12, 2026

feat: Integrate CuTe DSL FMHA prefill kernels by loading cubin #3039

Merged

5 tasks

This was referenced Apr 21, 2026

feat(dllm): add Block Extend Attention for Diffusion LLM #2722

Open

cute-dsl fmha prefill (cubin integration): remove front-padding, add attention_sink, and pdl support #3181

Merged

	- Also supports computationally similar `cudnn_batch_prefill_with_kv_cache` (cudnn-native) and `trtllm_ragged_attention_deepseek`.

