cute-dsl fmha prefill (cubin integration): remove front-padding, add attention_sink, and pdl support#3181
Sequence Diagram(s)
sequenceDiagram
participant Test as Client/Test
participant Prefill as prefill.trtllm_ragged_attention_deepseek
participant Wrapper as cute_dsl_fmha_ragged_prefill
participant Kernel as TVM/native FMHA Kernel
participant KV as KV Cache
Test->>Prefill: call(inputs, attention_sinks?, enable_pdl?)
Prefill->>Wrapper: forward q/k/v/o, lse, attention_sinks, enable_pdl
Wrapper->>Kernel: prepare 5D q/k/v/o, 4D lse, pass attention_sinks & enable_pdl
Kernel-->>Wrapper: compute output (and optional lse)
Wrapper->>KV: write outputs into KV cache
Wrapper-->>Prefill: return/write output (and lse if produced)
Prefill-->>Test: return output (and lse if produced)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Code Review
This pull request removes the front-padding requirement for the cute-dsl backend, simplifying tensor allocation across benchmarks, tests, and the core implementation. It also introduces support for attention sinks and updates tensor reshaping logic to 5D/4D for kernel compatibility. Review feedback identifies a potential risk of RuntimeErrors when using .view() on non-contiguous tensors and points out that the enable_pdl flag is currently being ignored despite the removal of its warning.
I am having trouble creating individual review comments. Click here to see my feedback.
flashinfer/attention/cute_dsl/fmha.py (451-457)
The use of .view() on q, k, v, o, and lse assumes that these tensors are contiguous in memory. If any of these tensors are non-contiguous (e.g., resulting from a transpose or slice operation), this will raise a RuntimeError. Since this is a public API, it is safer to use .reshape() which handles non-contiguous tensors by creating a copy if necessary, or explicitly call .contiguous() before .view() if you want to ensure no silent copies are made without the user's knowledge. Additionally, an explicit check that H_k > 0 and H_q % H_k == 0 would provide a more helpful error message than the one produced by a failed view/reshape call.
if H_k == 0 or H_q % H_k != 0:
    raise ValueError(f"Invalid GQA configuration: H_q={H_q}, H_k={H_k}")
h_r = H_q // H_k
q_5d = q.reshape(1, total_q, H_k, h_r, D)
k_5d = k.reshape(1, total_kv, H_k, 1, D)
v_5d = v.reshape(1, total_kv, H_k, 1, D_v)
o_5d = o.reshape(1, total_q, H_k, h_r, D_v)
# LSE: (1, total_q, h_k, h_r), 4D row-major.
lse_4d = lse.reshape(1, total_q, H_k, h_r) if lse is not None else None
flashinfer/prefill.py (3828-3832)
The PR title and description claim to add PDL support for the cute-dsl backend, and the warning for enable_pdl was removed here. However, the enable_pdl parameter is not passed to cute_dsl_fmha_ragged_prefill (see line 3845 in the updated file), and that function's signature in flashinfer/attention/cute_dsl/fmha.py does not include an enable_pdl argument. This means the enable_pdl flag is currently silently ignored for the cute-dsl backend. If PDL is supported, it should be plumbed through to the kernel launch; otherwise, the warning should be retained.
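The "plumb it through or keep the warning" alternative above can be sketched as follows. This is only an illustration of the pattern: `launch_kernel`, `ragged_prefill`, and `BACKENDS_WITH_PDL` are hypothetical names, not FlashInfer APIs.

```python
import warnings

# Hypothetical: backends whose kernels accept a PDL flag.
BACKENDS_WITH_PDL = {"cute-dsl"}


def launch_kernel(backend, enable_pdl=False):
    """Stand-in for a kernel launch; reports what it was asked to do."""
    return {"backend": backend, "pdl": enable_pdl}


def ragged_prefill(backend, enable_pdl=None):
    """Forward enable_pdl when supported; warn instead of dropping it silently."""
    want_pdl = bool(enable_pdl)
    if want_pdl and backend not in BACKENDS_WITH_PDL:
        warnings.warn(
            f"enable_pdl is not supported by backend {backend!r}; ignoring it"
        )
        want_pdl = False
    return launch_kernel(backend, enable_pdl=want_pdl)
```

Either branch makes the flag's fate visible to the caller, which is the property the comment asks for.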
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/attention/test_trtllm_gen_attention.py (1)
2183-2204: ⚠️ Potential issue | 🔴 Critical
test_trtllm_gen_prefill_bs1 will TypeError: missing enable_sink argument.
test_trtllm_gen_prefill now requires enable_sink: bool (added at line 1855 with no default), but test_trtllm_gen_prefill_bs1 still calls it with only 9 positional args. Every parameterization of test_trtllm_gen_prefill_bs1 will fail with TypeError: test_trtllm_gen_prefill() missing 1 required positional argument: 'enable_sink' as soon as the function body runs.
🔧 Proposed fix
 @pytest.mark.parametrize("enable_pdl", [None])
+@pytest.mark.parametrize("enable_sink", [False])
 @pytest.mark.parametrize("max_q_len", [8192])
@@
 def test_trtllm_gen_prefill_bs1(
     backend: str,
     mla_dimensions: MLAHeadDimensions,
     batch_size: int,
     s_qo: int,
     s_kv: int,
     num_kv_heads: int,
     head_grp_size: int,
     causal: bool,
     skips_softmax: bool,
+    enable_sink: bool,
 ):
     test_trtllm_gen_prefill(
         backend,
         mla_dimensions,
         batch_size,
         s_qo,
         s_kv,
         num_kv_heads,
         head_grp_size,
         causal,
         skips_softmax,
+        enable_sink,
     )
Or, if the bs1 wrapper should always run with sinks disabled regardless, just hard-code False at the call site:
         skips_softmax,
+        False,  # enable_sink: kept disabled for bs1 until DKG sink semantics are aligned
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/attention/test_trtllm_gen_attention.py` around lines 2183 - 2204, The test wrapper test_trtllm_gen_prefill_bs1 calls test_trtllm_gen_prefill without the newly required enable_sink argument, causing a TypeError; update the call in test_trtllm_gen_prefill_bs1 to supply the enable_sink boolean (e.g., pass False if sinks should be disabled for bs1) or add enable_sink to the wrapper's signature and forward it through to test_trtllm_gen_prefill so the function receives the required parameter.
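The failure mode is ordinary Python call semantics and can be reproduced without pytest. In the minimal sketch below, `prefill_test` and the two wrappers are hypothetical stand-ins for test_trtllm_gen_prefill and test_trtllm_gen_prefill_bs1, with the argument list shortened for readability.

```python
def prefill_test(backend, causal, enable_sink):
    """Stand-in for a test function that gained a required argument."""
    return (backend, causal, enable_sink)


def bs1_wrapper_stale(backend, causal):
    # Stale call site: still passes the old, shorter argument list.
    return prefill_test(backend, causal)


def bs1_wrapper_fixed(backend, causal, enable_sink=False):
    # Fixed call site: forwards the new argument explicitly.
    return prefill_test(backend, causal, enable_sink)


try:
    bs1_wrapper_stale("cute-dsl", True)
    stale_error = None
except TypeError as exc:
    # The error only appears when the wrapper body runs, not at import time.
    stale_error = str(exc)
```

Because the error is raised at call time, test collection succeeds and every parametrized case of the wrapper fails individually.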
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6832df13-24ce-44a4-b06c-f171eea859c9
📥 Commits
Reviewing files that changed from the base of the PR and between 5e1318c and 6f32e8c6fe782c5581fb6d3ab9d767e024d73dde.
📒 Files selected for processing (4)
- benchmarks/routines/attention.py
- flashinfer/attention/cute_dsl/fmha.py
- flashinfer/prefill.py
- tests/attention/test_trtllm_gen_attention.py
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/attention/test_trtllm_gen_attention.py (1)
2169-2201: ⚠️ Potential issue | 🔴 Critical
Critical: test_trtllm_gen_prefill_bs1 no longer matches the new test_trtllm_gen_prefill signature.
test_trtllm_gen_prefill now requires enable_sink (line 1852, no default), and is parametrized over [False, True]. However, test_trtllm_gen_prefill_bs1:
- Does not parametrize enable_sink, and
- Calls test_trtllm_gen_prefill(...) with only 9 positional args (line 2191-2201); enable_sink is missing.
This will raise TypeError: missing 1 required positional argument: 'enable_sink' for every parametrized invocation of test_trtllm_gen_prefill_bs1, breaking the bs1 path entirely.
🐛 Proposed fix
 @pytest.mark.parametrize("enable_pdl", [None])
+@pytest.mark.parametrize("enable_sink", [False, True])
 @pytest.mark.parametrize("max_q_len", [511])
 ...
 def test_trtllm_gen_prefill_bs1(
     backend: str,
     mla_dimensions: MLAHeadDimensions,
     batch_size: int,
     s_qo: int,
     s_kv: int,
     num_kv_heads: int,
     head_grp_size: int,
     causal: bool,
     skips_softmax: bool,
+    enable_sink: bool,
 ):
     test_trtllm_gen_prefill(
         backend,
         mla_dimensions,
         batch_size,
         s_qo,
         s_kv,
         num_kv_heads,
         head_grp_size,
         causal,
         skips_softmax,
+        enable_sink,
     )
Alternatively, give enable_sink a default of False in test_trtllm_gen_prefill if bs1 should keep its current narrower coverage.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/attention/test_trtllm_gen_attention.py` around lines 2169 - 2201, The bs1 wrapper test_trtllm_gen_prefill_bs1 calls test_trtllm_gen_prefill but doesn’t provide the newly required enable_sink argument; update test_trtllm_gen_prefill_bs1 to either add a pytest.parametrize for enable_sink=[False,True] (to match the upstream signature) or pass an explicit enable_sink value when calling test_trtllm_gen_prefill (e.g., enable_sink=False) so the call supplies the missing parameter; refer to the function names test_trtllm_gen_prefill_bs1 and test_trtllm_gen_prefill and the parameter enable_sink when making the change.
🧹 Nitpick comments (1)
flashinfer/attention/cute_dsl/fmha.py (1)
376-388: Document the new enable_pdl parameter.
The function signature added enable_pdl (line 333) but the docstring's Parameters section ends at skip_softmax_threshold_scale_factor with no entry for enable_pdl. While here, the enable_tvm_ffi description still says “Default False” although the signature default is True (pre-existing).
📝 Suggested doc additions
     skip_softmax_threshold_scale_factor : float, optional
         Threshold scale factor for skip-softmax sparsity (https://arxiv.org/abs/2512.12087).
         The actual threshold = scale_factor / max_kv_len, then converted to log2 domain.
         None or 0 disables skip-softmax.
+    enable_pdl : bool, optional
+        If True, select the `_pdl` kernel variant and enable Programmatic
+        Dependent Launch at runtime. Default False.
     """
Based on learnings: "Keep documentation in sync with code changes, especially when modifying flashinfer_api, backend_requirement, or TVM-FFI macros".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/attention/cute_dsl/fmha.py` around lines 376 - 388, Update the docstring to document the new enable_pdl parameter and correct the enable_tvm_ffi default: add a Parameters entry for enable_pdl describing its type (bool), behavior (what enabling PDL does), and default value, and change the enable_tvm_ffi description to reflect its actual default (True) instead of "Default False"; locate the parameter list in the fmha function docstring near the existing enable_tvm_ffi and skip_softmax_threshold_scale_factor entries and add the new enable_pdl description right after them so the docs match the function signature.
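The skip-softmax threshold rule quoted in that docstring can be worked through numerically. The sketch below assumes that "converted to log2 domain" means taking the base-2 logarithm of the threshold; that reading, the function name, and the validation are illustrative assumptions, not the kernel's actual code.

```python
import math


def skip_softmax_threshold_log2(scale_factor, max_kv_len):
    """Return log2(scale_factor / max_kv_len), or None when skip-softmax is off."""
    # Per the docstring, None or 0 disables skip-softmax.
    if scale_factor is None or scale_factor == 0:
        return None
    if scale_factor < 0 or max_kv_len <= 0:
        raise ValueError("scale_factor must be >= 0 and max_kv_len must be > 0")
    threshold = scale_factor / max_kv_len
    # Assumed interpretation of "log2 domain": the base-2 log of the threshold.
    return math.log2(threshold)
```

For example, a scale factor of 1024 with max_kv_len of 4096 gives a threshold of 0.25, whose base-2 logarithm is -2.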
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/attention/cute_dsl/fmha.py`:
- Around line 514-522: The output tensor passed to from_dlpack() uses
assumed_align=32 but we only validate shape when a user provides out, so
non-32-byte-aligned user tensors can crash the kernel; add an explicit alignment
check and fail fast before calling from_dlpack() (e.g., check o_5d.data_ptr() %
32 == 0 and raise a clear ValueError) for both branches (is_fp8_out true and
false) and/or document the 32-byte alignment precondition in the function
docstring at the top of the file so callers know to provide a 32B-aligned
tensor.
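The alignment precondition comes down to pointer arithmetic. The sketch below uses plain integers in place of real device pointers; in the actual wrapper the address would come from something like `o_5d.data_ptr()`, as the comment suggests. The helper names are hypothetical.

```python
def is_aligned(address, alignment=32):
    """True when a byte address is a multiple of `alignment`."""
    return address % alignment == 0


def view_address(base_address, element_offset, itemsize):
    """Byte address of a view starting `element_offset` elements into a buffer."""
    return base_address + element_offset * itemsize


def check_output_alignment(address, alignment=32):
    """Fail fast with a clear message instead of crashing inside the kernel."""
    if not is_aligned(address, alignment):
        raise ValueError(
            f"output pointer 0x{address:x} is not {alignment}-byte aligned"
        )


# A 32-byte-aligned base buffer holding 2-byte elements (e.g. bf16).
base = 0x7F0000000000
sliced_by_8 = view_address(base, element_offset=8, itemsize=2)    # +16 bytes
sliced_by_16 = view_address(base, element_offset=16, itemsize=2)  # +32 bytes
```

This shows why a sliced view is the typical offender: slicing 8 two-byte elements into an aligned buffer moves the start by 16 bytes, which satisfies 16-byte but not 32-byte alignment.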
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c26fe6ec-5bfd-4032-9da0-a8baaf55c4e7
📥 Commits
Reviewing files that changed from the base of the PR and between 6f32e8c6fe782c5581fb6d3ab9d767e024d73dde and 9dfbf655d37af8f4c340b5053aa57ef9fa8dfa4e.
📒 Files selected for processing (3)
- flashinfer/attention/cute_dsl/fmha.py
- flashinfer/prefill.py
- tests/attention/test_trtllm_gen_attention.py
🚧 Files skipped from review as they are similar to previous changes (1)
- flashinfer/prefill.py
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/attention/cute_dsl/fmha.py (1)
377-389: ⚠️ Potential issue | 🟡 Minor
Stale docstring: enable_tvm_ffi default and missing enable_pdl entry.
Two issues in this docstring block:
- Line 379 states "Default False (CuTe native ABI)", but the signature on line 328 defaults to True. The default flipped without the docstring being updated.
- The enable_pdl parameter (line 333) is not documented at all.
📝 Proposed docstring fix
     enable_tvm_ffi : bool
         If True, use TVM-FFI ABI (pass data_ptr() for Pointer args, torch.Tensor
-        for Tensor args, no explicit stream). Default False (CuTe native ABI).
+        for Tensor args, no explicit stream). Default True. Set to False to use
+        the CuTe native ABI path.
     max_qo_len : int, optional
         Maximum query sequence length. Computed from qo_indptr if not provided.
         Pass this from plan() to avoid D2H copy during CUDA graph capture.
     max_kv_len : int, optional
         Maximum KV sequence length. Computed from kv_indptr if not provided.
     skip_softmax_threshold_scale_factor : float, optional
         Threshold scale factor for skip-softmax sparsity (https://arxiv.org/abs/2512.12087).
         The actual threshold = scale_factor / max_kv_len, then converted to log2 domain.
         None or 0 disables skip-softmax.
+    enable_pdl : bool
+        If True, select the PDL-enabled kernel variant and pass the PDL flag
+        through to the kernel call. Default False.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/attention/cute_dsl/fmha.py` around lines 377 - 389, The docstring for the FMHA constructor/function is stale: update the enable_tvm_ffi parameter description to match the actual signature default (set to True in the code) and clarify its behavior (TVM-FFI ABI vs CuTe native ABI), and add a new documented entry for enable_pdl (include its type, default value from the signature, and a short description of what enabling PDL changes or controls). Reference the parameter names enable_tvm_ffi and enable_pdl (and the surrounding docstring block in fmha.py) so the text matches the function signature defaults and explains their effects concisely.
🧹 Nitpick comments (2)
flashinfer/attention/cute_dsl/fmha.py (2)
244-275: Document the new enable_sink and use_pdl parameters.
Two new keyword parameters were added to get_cute_dsl_fmha_kernel (lines 241-242) but the docstring (lines 244-275) doesn't describe them. While at it, varlen and with_lse are also undocumented in this docstring. As per coding guidelines (retrieved learning): "Keep documentation in sync with code changes."
📝 Proposed docstring additions
     enable_skip_softmax : bool
         If True, load kernel compiled with skip-softmax support.
+    enable_sink : bool
+        If True, load kernel variant compiled with attention-sink support.
+    use_pdl : bool
+        If True, load kernel variant compiled with Programmatic Dependent Launch (PDL) support.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/attention/cute_dsl/fmha.py` around lines 244 - 275, Update the get_cute_dsl_fmha_kernel docstring to document the two new keyword parameters enable_sink and use_pdl and also add missing docs for varlen and with_lse: describe enable_sink (bool) meaning and effect on kernel execution, describe use_pdl (bool) and how it changes PDL/parameters or compilation behavior, describe varlen (bool) semantics for variable-length sequences, and describe with_lse (bool) for whether the kernel variant also produces the log-sum-exp (LSE) output; reference the function name get_cute_dsl_fmha_kernel and ensure each parameter entry follows the existing docstring style (type and short description) alongside the other parameters.
455-467: Optional: validate H_q % H_k == 0 and contiguity for clearer error messages.
The new 5D reshape relies on (a) H_q being divisible by H_k and (b) q/k/v/o being contiguous. Both are typically true for callers from prefill.py, but if a caller passes a non-contiguous view or a head ratio that doesn't divide cleanly, the failure surfaces as a generic RuntimeError from .view() (e.g., "shape '[...]' is invalid for input of size N"), which is opaque versus the explicit alignment assertion you added two lines below.
♻️ Optional preconditions
+    assert H_q % H_k == 0, (
+        f"H_q ({H_q}) must be divisible by H_k ({H_k}) for GQA reshape"
+    )
     # Reshape to 5D matching kernel docstring:
     #   q/o: (b=1, total, h_k, h_r, d/dv)
     #   k/v: (b=1, total, h_k, 1, d/dv)
     h_r = H_q // H_k
     q_5d = q.view(1, total_q, H_k, h_r, D)
     k_5d = k.view(1, total_kv, H_k, 1, D)
     v_5d = v.view(1, total_kv, H_k, 1, D_v)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/attention/cute_dsl/fmha.py` around lines 455 - 467, Validate preconditions before reshaping: ensure H_q % H_k == 0 and that q, k, v, o (and lse if present) are contiguous so .view() won't raise an opaque RuntimeError; if the check fails, raise a clear ValueError indicating which condition failed (mention H_q and H_k for divisibility and the specific tensor name for contiguity). Add these checks just before computing h_r and constructing q_5d/k_5d/v_5d/o_5d/lse_4d so failures are explicit and point to the offending symbol.
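The shape bookkeeping behind the 5D/4D reshape can be checked without any tensors. The sketch below only computes target shapes and validates the head ratio; `gqa_5d_shapes` is a hypothetical helper name, and it does not cover the contiguity precondition, which needs a real tensor to inspect.

```python
from math import prod


def gqa_5d_shapes(total_q, total_kv, H_q, H_k, D, D_v):
    """Target shapes for q/k/v/o (5D) and lse (4D), with an explicit GQA check."""
    if H_k <= 0 or H_q % H_k != 0:
        raise ValueError(f"Invalid GQA configuration: H_q={H_q}, H_k={H_k}")
    h_r = H_q // H_k  # query heads per KV head
    return {
        "q": (1, total_q, H_k, h_r, D),
        "k": (1, total_kv, H_k, 1, D),
        "v": (1, total_kv, H_k, 1, D_v),
        "o": (1, total_q, H_k, h_r, D_v),
        "lse": (1, total_q, H_k, h_r),
    }


shapes = gqa_5d_shapes(total_q=10, total_kv=20, H_q=8, H_k=2, D=128, D_v=64)
# The reshape must preserve the element count of the original (total, H, D) layout.
q_elements_match = prod(shapes["q"]) == 10 * 8 * 128
```

Since the element count is unchanged, a view is valid whenever the source tensor is contiguous, and the divisibility check is what turns a bad head ratio into a readable error.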
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 575abdd7-7dea-48d3-ac5d-b7044aed5d18
📥 Commits
Reviewing files that changed from the base of the PR and between ab8968ee4c4241a379731bd2f290328b84601d85 and 9cac4c51cf756d6f3a09270dacf708ae4a1ce970.
📒 Files selected for processing (1)
flashinfer/attention/cute_dsl/fmha.py
/bot run
/bot run
/bot run
f44c91a to b5a7903
/bot run
saltyminty left a comment:
Internal CI failures look unrelated, approved.
0dbe280 to 75ac935
Update flashinfer cute-dsl backend to match the new DKG FMHA kernel API (28047647d5f on feature/fmha_fi_integration) shipped via cubin_publishing.

Runtime API changes (flashinfer/attention/cute_dsl/fmha.py):
- Drop front-padding requirement and remove the docstring section about it.
- Reshape q/k/v/o tensors from 4D (B, S, H, D) to 5D matching kernel docstring: q/o: (1, total, H_k, H_q//H_k, D); k/v: (1, total, H_k, 1, D)
- LSE reshaped to 4D (1, total_q, H_k, H_q//H_k).
- TVM-FFI: pass torch.Tensors directly (not data_ptr()); drop trailing q_tensor env-stream-detection arg (removed in new DKG).
- Add attention_sinks parameter and enable_sink to variant lookup; pass sink tensor through to kernel.
- CuTe native ABI path updated to 5D from_dlpack with leading_dim=4.

flashinfer/prefill.py:
- Remove warnings about cute-dsl not supporting PDL/sinks; pass attention_sinks through to cute_dsl_fmha_ragged_prefill.
- Drop front-padding caveat from trtllm_ragged_attention_deepseek docstring.

Benchmark / test cleanup:
- benchmarks/routines/attention.py: remove front_pad_q/front_pad_kv allocation and slicing for q/k/v and output.
- tests/attention/test_trtllm_gen_attention.py: drop the `if backend == "cute-dsl"` front-padding branches in test_trtllm_gen_prefill (bf16) and test_trtllm_gen_prefill_fp8.
- Add enable_sink parametrize to test_trtllm_gen_prefill, with sink_attention_unified reference for sink case. Currently parametrized to [False] only with TODO, pending DKG kernel sink semantics fix (kernel adds raw sink to row_sum without exp/max-shift, mismatching trtllm-gen logit-style sink).

Verified: 432/432 non-sink cute-dsl tests pass; perf 1.07-1.13x vs trtllm-native on FP8 h192 (no regression vs prior cute-dsl baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
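The sink mismatch mentioned in that commit message can be illustrated numerically. The sketch below assumes "logit-style sink" means the sink acts as one extra logit in the softmax denominator (shifted by the row max and exponentiated), and models the reported mismatch as adding the raw sink value to the row sum. Both functions are illustrative reconstructions from the description, not kernel code.

```python
import math


def denominator_logit_sink(scores, sink):
    """Sink as an extra logit: shift by the shared max, then exponentiate."""
    m = max(max(scores), sink)
    row_sum = sum(math.exp(s - m) for s in scores) + math.exp(sink - m)
    return m, row_sum


def denominator_raw_sink(scores, sink):
    """Described mismatch: raw sink added to row_sum, no exp and no max-shift."""
    m = max(scores)
    row_sum = sum(math.exp(s - m) for s in scores) + sink
    return m, row_sum


def attention_weights(scores, m, row_sum):
    """Softmax weights over the real keys, given a max and a denominator."""
    return [math.exp(s - m) / row_sum for s in scores]


scores, sink = [1.0, 2.0, 3.0], 0.5
m_ref, sum_ref = denominator_logit_sink(scores, sink)
m_raw, sum_raw = denominator_raw_sink(scores, sink)
weights_ref = attention_weights(scores, m_ref, sum_ref)
# With a logit-style sink, key weights plus the sink's own share sum to 1.
sink_share = math.exp(sink - m_ref) / sum_ref
```

Under these assumptions the two denominators differ for the same inputs, so outputs normalized with one cannot match a reference that uses the other.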
Wire use_pdl through the DSL FMHA Python wrapper to match the DKG kernel's new launch parameter:
- _get_variant_name appends _pdl suffix
- cute_dsl_fmha_ragged_prefill takes enable_pdl, passed to both tvm-ffi and cute-native call paths
- trtllm_ragged_attention_deepseek threads enable_pdl through
- output assumed_align 16 -> 32 (STG256)

Re-enable enable_sink=[False, True] in test_trtllm_gen_prefill; DKG sink calc fix landed: 864 passed / 0 failed / 1728 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous commit added enable_sink as a required arg to test_trtllm_gen_prefill but did not update test_trtllm_gen_prefill_bs1, which calls it directly. Every parameterization of bs1 would have hit TypeError: missing 1 required positional argument: 'enable_sink'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cute_dsl_fmha_ragged_prefill passes assumed_align=32 for the output tensor, which enables 256-bit store instructions in the kernel. A user-provided non-aligned out tensor (e.g. sliced view) would crash opaquely; assert at the boundary instead. Docstring updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Point flashinfer at the newly published cute-dsl FMHA cubin bundle, which integrates DKG feature/fmha_fi_integration (front-padding removal, sink, STG.256, use_pdl variant). Verified via prefill UT (864 passed, 1728 skipped) and ragged bench at D=128/192 (cute-dsl >= trtllm-native across all measured shapes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
75ac935 to 8983e5e
Update cute-dsl fmha prefill (cubin integration) by removing front-padding and adding support for attention_sink and PDL.
Runtime API changes (flashinfer/attention/cute_dsl/fmha.py):
flashinfer/prefill.py:
Benchmark / test cleanup:
- Drop the if backend == "cute-dsl" front-padding branches in test_trtllm_gen_prefill (bf16) and test_trtllm_gen_prefill_fp8.
Verified: 432/432 non-sink cute-dsl tests pass; perf 1.07-1.13x vs trtllm-native on FP8 h192 (no regression vs prior cute-dsl baseline).
UT cmd:
python -m pytest tests/attention/test_trtllm_gen_attention.py::test_trtllm_gen_prefill -k "cute-dsl"
Benchmark results:
Cubin commit: 801e770219613fbf088bc074c414732b26cc550d
Date: 2026-04-28
Backends: cute-dsl vs trtllm-native
Config: causal, num_qo_heads=128, num_kv_heads=128, head_dim_vo=128
Timing: CUPTI + CUDA Graph, num_iters=30, dry_run_iters=5
Cubin source: public artifactory (downloaded fresh, checksums verified)
head_dim_qk = 192 (FP8 e4m3 only — BF16 unsupported at D=192)
Result: cute-dsl wins all 5 shapes (1.07×–1.13×).
head_dim_qk = 128
FP8 e4m3
Result: cute-dsl wins all 5 shapes (1.08×–1.17×).
BF16
Result: cute-dsl wins on batch=1 shapes (1.04×–1.09×); parity on batch=4.