cute-dsl fmha prefill (cubin integration): remove front-padding, add attention_sink, and pdl support#3181

Merged
saltyminty merged 5 commits into flashinfer-ai:main from limin2021:cute-dsl-fmha-remove-padding-udpate on May 6, 2026
Conversation

Contributor

@limin2021 limin2021 commented Apr 26, 2026

Update the cute-dsl fmha prefill (cubin integration) by removing front-padding and adding support for attention_sink and PDL.

Runtime API changes (flashinfer/attention/cute_dsl/fmha.py):

  • Drop front-padding requirement and remove the docstring section about it.
  • Reshape q/k/v/o tensors from 4D (B, S, H, D) to 5D matching kernel docstring: q/o: (1, total, H_k, H_q//H_k, D); k/v: (1, total, H_k, 1, D)
  • LSE reshaped to 4D (1, total_q, H_k, H_q//H_k).
  • TVM-FFI: pass torch.Tensors directly (not data_ptr()); drop trailing q_tensor env-stream-detection arg (removed in new DKG).
  • Add attention_sinks parameter and enable_sink to variant lookup; pass sink tensor through to kernel.
  • CuTe native ABI path updated to 5D from_dlpack with leading_dim=4.
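
The reshape described above can be sketched as plain shape arithmetic. This is a minimal illustration, not the code in fmha.py: the helper name `fmha_5d_shapes` and the example sizes are assumptions, and only the shape tuples themselves come from the bullets above.

```python
def fmha_5d_shapes(total_q, total_kv, h_q, h_k, d, d_v):
    """Return the 5D q/k/v/o shapes and the 4D LSE shape for a ragged batch.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if h_k <= 0 or h_q % h_k != 0:
        raise ValueError(f"Invalid GQA configuration: H_q={h_q}, H_k={h_k}")
    h_r = h_q // h_k  # query heads that share one KV head
    return {
        "q": (1, total_q, h_k, h_r, d),
        "k": (1, total_kv, h_k, 1, d),
        "v": (1, total_kv, h_k, 1, d_v),
        "o": (1, total_q, h_k, h_r, d_v),
        "lse": (1, total_q, h_k, h_r),
    }


# Example sizes are made up for illustration.
shapes = fmha_5d_shapes(total_q=1024, total_kv=4096, h_q=128, h_k=8, d=192, d_v=128)
print(shapes["q"])    # (1, 1024, 8, 16, 192)
print(shapes["lse"])  # (1, 1024, 8, 16)
```

The leading dimension is always 1 because the batch is flattened into the `total` token axis.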

flashinfer/prefill.py:

  • Remove warnings about cute-dsl not supporting PDL/sinks; pass attention_sinks through to cute_dsl_fmha_ragged_prefill.
  • Drop front-padding caveat from trtllm_ragged_attention_deepseek docstring.
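
With front-padding gone, buffers are sized by the exact token count of the ragged batch. A rough sketch of how that count falls out of the per-request lengths is shown here; the helper `build_indptr` and the lengths are made up, and the real code works on device tensors rather than Python lists.

```python
from itertools import accumulate


def build_indptr(seq_lens):
    """Exclusive prefix sum: indptr[i] is the first token row of request i."""
    return [0, *accumulate(seq_lens)]


qo_lens = [512, 37, 1024]          # per-request query lengths (illustrative)
qo_indptr = build_indptr(qo_lens)  # [0, 512, 549, 1573]
total_q = qo_indptr[-1]            # rows to allocate, with no padding rows
max_qo_len = max(qo_lens)          # can be passed in to avoid recomputing it

print(qo_indptr, total_q, max_qo_len)
```

Request `i` then occupies rows `qo_indptr[i]:qo_indptr[i + 1]` of the flattened q and output tensors.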

Benchmark / test cleanup:

  • benchmarks/routines/attention.py: remove front_pad_q/front_pad_kv allocation and slicing for q/k/v and output.
  • tests/attention/test_trtllm_gen_attention.py: drop the if backend == "cute-dsl" front-padding branches in test_trtllm_gen_prefill (bf16) and test_trtllm_gen_prefill_fp8.
  • Add enable_sink parametrize to test_trtllm_gen_prefill, with sink_attention_unified reference for sink case. Currently parametrized to [False] only with TODO, pending DKG kernel sink semantics fix (kernel adds raw sink to row_sum without exp/max-shift, mismatching trtllm-gen logit-style sink).
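
To make the sink mismatch in the TODO concrete, here is a small pure-Python sketch of the two denominators as I read the description above. Both function names are hypothetical, and the formulas are an interpretation of "logit-style sink" versus "raw sink added to row_sum", not code taken from either kernel.

```python
import math


def softmax_with_logit_sink(scores, sink):
    """Sink treated as one extra logit: it joins the max shift and is exponentiated."""
    m = max(max(scores), sink)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]


def softmax_with_raw_sink(scores, sink):
    """Sink added to the row sum as-is, without exp or max shift (the reported behavior)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + sink
    return [e / denom for e in exps]


scores, sink = [1.0, 2.0, 3.0], 0.5  # illustrative values
p_logit = softmax_with_logit_sink(scores, sink)
p_raw = softmax_with_raw_sink(scores, sink)
print(p_logit)
print(p_raw)
```

The two results differ for the same inputs, which is why a logit-style reference cannot be compared against the kernel until the semantics are aligned.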

Verified: 432/432 non-sink cute-dsl tests pass; perf 1.07-1.13x vs trtllm-native on FP8 h192 (no regression vs prior cute-dsl baseline).

UT command:

python -m pytest tests/attention/test_trtllm_gen_attention.py::test_trtllm_gen_prefill -k "cute-dsl"

benchmark results:

Cubin commit: 801e770219613fbf088bc074c414732b26cc550d
Date: 2026-04-28
Backends: cute-dsl vs trtllm-native
Config: causal, num_qo_heads=128, num_kv_heads=128, head_dim_vo=128
Timing: CUPTI + CUDA Graph, num_iters=30, dry_run_iters=5
Cubin source: public artifactory (downloaded fresh, checksums verified)


head_dim_qk = 192 (FP8 e4m3 only — BF16 unsupported at D=192)

| Shape (B, S_qo, S_kv) | cute-dsl (ms / TFLOPS) | trtllm-native (ms / TFLOPS) | Speedup |
| --- | --- | --- | --- |
| 1, 8192, 8192 | 1.515 / 1814 | 1.628 / 1688 | 1.07× |
| 1, 8192, 32768 | 8.582 / 2242 | 9.546 / 2016 | 1.11× |
| 1, 8192, 65536 | 18.022 / 2288 | 20.025 / 2059 | 1.11× |
| 4, 512, 81920 | 6.585 / 2081 | 7.454 / 1838 | 1.13× |
| 4, 1024, 81920 | 12.481 / 2189 | 14.051 / 1944 | 1.13× |

Result: cute-dsl wins all 5 shapes (1.07×–1.13×).
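
The Speedup column is consistent with dividing the trtllm-native latency by the cute-dsl latency. A quick check using the head_dim_qk = 192 latencies listed above (the variable names are only for this sketch):

```python
# (shape, cute-dsl ms, trtllm-native ms) copied from the head_dim_qk = 192 results
rows = [
    ("1, 8192, 8192", 1.515, 1.628),
    ("1, 8192, 32768", 8.582, 9.546),
    ("1, 8192, 65536", 18.022, 20.025),
    ("4, 512, 81920", 6.585, 7.454),
    ("4, 1024, 81920", 12.481, 14.051),
]

# Speedup > 1 means cute-dsl is faster than trtllm-native.
speedups = [f"{native_ms / cute_ms:.2f}" for _, cute_ms, native_ms in rows]
for (shape, _, _), s in zip(rows, speedups):
    print(f"{shape}: {s}x")
```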


head_dim_qk = 128

FP8 e4m3

| Shape (B, S_qo, S_kv) | cute-dsl (ms / TFLOPS) | trtllm-native (ms / TFLOPS) | Speedup |
| --- | --- | --- | --- |
| 1, 8192, 8192 | 1.452 / 1515 | 1.568 / 1403 | 1.08× |
| 1, 8192, 32768 | 7.745 / 1988 | 9.058 / 1699 | 1.17× |
| 1, 8192, 65536 | 16.209 / 2035 | 18.665 / 1767 | 1.15× |
| 4, 512, 81920 | 5.846 / 1875 | 6.504 / 1685 | 1.11× |
| 4, 1024, 81920 | 11.176 / 1955 | 12.486 / 1750 | 1.12× |

Result: cute-dsl wins all 5 shapes (1.08×–1.17×).

BF16

| Shape (B, S_qo, S_kv) | cute-dsl (ms / TFLOPS) | trtllm-native (ms / TFLOPS) | Speedup |
| --- | --- | --- | --- |
| 1, 8192, 8192 | 1.702 / 1292 | 1.778 / 1237 | 1.04× |
| 1, 8192, 32768 | 10.192 / 1510 | 11.117 / 1385 | 1.09× |
| 1, 8192, 65536 | 21.898 / 1506 | 23.266 / 1418 | 1.06× |
| 4, 512, 81920 | 8.721 / 1257 | 8.706 / 1259 | 1.00× |
| 4, 1024, 81920 | 16.114 / 1356 | 16.278 / 1342 | 1.01× |

Result: cute-dsl wins on batch=1 shapes (1.04×–1.09×); parity on batch=4.

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Optional attention-sink and PDL support for ragged attention prefill and kernel selection; APIs now accept sink/PDL options.
  • Bug Fixes

    • Removed misleading warnings about sinks and fixed conditional LSE/output comparisons for ragged flows.
  • Chores

    • Reduced memory usage by eliminating front-padding; tensors and outputs allocate exact ragged sizes.
  • Tests

    • Expanded coverage for sink/PDL modes and simplified FP8 and output verification logic.

Contributor

coderabbitai Bot commented Apr 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Threads attention_sinks and enable_pdl through the CuTe-DSL FMHA ragged prefill path, adds sink/PDL-aware kernel variant selection, reshapes Q/K/V/O to explicit 5D and LSE to 4D, and removes front-padding by allocating tensors sized to actual ragged token counts.
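
As a rough illustration of sink/PDL-aware variant selection, the sketch below builds a lookup key from feature flags. Everything here is hypothetical: the function name, the base name, and the `sink`/`pdl` suffix scheme are assumptions for illustration and may not match how the actual loader names its cubin variants.

```python
def variant_name(base, *, enable_sink=False, use_pdl=False):
    """Compose a kernel-variant key from optional feature flags (hypothetical scheme)."""
    parts = [base]
    if enable_sink:
        parts.append("sink")
    if use_pdl:
        parts.append("pdl")
    return "_".join(parts)


print(variant_name("fmha_prefill"))
print(variant_name("fmha_prefill", enable_sink=True, use_pdl=True))
```

Keeping the flags keyword-only makes it harder to swap them by accident at call sites.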

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Benchmarks & Prefill Allocations: benchmarks/routines/attention.py, flashinfer/prefill.py | Remove front-padded ragged allocations and slicing; allocate q/k/v and outputs to exact ragged lengths; stop emitting warnings and forward attention_sinks/enable_pdl into the cute-dsl kernel. |
| CuTe-DSL FMHA Kernel Wrapping & Sink/PDL: flashinfer/attention/cute_dsl/fmha.py | Add enable_sink/use_pdl to variant naming and kernel loader; cute_dsl_fmha_ragged_prefill accepts attention_sinks and enable_pdl; reshape tensors to 5D (q/k/v/o) and 4D (lse); update TVM-FFI and native-ABI call signatures and remove unused placeholder arg. |
| Tests: Ragged Prefill & FP8: tests/attention/test_trtllm_gen_attention.py | Parametrize tests over sink enabled/disabled; branch reference computation for sink path; remove front-padding/slicing in prefill and FP8 tests; allocate outputs matching reference shape and conditionally verify LSE when present. |

Sequence Diagram(s)

sequenceDiagram
  participant Test as Client/Test
  participant Prefill as prefill.trtllm_ragged_attention_deepseek
  participant Wrapper as cute_dsl_fmha_ragged_prefill
  participant Kernel as TVM/native FMHA Kernel
  participant KV as KV Cache

  Test->>Prefill: call(inputs, attention_sinks?, enable_pdl?)
  Prefill->>Wrapper: forward q/k/v/o, lse, attention_sinks, enable_pdl
  Wrapper->>Kernel: prepare 5D q/k/v/o, 4D lse, pass attention_sinks & enable_pdl
  Kernel-->>Wrapper: compute output (and optional lse)
  Wrapper->>KV: write outputs into KV cache
  Wrapper-->>Prefill: return/write output (and lse if produced)
  Prefill-->>Test: return output (and lse if produced)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • nvpohanh
  • qsang-nv
  • aleozlx
  • jimmyzho
  • bkryu
  • cyx-6
  • yzh119
  • nv-yunzheq

Poem

🐰 I hopped through kernels, sinks in tow,
I stacked my tensors five-deep in a row,
No padding in my stride,
Ragged tokens ride,
A happy prefill — ready, hop, and go! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main changes: removing front-padding, adding attention_sink support, and adding PDL support to the cute-dsl FMHA prefill integration. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is comprehensive and complete. It includes a detailed summary of changes across multiple files, implementation details, test verification, benchmark results, and specific commit information. All required checklist items are marked as complete. |


@limin2021 limin2021 changed the title from "cute-dsl fmha cubin integration: remove front-padding, add sink, add pdl support" to "cute-dsl fmha prefill (cubin integration): remove front-padding, add sink, add pdl support" Apr 26, 2026
@limin2021 limin2021 changed the title from "cute-dsl fmha prefill (cubin integration): remove front-padding, add sink, add pdl support" to "cute-dsl fmha prefill (cubin integration): remove front-padding, add attention_sink, and pdl support" Apr 26, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request removes the front-padding requirement for the cute-dsl backend, simplifying tensor allocation across benchmarks, tests, and the core implementation. It also introduces support for attention sinks and updates tensor reshaping logic to 5D/4D for kernel compatibility. Review feedback identifies a potential risk of RuntimeErrors when using .view() on non-contiguous tensors and points out that the enable_pdl flag is currently being ignored despite the removal of its warning.

I am having trouble creating individual review comments. Click here to see my feedback.

flashinfer/attention/cute_dsl/fmha.py (451-457)

medium

The use of .view() on q, k, v, o, and lse assumes that these tensors are contiguous in memory. If any of these tensors are non-contiguous (e.g., resulting from a transpose or slice operation), this will raise a RuntimeError. Since this is a public API, it is safer to use .reshape() which handles non-contiguous tensors by creating a copy if necessary, or explicitly call .contiguous() before .view() if you want to ensure no silent copies are made without the user's knowledge. Additionally, an explicit check that H_k > 0 and H_q % H_k == 0 would provide a more helpful error message than the one produced by a failed view/reshape call.

    if H_k == 0 or H_q % H_k != 0:
        raise ValueError(f"Invalid GQA configuration: H_q={H_q}, H_k={H_k}")
    h_r = H_q // H_k
    q_5d = q.reshape(1, total_q, H_k, h_r, D)
    k_5d = k.reshape(1, total_kv, H_k, 1, D)
    v_5d = v.reshape(1, total_kv, H_k, 1, D_v)
    o_5d = o.reshape(1, total_q, H_k, h_r, D_v)
    # LSE: (1, total_q, h_k, h_r) — 4D row-major.
    lse_4d = lse.reshape(1, total_q, H_k, h_r) if lse is not None else None
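
For readers unfamiliar with why `.view()` can fail, the sketch below checks row-major contiguity from a shape and its element strides. It is a conservative, pure-Python illustration with hypothetical helper names; PyTorch's actual `.view()` rule is somewhat more permissive than "fully contiguous". One related point worth weighing: for an output buffer such as `o`, a silent copy made by `.reshape()` would mean the kernel writes into a temporary rather than the caller's tensor, so an explicit check may be preferable there.

```python
def row_major_strides(shape):
    """Element strides of a densely packed row-major tensor with this shape."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True if strides match the row-major layout (size-1 dims may have any stride)."""
    expected = row_major_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))


# (total, H, D) tensor allocated directly: densely packed.
print(is_contiguous((4, 8, 64), (512, 64, 1)))   # True
# Same logical shape obtained by transposing an (H, total, D) tensor.
print(is_contiguous((4, 8, 64), (64, 256, 1)))   # False
```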

flashinfer/prefill.py (3828-3832)

medium

The PR title and description claim to add PDL support for the cute-dsl backend, and the warning for enable_pdl was removed here. However, the enable_pdl parameter is not passed to cute_dsl_fmha_ragged_prefill (see line 3845 in the updated file), and that function's signature in flashinfer/attention/cute_dsl/fmha.py does not include an enable_pdl argument. This means the enable_pdl flag is currently silently ignored for the cute-dsl backend. If PDL is supported, it should be plumbed through to the kernel launch; otherwise, the warning should be retained.

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/attention/test_trtllm_gen_attention.py (1)

2183-2204: ⚠️ Potential issue | 🔴 Critical

test_trtllm_gen_prefill_bs1 will TypeError — missing enable_sink argument.

test_trtllm_gen_prefill now requires enable_sink: bool (added at line 1855 with no default) but test_trtllm_gen_prefill_bs1 still calls it with only 9 positional args. Every parameterization of test_trtllm_gen_prefill_bs1 will fail with TypeError: test_trtllm_gen_prefill() missing 1 required positional argument: 'enable_sink' as soon as the function body runs.

🔧 Proposed fix
 @pytest.mark.parametrize("enable_pdl", [None])
+@pytest.mark.parametrize("enable_sink", [False])
 @pytest.mark.parametrize("max_q_len", [8192])
@@
 def test_trtllm_gen_prefill_bs1(
     backend: str,
     mla_dimensions: MLAHeadDimensions,
     batch_size: int,
     s_qo: int,
     s_kv: int,
     num_kv_heads: int,
     head_grp_size: int,
     causal: bool,
     skips_softmax: bool,
+    enable_sink: bool,
 ):
     test_trtllm_gen_prefill(
         backend,
         mla_dimensions,
         batch_size,
         s_qo,
         s_kv,
         num_kv_heads,
         head_grp_size,
         causal,
         skips_softmax,
+        enable_sink,
     )

Or, if the bs1 wrapper should always run with sinks disabled regardless, just hard-code False at the call site:

         skips_softmax,
+        False,  # enable_sink: kept disabled for bs1 until DKG sink semantics are aligned
     )
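
The failure mode is easy to reproduce outside pytest. The stand-in functions below are hypothetical and much smaller than the real tests; they only show that a wrapper which does not forward a newly required positional argument raises TypeError at call time, and that forwarding it (or hard-coding a value) resolves it.

```python
def run_prefill(backend, causal, enable_sink):
    """Stand-in for the shared test body; enable_sink has no default."""
    return {"backend": backend, "causal": causal, "enable_sink": enable_sink}


def wrapper_missing_arg(backend, causal):
    # Mirrors the reported bug: the new required argument is not forwarded.
    return run_prefill(backend, causal)


def wrapper_forwarding(backend, causal, enable_sink=False):
    # Forward the flag explicitly so the call matches the new signature.
    return run_prefill(backend, causal, enable_sink)


try:
    wrapper_missing_arg("cute-dsl", True)
    raised = False
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")

result = wrapper_forwarding("cute-dsl", True)
print(result)
```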
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/attention/test_trtllm_gen_attention.py` around lines 2183 - 2204, The
test wrapper test_trtllm_gen_prefill_bs1 calls test_trtllm_gen_prefill without
the newly required enable_sink argument, causing a TypeError; update the call in
test_trtllm_gen_prefill_bs1 to supply the enable_sink boolean (e.g., pass False
if sinks should be disabled for bs1) or add enable_sink to the wrapper's
signature and forward it through to test_trtllm_gen_prefill so the function
receives the required parameter.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6832df13-24ce-44a4-b06c-f171eea859c9

📥 Commits

Reviewing files that changed from the base of the PR and between 5e1318c and 6f32e8c6fe782c5581fb6d3ab9d767e024d73dde.

📒 Files selected for processing (4)
  • benchmarks/routines/attention.py
  • flashinfer/attention/cute_dsl/fmha.py
  • flashinfer/prefill.py
  • tests/attention/test_trtllm_gen_attention.py

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/attention/test_trtllm_gen_attention.py (1)

2169-2201: ⚠️ Potential issue | 🔴 Critical

Critical: test_trtllm_gen_prefill_bs1 no longer matches the new test_trtllm_gen_prefill signature.

test_trtllm_gen_prefill now requires enable_sink (line 1852, no default), and is parametrized over [False, True]. However test_trtllm_gen_prefill_bs1:

  1. Does not parametrize enable_sink, and
  2. Calls test_trtllm_gen_prefill(...) with only 9 positional args (line 2191-2201) — enable_sink is missing.

This will raise TypeError: missing 1 required positional argument: 'enable_sink' for every parametrized invocation of test_trtllm_gen_prefill_bs1, breaking the bs1 path entirely.

🐛 Proposed fix
 @pytest.mark.parametrize("enable_pdl", [None])
+@pytest.mark.parametrize("enable_sink", [False, True])
 @pytest.mark.parametrize("max_q_len", [511])
 ...
 def test_trtllm_gen_prefill_bs1(
     backend: str,
     mla_dimensions: MLAHeadDimensions,
     batch_size: int,
     s_qo: int,
     s_kv: int,
     num_kv_heads: int,
     head_grp_size: int,
     causal: bool,
     skips_softmax: bool,
+    enable_sink: bool,
 ):
     test_trtllm_gen_prefill(
         backend,
         mla_dimensions,
         batch_size,
         s_qo,
         s_kv,
         num_kv_heads,
         head_grp_size,
         causal,
         skips_softmax,
+        enable_sink,
     )

Alternatively, give enable_sink a default of False in test_trtllm_gen_prefill if bs1 should keep its current narrower coverage.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/attention/test_trtllm_gen_attention.py` around lines 2169 - 2201, The
bs1 wrapper test_trtllm_gen_prefill_bs1 calls test_trtllm_gen_prefill but
doesn’t provide the newly required enable_sink argument; update
test_trtllm_gen_prefill_bs1 to either add a pytest.parametrize for
enable_sink=[False,True] (to match the upstream signature) or pass an explicit
enable_sink value when calling test_trtllm_gen_prefill (e.g., enable_sink=False)
so the call supplies the missing parameter; refer to the function names
test_trtllm_gen_prefill_bs1 and test_trtllm_gen_prefill and the parameter
enable_sink when making the change.
🧹 Nitpick comments (1)
flashinfer/attention/cute_dsl/fmha.py (1)

376-388: Document the new enable_pdl parameter.

The function signature added enable_pdl (line 333) but the docstring's Parameters section ends at skip_softmax_threshold_scale_factor with no entry for enable_pdl. While here, the enable_tvm_ffi description still says “Default False” although the signature default is True (pre-existing).

📝 Suggested doc additions
     skip_softmax_threshold_scale_factor : float, optional
         Threshold scale factor for skip-softmax sparsity (https://arxiv.org/abs/2512.12087).
         The actual threshold = scale_factor / max_kv_len, then converted to log2 domain.
         None or 0 disables skip-softmax.
+    enable_pdl : bool, optional
+        If True, select the `_pdl` kernel variant and enable Programmatic
+        Dependent Launch at runtime. Default False.
     """

Based on learnings: "Keep documentation in sync with code changes, especially when modifying flashinfer_api, backend_requirement, or TVM-FFI macros".
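
The skip-softmax threshold arithmetic quoted in the docstring can be sketched as follows. The function name and the example values are assumptions, and "converted to log2 domain" is read here as taking log2 of the threshold; how the kernel consumes the value is not shown in this PR.

```python
import math


def skip_softmax_log2_threshold(scale_factor, max_kv_len):
    """threshold = scale_factor / max_kv_len, expressed in the log2 domain.

    Returns None when skip-softmax is disabled (scale_factor is None or 0).
    """
    if not scale_factor:
        return None
    threshold = scale_factor / max_kv_len
    return math.log2(threshold)


# Illustrative values: 4 / 8192 = 2**-11.
print(skip_softmax_log2_threshold(4.0, 8192))   # -11.0
print(skip_softmax_log2_threshold(None, 8192))  # None
```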

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/attention/cute_dsl/fmha.py` around lines 376 - 388, Update the
docstring to document the new enable_pdl parameter and correct the
enable_tvm_ffi default: add a Parameters entry for enable_pdl describing its
type (bool), behavior (what enabling PDL does), and default value, and change
the enable_tvm_ffi description to reflect its actual default (True) instead of
"Default False"; locate the parameter list in the fmha function docstring near
the existing enable_tvm_ffi and skip_softmax_threshold_scale_factor entries and
add the new enable_pdl description right after them so the docs match the
function signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/attention/cute_dsl/fmha.py`:
- Around line 514-522: The output tensor passed to from_dlpack() uses
assumed_align=32 but we only validate shape when a user provides out, so
non-32-byte-aligned user tensors can crash the kernel; add an explicit alignment
check and fail fast before calling from_dlpack() (e.g., check o_5d.data_ptr() %
32 == 0 and raise a clear ValueError) for both branches (is_fp8_out true and
false) and/or document the 32-byte alignment precondition in the function
docstring at the top of the file so callers know to provide a 32B-aligned
tensor.
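
A fail-fast alignment check of the kind suggested above could look like this sketch. The helper name is hypothetical and it operates on a raw integer address; with PyTorch the address would typically come from `tensor.data_ptr()`.

```python
def require_alignment(ptr, alignment=32, name="out"):
    """Raise a clear error if a buffer address is not a multiple of `alignment` bytes."""
    if ptr % alignment != 0:
        raise ValueError(
            f"{name} must be {alignment}-byte aligned, "
            f"but its address is offset by {ptr % alignment} bytes"
        )


require_alignment(0x7F0000001000)  # aligned address: passes silently

try:
    require_alignment(0x7F0000001010)  # 16 bytes past a 32-byte boundary
    misaligned_rejected = False
except ValueError as exc:
    misaligned_rejected = True
    print(exc)
```

A sliced user tensor is the usual way to end up with such an offset, since slicing moves the start address without reallocating.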

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c26fe6ec-5bfd-4032-9da0-a8baaf55c4e7

📥 Commits

Reviewing files that changed from the base of the PR and between 6f32e8c6fe782c5581fb6d3ab9d767e024d73dde and 9dfbf655d37af8f4c340b5053aa57ef9fa8dfa4e.

📒 Files selected for processing (3)
  • flashinfer/attention/cute_dsl/fmha.py
  • flashinfer/prefill.py
  • tests/attention/test_trtllm_gen_attention.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/prefill.py

Comment thread flashinfer/attention/cute_dsl/fmha.py
Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/attention/cute_dsl/fmha.py (1)

377-389: ⚠️ Potential issue | 🟡 Minor

Stale docstring: enable_tvm_ffi default and missing enable_pdl entry.

Two issues in this docstring block:

  1. Line 379 states "Default False (CuTe native ABI)", but the signature on line 328 defaults to True. The default flipped without the docstring being updated.
  2. The enable_pdl parameter (line 333) is not documented at all.
📝 Proposed docstring fix
     enable_tvm_ffi : bool
         If True, use TVM-FFI ABI (pass data_ptr() for Pointer args, torch.Tensor
-        for Tensor args, no explicit stream). Default False (CuTe native ABI).
+        for Tensor args, no explicit stream). Default True. Set to False to use
+        the CuTe native ABI path.
     max_qo_len : int, optional
         Maximum query sequence length. Computed from qo_indptr if not provided.
         Pass this from plan() to avoid D2H copy during CUDA graph capture.
     max_kv_len : int, optional
         Maximum KV sequence length. Computed from kv_indptr if not provided.
     skip_softmax_threshold_scale_factor : float, optional
         Threshold scale factor for skip-softmax sparsity (https://arxiv.org/abs/2512.12087).
         The actual threshold = scale_factor / max_kv_len, then converted to log2 domain.
         None or 0 disables skip-softmax.
+    enable_pdl : bool
+        If True, select the PDL-enabled kernel variant and pass the PDL flag
+        through to the kernel call. Default False.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/attention/cute_dsl/fmha.py` around lines 377 - 389, The docstring
for the FMHA constructor/function is stale: update the enable_tvm_ffi parameter
description to match the actual signature default (set to True in the code) and
clarify its behavior (TVM-FFI ABI vs CuTe native ABI), and add a new documented
entry for enable_pdl (include its type, default value from the signature, and a
short description of what enabling PDL changes or controls). Reference the
parameter names enable_tvm_ffi and enable_pdl (and the surrounding docstring
block in fmha.py) so the text matches the function signature defaults and
explains their effects concisely.
🧹 Nitpick comments (2)
flashinfer/attention/cute_dsl/fmha.py (2)

244-275: Document the new enable_sink and use_pdl parameters.

Two new keyword parameters were added to get_cute_dsl_fmha_kernel (lines 241-242) but the docstring (lines 244-275) doesn't describe them. While at it, varlen and with_lse are also undocumented in this docstring. As per coding guidelines (retrieved learning): "Keep documentation in sync with code changes."

📝 Proposed docstring additions
     enable_skip_softmax : bool
         If True, load kernel compiled with skip-softmax support.
+    enable_sink : bool
+        If True, load kernel variant compiled with attention-sink support.
+    use_pdl : bool
+        If True, load kernel variant compiled with Programmatic Dependent Launch (PDL) support.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/attention/cute_dsl/fmha.py` around lines 244 - 275, Update the
get_cute_dsl_fmha_kernel docstring to document the two new keyword parameters
enable_sink and use_pdl and also add missing docs for varlen and with_lse:
describe enable_sink (bool) meaning and effect on kernel execution, describe
use_pdl (bool) and how it changes PDL/parameters or compilation behavior,
describe varlen (bool) semantics for variable-length sequences, and describe
with_lse (bool) for also producing the log-sum-exp (LSE) output; reference the
function name get_cute_dsl_fmha_kernel and ensure each parameter entry follows
the existing docstring style (type and short description) alongside the other
parameters.

455-467: Optional: validate H_q % H_k == 0 and contiguity for clearer error messages.

The new 5D reshape relies on (a) H_q being divisible by H_k and (b) q/k/v/o being contiguous. Both are typically true for callers from prefill.py, but if a caller passes a non-contiguous view or a head ratio that doesn't divide cleanly, the failure surfaces as a generic RuntimeError from .view() (e.g., "shape '[...]' is invalid for input of size N"), which is opaque versus the explicit alignment assertion you added two lines below.

♻️ Optional preconditions
+    assert H_q % H_k == 0, (
+        f"H_q ({H_q}) must be divisible by H_k ({H_k}) for GQA reshape"
+    )
     # Reshape to 5D matching kernel docstring:
     #   q/o: (b=1, total, h_k, h_r, d/dv)
     #   k/v: (b=1, total, h_k, 1, d/dv)
     h_r = H_q // H_k
     q_5d = q.view(1, total_q, H_k, h_r, D)
     k_5d = k.view(1, total_kv, H_k, 1, D)
     v_5d = v.view(1, total_kv, H_k, 1, D_v)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/attention/cute_dsl/fmha.py` around lines 455 - 467, Validate
preconditions before reshaping: ensure H_q % H_k == 0 and that q, k, v, o (and
lse if present) are contiguous so .view() won't raise an opaque RuntimeError; if
the check fails, raise a clear ValueError indicating which condition failed
(mention H_q and H_k for divisibility and the specific tensor name for
contiguity). Add these checks just before computing h_r and constructing
q_5d/k_5d/v_5d/o_5d/lse_4d so failures are explicit and point to the offending
symbol.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 575abdd7-7dea-48d3-ac5d-b7044aed5d18

📥 Commits

Reviewing files that changed from the base of the PR and between ab8968ee4c4241a379731bd2f290328b84601d85 and 9cac4c51cf756d6f3a09270dacf708ae4a1ce970.

📒 Files selected for processing (1)
  • flashinfer/attention/cute_dsl/fmha.py

Contributor

@nvpohanh nvpohanh left a comment

LGTM. cc @leejnau

@yzh119
Collaborator

yzh119 commented Apr 28, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !617 has been created, and the CI pipeline #49759961 is currently running. I'll report back once the pipeline job completes.

@limin2021
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !617 has been updated with latest changes, and the CI pipeline #49769417 is currently running. I'll report back once the pipeline job completes.

@limin2021
Contributor Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !617 has been created, and the CI pipeline #49800198 is currently running. I'll report back once the pipeline job completes.

@saltyminty saltyminty force-pushed the cute-dsl-fmha-remove-padding-udpate branch from f44c91a to b5a7903 Compare May 1, 2026 20:09
@saltyminty
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !617 has been updated with latest changes, and the CI pipeline #50034667 is currently running. I'll report back once the pipeline job completes.

Collaborator

@saltyminty saltyminty left a comment

Internal CI failures look unrelated, approved.

@saltyminty saltyminty force-pushed the cute-dsl-fmha-remove-padding-udpate branch 2 times, most recently from 0dbe280 to 75ac935 Compare May 5, 2026 17:08
limin2021 and others added 5 commits May 5, 2026 10:39
Update flashinfer cute-dsl backend to match the new DKG FMHA kernel API
(28047647d5f on feature/fmha_fi_integration) shipped via cubin_publishing.

Runtime API changes (flashinfer/attention/cute_dsl/fmha.py):
- Drop front-padding requirement and remove the docstring section about it.
- Reshape q/k/v/o tensors from 4D (B, S, H, D) to 5D matching kernel docstring:
  q/o: (1, total, H_k, H_q//H_k, D); k/v: (1, total, H_k, 1, D)
- LSE reshaped to 4D (1, total_q, H_k, H_q//H_k).
- TVM-FFI: pass torch.Tensors directly (not data_ptr()); drop trailing q_tensor
  env-stream-detection arg (removed in new DKG).
- Add attention_sinks parameter and enable_sink to variant lookup; pass sink
  tensor through to kernel.
- CuTe native ABI path updated to 5D from_dlpack with leading_dim=4.
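
The target shapes listed above can be written out as a small helper. This is only a sketch: the function name `fmha_5d_shapes` is hypothetical, and it assumes separate token totals for q/o and k/v and a single head dimension `D` for all tensors, which the commit message does not spell out.

```python
def fmha_5d_shapes(total_q: int, total_kv: int, h_q: int, h_k: int, d: int) -> dict:
    """Return the kernel-facing shapes for q/o, k/v, and LSE."""
    if h_q % h_k != 0:
        raise ValueError("H_q must be a multiple of H_k")
    h_r = h_q // h_k  # query heads sharing one KV head
    return {
        "q_o": (1, total_q, h_k, h_r, d),   # q and o: 5D
        "k_v": (1, total_kv, h_k, 1, d),    # k and v: 5D with a singleton group dim
        "lse": (1, total_q, h_k, h_r),      # LSE: 4D, no head-dim axis
    }


shapes = fmha_5d_shapes(total_q=1024, total_kv=2048, h_q=32, h_k=8, d=128)
```

The leading `1` reflects that all sequences in the ragged batch are packed along the `total` axis rather than carried as a separate batch dimension.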

flashinfer/prefill.py:
- Remove warnings about cute-dsl not supporting PDL/sinks; pass attention_sinks
  through to cute_dsl_fmha_ragged_prefill.
- Drop front-padding caveat from trtllm_ragged_attention_deepseek docstring.

Benchmark / test cleanup:
- benchmarks/routines/attention.py: remove front_pad_q/front_pad_kv allocation
  and slicing for q/k/v and output.
- tests/attention/test_trtllm_gen_attention.py: drop the `if backend ==
  "cute-dsl"` front-padding branches in test_trtllm_gen_prefill (bf16) and
  test_trtllm_gen_prefill_fp8.
- Add enable_sink parametrize to test_trtllm_gen_prefill, with sink_attention_unified
  reference for sink case. Currently parametrized to [False] only with TODO,
  pending DKG kernel sink semantics fix (kernel adds raw sink to row_sum
  without exp/max-shift, mismatching trtllm-gen logit-style sink).
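
The sink mismatch noted in the last bullet can be illustrated on a single attention row. This sketch assumes "logit-style" means the sink enters the softmax denominator as one extra logit with no value attached; neither the DKG kernel nor the trtllm-gen code appears in this thread, so both function names below are hypothetical.

```python
import math


def softmax_with_logit_sink(scores, sink):
    """Attention weights when the sink acts as one extra logit in the denominator."""
    m = max(max(scores), sink)  # shared max-shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    row_sum = sum(exps) + math.exp(sink - m)  # sink is exponentiated and shifted
    return [e / row_sum for e in exps]


def softmax_with_raw_sink(scores, sink):
    """The mismatching variant: the raw sink value is added to row_sum."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    row_sum = sum(exps) + sink  # no exp, no max-shift
    return [e / row_sum for e in exps]


scores, sink = [2.0, 1.0, 0.0], 0.5
logit_style = softmax_with_logit_sink(scores, sink)
raw_style = softmax_with_raw_sink(scores, sink)
```

In the logit-style form the weights sum to less than one because the sink absorbs part of the probability mass, and the result does not depend on which shift `m` is used. The raw-sink form changes with the shift, which is why the two cannot agree in general.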

Verified: 432/432 non-sink cute-dsl tests pass; perf 1.07-1.13x vs trtllm-native
on FP8 h192 (no regression vs prior cute-dsl baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wire use_pdl through the DSL FMHA Python wrapper to match the DKG
kernel's new launch parameter:
- _get_variant_name appends _pdl suffix
- cute_dsl_fmha_ragged_prefill takes enable_pdl, passed to both
  tvm-ffi and cute-native call paths
- trtllm_ragged_attention_deepseek threads enable_pdl through
- output assumed_align 16 -> 32 (STG256)

Re-enable enable_sink=[False, True] in test_trtllm_gen_prefill;
DKG sink calc fix landed: 864 passed / 0 failed / 1728 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous commit added enable_sink as a required arg to
test_trtllm_gen_prefill but did not update test_trtllm_gen_prefill_bs1,
which calls it directly. Every parameterization of bs1 would have hit
TypeError: missing 1 required positional argument: 'enable_sink'.
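
The failure mode described here is easy to reproduce in isolation. The function names below are hypothetical stand-ins for the two tests, chosen so the snippet runs without pytest.

```python
def run_prefill(batch_size, enable_sink):
    """Stand-in for a test function that gained a required argument."""
    return (batch_size, enable_sink)


def run_prefill_bs1_stale():
    # Caller written before enable_sink existed; it omits the new argument.
    return run_prefill(1)


def run_prefill_bs1_fixed(enable_sink=False):
    # Fixed caller forwards the new argument explicitly.
    return run_prefill(1, enable_sink)


try:
    run_prefill_bs1_stale()
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

Because the wrapper calls the test function directly rather than through pytest parametrization, the missing argument only shows up at call time, so the fix has to be made in the caller.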

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cute_dsl_fmha_ragged_prefill passes assumed_align=32 for the output
tensor, which enables 256-bit store instructions in the kernel. A
user-provided non-aligned out tensor (e.g. sliced view) would crash
opaquely; assert at the boundary instead. Docstring updated.
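
A boundary check of this kind might look like the sketch below. It assumes the output object exposes `data_ptr()` as `torch.Tensor` does; the helper `assert_output_aligned` and the `FakeOut` stand-in are hypothetical, and the actual assertion in `fmha.py` is not shown in this thread.

```python
OUTPUT_ALIGN_BYTES = 32  # matches assumed_align=32 from the commit message


class FakeOut:
    """Hypothetical stand-in for a tensor; only data_ptr() is needed."""

    def __init__(self, addr: int):
        self._addr = addr

    def data_ptr(self) -> int:
        return self._addr


def assert_output_aligned(out, align: int = OUTPUT_ALIGN_BYTES) -> None:
    """Fail early if the output buffer's base address is not align-byte aligned."""
    addr = out.data_ptr()
    assert addr % align == 0, (
        f"out must be {align}-byte aligned, got address {addr:#x} "
        f"(offset {addr % align} bytes past the last aligned address)"
    )


assert_output_aligned(FakeOut(0x7F000000))  # aligned base address passes
```

A view sliced at a non-zero element offset shifts the base address, which is how a caller-supplied `out` can end up off the 32-byte boundary even when the underlying allocation is aligned.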

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Point flashinfer at the newly published cute-dsl FMHA cubin bundle,
which integrates DKG feature/fmha_fi_integration (front-padding removal,
sink, STG.256, use_pdl variant). Verified via prefill UT (864 passed,
1728 skipped) and ragged bench at D=128/192 (cute-dsl >= trtllm-native
across all measured shapes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@saltyminty saltyminty force-pushed the cute-dsl-fmha-remove-padding-udpate branch from 75ac935 to 8983e5e Compare May 5, 2026 17:39
@saltyminty saltyminty merged commit 89af11c into flashinfer-ai:main May 6, 2026
72 of 78 checks passed
5 participants