
Improve flash.cute paged_kv cpasync #2156

Merged

v0i0 merged 2 commits into Dao-AILab:main from v0i0:v0i0/improve-paged-ldgsts
Jan 12, 2026
Conversation

@v0i0 (Collaborator) commented Jan 8, 2026

No description provided.

@v0i0 v0i0 requested review from drisspg and jayhshah January 8, 2026 22:24
```python
should_load = tXcX[0, m, 0][0] < seqlenk_row_limit
for m in cutlass.range_constexpr(cute.size(tXsX, mode=[1])):
    row_valid = tXcX[0, m, 0][0] < seqlenk_row_limit
    should_load = cute.make_fragment_like(tXsX[None, m, 0], cute.Boolean)
```
A collaborator left a review comment on this diff:

nit: make_rmem_tensor_like
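For readers outside the CuTe DSL, the per-row predicate fill in the diff can be sketched in plain Python. This is an analogy with made-up shapes and names, not the kernel code: the real loop builds an rmem Boolean fragment per row of the copy tile from a single `row_valid` bit.

```python
def make_row_predicates(num_rows, vals_per_row, seqlen_row_limit):
    """Build one boolean 'fragment' per row, analogous to filling a
    make_fragment_like(...) Boolean tensor from a single row_valid bit."""
    preds = []
    for m in range(num_rows):  # mirrors range_constexpr over mode 1 of tXsX
        # mirrors: row_valid = tXcX[0, m, 0][0] < seqlenk_row_limit
        row_valid = m < seqlen_row_limit
        # broadcast the single bit across every value the row's copy touches
        preds.append([row_valid] * vals_per_row)
    return preds

print(make_row_predicates(4, 2, 3))
# [[True, True], [True, True], [True, True], [False, False]]
```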

@drisspg (Collaborator) left a comment:

LGTM. CC @jayhshah, I'll let you take a look.

@reubenconducts (Contributor) commented Jan 9, 2026

Some very quick tests show ever-so-slightly worse perf than #2104. Have you measured that, i.e. adding your changes onto #2104?

@v0i0 (Collaborator, Author) commented Jan 9, 2026

> Some very quick tests show ever-so-slightly worse perf than #2104. Have you measured that, i.e. adding your changes onto #2104?

The first commit in the PR has the benchmark script I used.

@reubenconducts (Contributor) commented:

> Some very quick tests show ever-so-slightly worse perf than #2104. Have you measured that, i.e. adding your changes onto #2104?
>
> The first commit in the PR has the benchmark script I used.

Thanks! I stand corrected: the combination of this PR and #2104 appears to almost close the gap between page size < 128 and page size = 128:

```
COMBINED PRS
====================================================================================================
PAGED ATTENTION BENCHMARK
====================================================================================================
Page sizes: [1, 4, 8, 16, 32, 64, 128]
Head dimensions: [128]
Batch sizes: [4]
Sequence lengths: [65536]
Causal: True, dtype: torch.bfloat16
Testing fragmented page tables: False
====================================================================================================

### headdim=128, batch=4, seqlen=65536 ###
  Baseline (no paging): 0.653ms, 6.6 TFLOPS
  page_size=  1 (contiguous): 0.731ms, 5.8750 TFLOPS, overhead: +11.9%
  page_size=  4 (contiguous): 0.721ms, 5.9529 TFLOPS, overhead: +10.4%
  page_size=  8 (contiguous): 0.720ms, 5.9626 TFLOPS, overhead: +10.2%
  page_size= 16 (contiguous): 0.720ms, 5.9638 TFLOPS, overhead: +10.2%
  page_size= 32 (contiguous): 0.728ms, 5.8962 TFLOPS, overhead: +11.5%
  page_size= 64 (contiguous): 0.753ms, 5.7043 TFLOPS, overhead: +15.2%
  page_size=128 (contiguous): 0.656ms, 6.5462 TFLOPS, overhead: +0.4%
```

@jayhshah (Collaborator) commented Jan 9, 2026

It's interesting that you use a predicate of size (atom_v, rest_v) instead of (rest_v), but this still generates the right LDGSTS instructions. If I switch over to

```python
should_load = cute.make_fragment_like(tXsX[(0, None), None, 0], cute.Boolean)
for m in cutlass.range_constexpr(cute.size(tXsX, mode=[1])):
    row_valid = tXcX[0, m, 0][0] < seqlenk_row_limit
    should_load[None, m].fill(row_valid)
```

and also set the gmem layout correctly via

```python
mX_paged_cur_copy_cur = mX_paged_cur_copy[None, ki]
tXsX_cur = tXsX[None, m, k]
mX_paged_cur_copy_cur = cute.make_tensor(mX_paged_cur_copy_cur.iterator, tXsX_cur.layout)
```

then use

```python
cute.copy(
    self.gmem_tiled_copy_KV,
    mX_paged_cur_copy_cur,
    tXsX_cur,
    pred=should_load[None, m],
)
```

I can sometimes get better results, e.g.

```
NOT CAUSAL
### headdim=128, batch=4, seqlen=8192 ###
  Baseline (no paging): 1.454ms, 1512.8 TFLOPS
  OLD page_size= 64 (contiguous): 3.819ms, 575.8 TFLOPS, overhead: +162.7%
  NEW page_size= 64 (contiguous): 3.418ms, 643.3 TFLOPS, overhead: +135.3%

CAUSAL
### headdim=128, batch=4, seqlen=8192 ###
  Baseline (no paging): 0.780ms, 1409.4 TFLOPS
  OLD page_size= 64 (contiguous): 1.834ms, 599.6 TFLOPS, overhead: +134.8%
  NEW page_size= 64 (contiguous): 1.793ms, 613.3 TFLOPS, overhead: +129.8%
```

But this prefetch computation for the predicate sometimes hurts perf as well, e.g. for varlen. We can merge this PR and revisit once the distributed offset PR is also merged in.
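The effect of passing `pred=` to `cute.copy` can be mimicked in plain Python. This is a sketch with made-up data, not the real mechanism: in the kernel, the predicate masks the generated cp.async/LDGSTS instructions per element rather than running a Python loop.

```python
def predicated_copy(src, dst, pred):
    """Copy src[i] into dst[i] only where pred[i] holds, leaving
    masked-out destination slots untouched (here: zero-initialized)."""
    for i, (value, keep) in enumerate(zip(src, pred)):
        if keep:
            dst[i] = value
    return dst

# Rows at or past the sequence-length limit are masked, as row_valid does.
seqlenk_row_limit = 3
src = [10, 11, 12, 13]  # stand-in K/V rows
pred = [m < seqlenk_row_limit for m in range(len(src))]
print(predicated_copy(src, [0] * len(src), pred))
# [10, 11, 12, 0]
```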

vikhyat added a commit to m87-labs/kestrel that referenced this pull request Jan 9, 2026:

> Replace if/elif branching with predicated cp.async for paged KV loading.
> This simplifies the code by removing the fill_swizzled helper and using
> a single loop with the pred= parameter.
>
> Based on: Dao-AILab/flash-attention#2156
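The refactor described in that commit message, replacing per-case branching with a single predicated copy, can be sketched outside the DSL. Names and data here are illustrative only; both versions produce the same result, which is why the branchy path can be dropped.

```python
def load_branchy(rows, limit):
    """Old style: explicit if/else per row choosing load vs zero-fill."""
    out = []
    for m, r in enumerate(rows):
        if m < limit:
            out.append(r)   # in-bounds: real load
        else:
            out.append(0)   # out-of-bounds: zero-fill
    return out

def load_predicated(rows, limit):
    """New style: one pass driven by a precomputed predicate,
    analogous to a single cute.copy call with pred=."""
    pred = [m < limit for m in range(len(rows))]
    return [r if keep else 0 for r, keep in zip(rows, pred)]

assert load_branchy([5, 6, 7, 8], 2) == load_predicated([5, 6, 7, 8], 2) == [5, 6, 0, 0]
```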
@v0i0 (Collaborator, Author) commented Jan 12, 2026

@jayhshah took a quick look and it seemed like the reduced predicate tensor increases spilling (which in general seems to be an issue in this code). So going to merge as-is.


@v0i0 v0i0 merged commit dbf08eb into Dao-AILab:main Jan 12, 2026
elewarr pushed a commit to elewarr/flash-attention that referenced this pull request Feb 4, 2026
YangWang92 pushed a commit to YangWang92/flash-attention that referenced this pull request Feb 15, 2026