split out varlen batch search into utils by reubenconducts · Pull Request #2556 · Dao-AILab/flash-attention

reubenconducts · 2026-05-12T19:07:11Z


…d, use shared utility

Addresses @reubenconducts's May 22 review comments on Dao-AILab#2520:

1. Rename mTileCumsum -> mCuTotalMBlocks across all 9 kernels + scheduler + interface for consistency with the convention introduced in Dao-AILab#2224 (already in main; used by blocksparse, and by Dao-AILab#2559).
2. Drop num_head (and the related pack_gqa arch-conditional remap) from the host cumsum. Per-batch cumsum is now pure m_blocks; the scheduler handles num_head separately. Removes the SM80/SM120 vs SM90/100/110/MLA branching that previously mirrored the pack_gqa_layout reshape behavior.
3. Replace the inline binary search in _varlen_coord_map's cumsum-on branch with a call to utils.get_batch_from_cu_tensor (the shared utility from Dao-AILab#2556). The existing snap-to-group-boundary + warp-scan structure is preserved: the cumsum serves as a hint to skip ahead, and the warp-scan refines to the exact batch using _get_num_m_blocks (which already handles pack_gqa, q_stage, cluster, etc.). This matches the scheduler-side approach in Dao-AILab#2559.

The pack_gqa seqlen multiplier stays in _compute_cu_total_m_blocks so that per-batch m_block counts match the kernel's _get_num_m_blocks formula. The snap is forward-only, so under-estimating per-batch counts is safe, but over-estimating (which dropping the multiplier would cause when pack_gqa is on) would land the snap past the correct batch and the warp-scan couldn't recover.

Verified on SM100:

- 72 new tests (test_varlen_scheduler_binary_search_correctness{,_bwd}): pass
- existing test_varlen (B=20 slice, 576 cases): pass
- existing test_flash_attn_mla_absorbed_varlen (480 cases): pass
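To make the cumsum-plus-binary-search idea in the commit message concrete, here is a minimal host-side sketch in plain Python. It is not the code from this PR: the function names `cu_total_m_blocks` and `batch_from_cu`, the `tile_m` parameter, and the example sequence lengths are all hypothetical, and the signature of `utils.get_batch_from_cu_tensor` is not shown in the source, so nothing here should be read as its API. The sketch also leaves out pack_gqa, q_stage, clusters, and the snap-to-group-boundary + warp-scan refinement; it only shows how a cumulative m-block table lets a flat tile index be mapped to a batch in O(log B).

```python
from bisect import bisect_right
from itertools import accumulate


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def cu_total_m_blocks(seqlens: list[int], tile_m: int) -> list[int]:
    """Hypothetical helper: cumulative m-block counts with a leading 0.

    Entry b is the number of m-blocks in all batches before batch b, so the
    list has len(seqlens) + 1 entries and the last one is the grand total.
    """
    per_batch = [ceil_div(s, tile_m) for s in seqlens]
    return [0] + list(accumulate(per_batch))


def batch_from_cu(cu: list[int], tile_idx: int) -> tuple[int, int]:
    """Hypothetical helper: map a flat tile index to (batch, local m-block).

    Finds the batch with cu[batch] <= tile_idx < cu[batch + 1] by binary
    search. Zero-length batches occupy no tiles and are skipped naturally.
    """
    if not 0 <= tile_idx < cu[-1]:
        raise IndexError("tile_idx is outside the total m-block range")
    batch = bisect_right(cu, tile_idx) - 1
    return batch, tile_idx - cu[batch]


# Illustrative (made-up) variable-length batch, including an empty sequence.
seqlens = [300, 0, 128, 1]
cu = cu_total_m_blocks(seqlens, tile_m=128)
coords = [batch_from_cu(cu, t) for t in range(cu[-1])]
```

With these made-up inputs the table is `[0, 3, 3, 4, 5]`, and tile 3 maps to batch 2 rather than the empty batch 1. In the kernel described above, the cumsum lookup is only a hint that the warp-scan then refines, which this sketch does not model.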

reubenconducts added 2 commits May 12, 2026 19:05

split out varlen batch search into utils

201f2a1

more descriptive name

62f107c

jayhshah approved these changes May 14, 2026

jayhshah merged commit 0409f9a into Dao-AILab:main May 14, 2026

ussoewwin added a commit to ussoewwin/flash-attention that referenced this pull request May 21, 2026

Merge upstream/main: split-kv blocksparse (Dao-AILab#2536), hdim256 z…

b88e772

…ero-len (Dao-AILab#2568), varlen batch search (Dao-AILab#2556)

reubenconducts mentioned this pull request May 22, 2026

[Cute] Fix O(N^2) per-CTA scan in SingleTileVarlenScheduler #2520

Open

reubenconducts added a commit to reubenconducts/flash-attention that referenced this pull request Jun 2, 2026

split out varlen batch search into utils (Dao-AILab#2556)

69b8709

* split out varlen batch search into utils
* more descriptive name

jayhshah merged 2 commits into Dao-AILab:main from reubenconducts:rstern/search-util
