Include sm_110 in Blackwell-family arch gating (follow-up to #2572) by Johnsonms · Pull Request #2590 · Dao-AILab/flash-attention

Johnsonms · 2026-05-25T08:02:46Z

Follow-up to #2572: align remaining `arch // 10 == 10` checks with the codebase’s Blackwell-family convention (`arch // 10 in [10, 11]`) so sm_110 (Thor) is not unintentionally excluded.

`flash_fwd_sm100.py` already accepts both sm_10x and sm_11x, and `interface.py` dispatches both through the same `FlashAttentionForwardSm100` / MLA paths. However, several remaining `== 10` checks still cause sm_110 to silently miss optimized paths.
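To make the difference concrete, the sketch below evaluates both predicates on a few integer arch values. Only the two expressions come from this PR; the helper names and the sample values are illustrative assumptions, not code from the repository.

```python
def is_sm10x_only(arch: int) -> bool:
    """Narrow gate: matches compute capability 10.x only."""
    return arch // 10 == 10


def is_blackwell_family(arch: int) -> bool:
    """Family gate: matches both 10.x and 11.x."""
    return arch // 10 in [10, 11]


# Sample arch ints chosen for illustration (110 stands for sm_110).
for arch in (90, 100, 103, 110, 120):
    print(arch, is_sm10x_only(arch), is_blackwell_family(arch))
```

For `arch = 110` the narrow gate is false while the family gate is true, which is the case this PR addresses.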

Split into two commits for easier review.

Commit 1 — `flash_bwd_postprocess.py`

Updates two 2CTA gating sites:

- `use_2cta_instrs`
- 2CTA TMEM remap branch

Previously, sm_110 could not enter the 2CTA postprocess path even when the rest of the backward pipeline was already using 2CTA. This aligns the logic with the existing Blackwell-family convention used elsewhere in the codebase.

Commit 2 — `interface.py`

Updates three inconsistent gating sites:

- `q_stage` heuristic
- `use_dedicated_hd256_kernel` (fwd)
- `use_dedicated_hd256_kernel` (bwd)

Currently, sm_110 falls back to less optimized paths (`q_stage=1` and generic hd=256 kernels). Nearby logic in the same file already uses `arch // 10 in [10, 11]`, so these appear to be oversights rather than intentional restrictions.
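The silent fallback described above can be sketched as follows. This is a hypothetical selector, not the logic in `interface.py`: the function name, the `family_gate` switch, the `head_dim == 256` condition, and the returned labels are all assumptions made for illustration.

```python
def pick_hd256_path(arch: int, head_dim: int, family_gate: bool) -> str:
    """Hypothetical selector; the real conditions in interface.py may differ."""
    if family_gate:
        arch_ok = arch // 10 in [10, 11]  # convention adopted by this PR
    else:
        arch_ok = arch // 10 == 10  # narrow check being replaced
    # Assumed extra condition: the dedicated kernel only applies to head_dim 256.
    return "dedicated_hd256" if arch_ok and head_dim == 256 else "generic"


# With the narrow check, arch 110 takes the generic path without any error.
print(pick_hd256_path(110, 256, family_gate=False))
print(pick_hd256_path(110, 256, family_gate=True))
```

Nothing fails in the narrow case; the call simply returns the generic path, which is why a gap like this is easy to miss.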

If any of these paths are intentionally sm_100-only (e.g. unvalidated on sm_110), happy to revert individual hunks.

Also updates one stale `# SM100 only` comment to `# Blackwell family` for consistency.

Intentionally unchanged

`interface.py:480` still keeps the FP8 assertion:

`arch // 10 == 10`

with the message:

"FP8 is only supported on SM100 (compute capability 10.x) for FA4 CuTe."

This appears intentional, so FP8 gating was left unchanged.
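As a rough illustration of why this gate reads as deliberate, here is a minimal sketch of an assertion of that shape. The condition and the message are quoted from this PR; the function name and the surrounding structure are assumptions, not the code at `interface.py:480`.

```python
FP8_MESSAGE = "FP8 is only supported on SM100 (compute capability 10.x) for FA4 CuTe."


def check_fp8_arch(arch: int) -> None:
    """Hypothetical wrapper around the narrow FP8 check."""
    # Deliberately the narrow form: 11.x is rejected here, unlike the family gates.
    assert arch // 10 == 10, FP8_MESSAGE


check_fp8_arch(100)  # passes silently
try:
    check_fp8_arch(110)
except AssertionError as exc:
    print(exc)  # prints the FP8 message
```

Unlike the heuristics changed in this PR, a failure here is loud and names SM100 explicitly, so widening it would change documented behavior rather than fix an omission.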

Note on implementation style

#2572 used `is_family_of` because that file operates on the `Arch` enum. These files use `arch: int`, so this PR follows the existing local convention:

`arch // 10 in [10, 11]`

rather than introducing a broader Arch enum refactor.

@jayhshah

The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`, which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor). The rest of the codebase (e.g. interface.py:549, 563, 834) consistently gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`. Bring the two postprocess sites in line with that convention. Flagged by @jayhshah in Dao-AILab#2572 follow-up discussion.

Three sites in interface.py gate Blackwell-family behavior using `arch // 10 == 10`, which appears inconsistent with the rest of the file's `arch // 10 in [10, 11]` convention (used at lines 549, 563, 834, 974, 1035, etc.):

- L533: `q_stage` heuristic for Blackwell forward
- L579: `use_dedicated_hd256_kernel` (forward)
- L1335: `use_dedicated_hd256_kernel` (backward)

The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x through the same `FlashAttentionForwardSm100` / MLA classes, so these gates likely should treat them the same.

NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional SM100-only paths. If any of them is intentional, please flag so I can revert just that hunk. The FP8 assert at L480 is left untouched on purpose: its error message reads as deliberate.

Pre-existing format drift surfaced by pre-commit. Not in the cute_exclude pattern, so it gets auto-fixed when other files in flash_attn/cute/ are touched in the same commit chain.

…b#2572) (Dao-AILab#2590)

* Fix bwd postprocess 2CTA gating to include sm_11x

  The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`, which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor). The rest of the codebase (e.g. interface.py:549, 563, 834) consistently gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`. Bring the two postprocess sites in line with that convention. Flagged by @jayhshah in Dao-AILab#2572 follow-up discussion.

* Include sm_110 in interface.py Blackwell-family heuristics

  Three sites in interface.py gate Blackwell-family behavior using `arch // 10 == 10`, which appears inconsistent with the rest of the file's `arch // 10 in [10, 11]` convention (used at lines 549, 563, 834, 974, 1035, etc.):

  - L533: `q_stage` heuristic for Blackwell forward
  - L579: `use_dedicated_hd256_kernel` (forward)
  - L1335: `use_dedicated_hd256_kernel` (backward)

  The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x through the same `FlashAttentionForwardSm100` / MLA classes, so these gates likely should treat them the same.

  NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional SM100-only paths. If any of them is intentional, please flag so I can revert just that hunk. The FP8 assert at L480 is left untouched on purpose: its error message reads as deliberate.

* Apply ruff format to flash_bwd_sm100.py

  Pre-existing format drift surfaced by pre-commit. Not in the cute_exclude pattern, so it gets auto-fixed when other files in flash_attn/cute/ are touched in the same commit chain.

Johnsonms added 2 commits May 25, 2026 07:27

Johnsonms requested review from jayhshah, tridao and v0i0 May 25, 2026 08:03

Johnsonms marked this pull request as ready for review May 25, 2026 23:35

Apply ruff format to flash_bwd_sm100.py

77efc89

Johnsonms force-pushed the johnson/sm110-blackwell-gating branch from c5c2fbb to 77efc89 Compare May 25, 2026 23:37

jayhshah approved these changes May 26, 2026

jayhshah merged commit 59cf537 into Dao-AILab:main May 26, 2026
1 check passed

jayhshah mentioned this pull request May 26, 2026

[CuTe,Sm110] Fix sm110 2cta dQ postprocess #2491

Closed

Johnsonms deleted the johnson/sm110-blackwell-gating branch May 30, 2026 06:14

jayhshah merged 3 commits into Dao-AILab:main from Johnsonms:johnson/sm110-blackwell-gating
