Include sm_110 in Blackwell-family arch gating (follow-up to #2572) #2590
Merged
jayhshah merged 3 commits on May 26, 2026
The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`, which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor). The rest of the codebase (e.g. interface.py:549, 563, 834) consistently gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`. Bring the two postprocess sites in line with that convention. Flagged by @jayhshah in Dao-AILab#2572 follow-up discussion.
Three sites in interface.py gate Blackwell-family behavior using `arch // 10 == 10`, which appears inconsistent with the rest of the file's `arch // 10 in [10, 11]` convention (used at lines 549, 563, 834, 974, 1035, etc.):

- L533: `q_stage` heuristic for Blackwell forward
- L579: `use_dedicated_hd256_kernel` (forward)
- L1335: `use_dedicated_hd256_kernel` (backward)

The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x through the same `FlashAttentionForwardSm100` / MLA classes, so these gates likely should treat them the same.

NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional SM100-only paths. If any of them is intentional, please flag so I can revert just that hunk.

The FP8 assert at L480 is left untouched on purpose; its error message reads as deliberate.
Pre-existing format drift surfaced by pre-commit. Not in the cute_exclude pattern, so it gets auto-fixed when other files in flash_attn/cute/ are touched in the same commit chain.
c5c2fbb to 77efc89
jayhshah approved these changes on May 26, 2026
reubenconducts pushed a commit to reubenconducts/flash-attention that referenced this pull request on Jun 2, 2026

…b#2572) (Dao-AILab#2590)

* Fix bwd postprocess 2CTA gating to include sm_11x
* Include sm_110 in interface.py Blackwell-family heuristics
* Apply ruff format to flash_bwd_sm100.py
Follow-up to #2572: align remaining `arch // 10 == 10` checks with the codebase's Blackwell-family convention (`arch // 10 in [10, 11]`) so `sm_110` (Thor) is not unintentionally excluded.

`flash_fwd_sm100.py` already accepts both `sm_10x` and `sm_11x`, and `interface.py` dispatches both through the same `FlashAttentionForwardSm100` / MLA paths. However, several remaining `== 10` checks still cause `sm_110` to silently miss optimized paths.

Split into two commits for easier review.
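To make the difference between the two gate spellings concrete, here is a minimal sketch. It assumes `arch` is the integer encoding implied by `arch: int` (for example 100 for `sm_100`, 103 for `sm_103`, 110 for `sm_110`); the helper names `strict_sm100_gate` and `blackwell_family_gate` are hypothetical and do not exist in the codebase.

```python
def strict_sm100_gate(arch: int) -> bool:
    """Old spelling: true only for SM 10.x."""
    return arch // 10 == 10


def blackwell_family_gate(arch: int) -> bool:
    """Convention used elsewhere: true for SM 10.x and SM 11.x."""
    return arch // 10 in [10, 11]


# Assumed integer encodings of a few compute capabilities, for illustration only.
for arch in (90, 100, 103, 110, 120):
    print(arch, strict_sm100_gate(arch), blackwell_family_gate(arch))
```

Under that assumption, the only value where the two gates disagree is 110, which is the case this PR addresses.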
Commit 1: `flash_bwd_postprocess.py`

Updates two 2CTA gating sites: `use_2cta_instrs`.

Previously, `sm_110` could not enter the 2CTA postprocess path even when the rest of the backward pipeline was already using 2CTA. This aligns the logic with the existing Blackwell-family convention used elsewhere in the codebase.

Commit 2: `interface.py`

Updates three inconsistent gating sites:

- `q_stage` heuristic
- `use_dedicated_hd256_kernel` (fwd)
- `use_dedicated_hd256_kernel` (bwd)

Currently, `sm_110` falls back to less optimized paths (`q_stage=1` and generic hd=256 kernels). Nearby logic in the same file already uses `arch // 10 in [10, 11]`, so these appear to be oversights rather than intentional restrictions.

If any of these paths are intentionally `sm_100`-only (e.g. unvalidated on `sm_110`), happy to revert individual hunks.

Also updates one stale `# SM100 only` comment to `# Blackwell family` for consistency.

Intentionally unchanged

`interface.py:480` still keeps the FP8 assertion, with its existing error message. This appears intentional, so FP8 gating was left unchanged.
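For anyone who wants to look for further strict gates, a small text scan is one possible approach. The sketch below is not part of this PR: `find_strict_sm100_gates` is a hypothetical helper, the sample source is made up rather than copied from the repository, and the regex only catches the literal spelling `arch // 10 == 10`. Hits still need manual review, since a deliberate gate such as the FP8 assertion would be reported too.

```python
import re

# Matches the literal spelling `arch // 10 == 10`, tolerating varied whitespace.
STRICT_GATE = re.compile(r"\barch\s*//\s*10\s*==\s*10\b")


def find_strict_sm100_gates(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each line containing a strict gate."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if STRICT_GATE.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Made-up snippet for illustration; the values and message are placeholders.
SAMPLE = """\
if arch // 10 == 10:
    q_stage = 2
if arch // 10 in [10, 11]:
    use_2cta = True
assert arch//10==10, "hypothetical message"
"""

hits = find_strict_sm100_gates(SAMPLE)
for lineno, text in hits:
    print(f"{lineno}: {text}")
```

In practice the same function could be fed the contents of each file under `flash_attn/cute/`, read with `pathlib.Path.read_text()`.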
Note on implementation style

#2572 used `is_family_of` because that file operates on the `Arch` enum. These files use `arch: int`, so this PR follows the existing local convention (`arch // 10 in [10, 11]`) rather than introducing a broader `Arch` enum refactor.