Include sm_110 in Blackwell-family arch gating (follow-up to #2572) #2590

Merged
jayhshah merged 3 commits into Dao-AILab:main from Johnsonms:johnson/sm110-blackwell-gating
May 26, 2026

Conversation

Collaborator

@Johnsonms Johnsonms commented May 25, 2026

Follow-up to #2572: align remaining arch // 10 == 10 checks with the codebase’s Blackwell-family convention (arch // 10 in [10, 11]) so sm_110 (Thor) is not unintentionally excluded.

flash_fwd_sm100.py already accepts both sm_10x and sm_11x, and interface.py dispatches both through the same FlashAttentionForwardSm100 / MLA paths. However, several remaining == 10 checks still cause sm_110 to silently miss optimized paths.
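The difference between the two gate forms can be sketched as below. This is a minimal illustration, not code from the repository: the helper names `gate_sm100_only` and `gate_blackwell_family` are hypothetical, and the sketch assumes `arch` is an integer encoded as major * 10 + minor (e.g. 100, 103, 110), which is what the `arch // 10` idiom implies.

```python
# Hypothetical helpers; the repository writes these checks inline.
# Assumption: arch is an int such as 100 (sm_100), 103 (sm_103), 110 (sm_110).


def gate_sm100_only(arch: int) -> bool:
    """Old form: true only for SM 10.x."""
    return arch // 10 == 10


def gate_blackwell_family(arch: int) -> bool:
    """Convention adopted by this PR: true for SM 10.x and SM 11.x."""
    return arch // 10 in [10, 11]


for arch in (90, 100, 103, 110, 120):
    print(arch, gate_sm100_only(arch), gate_blackwell_family(arch))
```

For `arch = 110`, the old form evaluates to `False` and the family form to `True`, which is the gap this PR closes.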

Split into two commits for easier review.

Commit 1 — flash_bwd_postprocess.py

Updates two 2CTA gating sites:

  • use_2cta_instrs
  • 2CTA TMEM remap branch

Previously, sm_110 could not enter the 2CTA postprocess path even when the rest of the backward pipeline was already using 2CTA. This aligns the logic with the existing Blackwell-family convention used elsewhere in the codebase.

Commit 2 — interface.py

Updates three inconsistent gating sites:

  • q_stage heuristic
  • use_dedicated_hd256_kernel (fwd)
  • use_dedicated_hd256_kernel (bwd)

Currently, sm_110 falls back to less optimized paths (q_stage=1 and generic hd=256 kernels). Nearby logic in the same file already uses arch // 10 in [10, 11], so these appear to be oversights rather than intentional restrictions.

If any of these paths are intentionally sm_100-only (e.g. unvalidated on sm_110), happy to revert individual hunks.

Also updates one stale # SM100 only comment to # Blackwell family for consistency.
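For reviewers who want to check whether any other `== 10` gates remain, one possible audit is a small text scan like the sketch below. It is not how the sites in this PR were found; the function name `find_sm100_only_gates` and the sample text are made up for illustration.

```python
import re

# Matches the SM100-only form `arch // 10 == 10`, tolerating whitespace differences.
SM100_ONLY_GATE = re.compile(r"\barch\s*//\s*10\s*==\s*10\b")


def find_sm100_only_gates(source: str) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped line) for every SM100-only gate."""
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(source.splitlines(), start=1)
        if SM100_ONLY_GATE.search(line)
    ]


# Made-up sample text, NOT taken from interface.py.
SAMPLE = """\
if arch // 10 == 10:
    q_stage = 2
if arch // 10 in [10, 11]:
    use_2cta = True
assert arch//10==10, "FP8 only"
"""

hits = find_sm100_only_gates(SAMPLE)
for lineno, text in hits:
    print(f"{lineno}: {text}")
```

To scan a real file, the same function could be given the result of `pathlib.Path(...).read_text()`. Each hit still needs a manual decision, since some gates (such as the FP8 assertion) may be SM100-only on purpose.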

Intentionally unchanged

interface.py:480 still keeps the FP8 assertion:

arch // 10 == 10

with the message:

"FP8 is only supported on SM100 (compute capability 10.x) for FA4 CuTe."

This appears intentional, so FP8 gating was left unchanged.
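The effect of leaving this assertion as-is can be sketched as follows. The condition and message are quoted from the PR description; the wrapper `check_fp8_arch`, its `uses_fp8` parameter, and the `is_rejected` helper are hypothetical and do not reflect how interface.py structures the check.

```python
FP8_MESSAGE = "FP8 is only supported on SM100 (compute capability 10.x) for FA4 CuTe."


def check_fp8_arch(arch: int, uses_fp8: bool) -> None:
    """Hypothetical wrapper: FP8 inputs stay gated to SM 10.x only."""
    if uses_fp8:
        assert arch // 10 == 10, FP8_MESSAGE


def is_rejected(arch: int, uses_fp8: bool) -> bool:
    """Return True if the check raises for this combination."""
    try:
        check_fp8_arch(arch, uses_fp8)
    except AssertionError:
        return True
    return False


for arch in (100, 110):
    print(arch, "fp8 rejected:", is_rejected(arch, uses_fp8=True))
```

Under this sketch, sm_110 is still rejected for FP8 inputs while non-FP8 inputs are unaffected, matching the PR's decision to leave FP8 gating unchanged.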

Note on implementation style

#2572 used is_family_of because that file operates on the Arch enum. These files use arch: int, so this PR follows the existing local convention:

arch // 10 in [10, 11]

rather than introducing a broader Arch enum refactor.

Johnsonms added 2 commits May 25, 2026 07:27
The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`,
which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor).
The rest of the codebase (e.g. interface.py:549, 563, 834) consistently
gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`.

Bring the two postprocess sites in line with that convention.

Flagged by @jayhshah in Dao-AILab#2572 follow-up discussion.

Three sites in interface.py gate Blackwell-family behavior using
`arch // 10 == 10`, which appears inconsistent with the rest of the
file's `arch // 10 in [10, 11]` convention (used at lines 549, 563,
834, 974, 1035, etc.):

- L533: `q_stage` heuristic for Blackwell forward
- L579: `use_dedicated_hd256_kernel` (forward)
- L1335: `use_dedicated_hd256_kernel` (backward)

The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x
through the same `FlashAttentionForwardSm100` / MLA classes, so these
gates likely should treat them the same.

NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional
SM100-only paths. If any of them is intentional, please flag so I can
revert just that hunk. The FP8 assert at L480 is left untouched on
purpose — its error message reads as deliberate.
@Johnsonms Johnsonms requested review from jayhshah, tridao and v0i0 May 25, 2026 08:03
@Johnsonms Johnsonms marked this pull request as ready for review May 25, 2026 23:35
Pre-existing format drift surfaced by pre-commit. Not in the
cute_exclude pattern, so it gets auto-fixed when other files in
flash_attn/cute/ are touched in the same commit chain.
@Johnsonms Johnsonms force-pushed the johnson/sm110-blackwell-gating branch from c5c2fbb to 77efc89 Compare May 25, 2026 23:37
@jayhshah jayhshah merged commit 59cf537 into Dao-AILab:main May 26, 2026
1 check passed
@Johnsonms Johnsonms deleted the johnson/sm110-blackwell-gating branch May 30, 2026 06:14
reubenconducts pushed a commit to reubenconducts/flash-attention that referenced this pull request Jun 2, 2026
…b#2572) (Dao-AILab#2590)

* Fix bwd postprocess 2CTA gating to include sm_11x

The 2CTA gating in flash_bwd_postprocess.py used `arch // 10 == 10`,
which only matches SM 10.x (B100/B200/B300) and misses SM 11.x (Thor).
The rest of the codebase (e.g. interface.py:549, 563, 834) consistently
gates Blackwell-family 2CTA features as `arch // 10 in [10, 11]`.

Bring the two postprocess sites in line with that convention.

Flagged by @jayhshah in Dao-AILab#2572 follow-up discussion.

* Include sm_110 in interface.py Blackwell-family heuristics

Three sites in interface.py gate Blackwell-family behavior using
`arch // 10 == 10`, which appears inconsistent with the rest of the
file's `arch // 10 in [10, 11]` convention (used at lines 549, 563,
834, 974, 1035, etc.):

- L533: `q_stage` heuristic for Blackwell forward
- L579: `use_dedicated_hd256_kernel` (forward)
- L1335: `use_dedicated_hd256_kernel` (backward)

The dispatch in `_flash_attn_fwd` already routes both sm_10x and sm_11x
through the same `FlashAttentionForwardSm100` / MLA classes, so these
gates likely should treat them the same.

NOTE FOR REVIEWERS: I'm not certain these are all oversight vs. intentional
SM100-only paths. If any of them is intentional, please flag so I can
revert just that hunk. The FP8 assert at L480 is left untouched on
purpose — its error message reads as deliberate.

* Apply ruff format to flash_bwd_sm100.py

Pre-existing format drift surfaced by pre-commit. Not in the
cute_exclude pattern, so it gets auto-fixed when other files in
flash_attn/cute/ are touched in the same commit chain.