[Hipblaslt] [Subtile] Tail-loop support for BF16 and MXFP4 #7683
Open
sebvince wants to merge 56 commits into
Open
[Hipblaslt] [Subtile] Tail-loop support for BF16 and MXFP4 #7683sebvince wants to merge 56 commits into
sebvince wants to merge 56 commits into
Conversation
bnemanich
added a commit
that referenced
this pull request
May 21, 2026
The K-tail scaffold's per-mmak early exit emits `s_cmp_le_u32 LoopCounterL, MIK*(subIterK+1)` with the threshold as a compile-time literal. gfx950's VOPC / SOPC inline-constant range is [-16, 64]; literals past 64 land in the assembler's 32-bit literal slot, which some opcodes either reject or silently mis-encode. The threshold exceeds 64 for typical bf16 (MIK=32, subIterK >= 2) and for FP4 (MIK=128) at every mmak >= 0, so the existing emit was already at risk on those configs even though our test fixtures happened to cover only the inline range. Adds `_emitSubtileScalarCmpLitOrStaged` which, for a given cmp op / src0 / literal, emits the cmp directly when the literal fits in the inline range, and otherwise stages it through a scratch sgpr via `s_mov_b32 sScratch, <lit>` first. The per-mmak early exit now goes through this helper. Idea ported from sebvince's #7683 commit 6d7e743 (`Dont use literal for cmp when diff >64`). Test additions in `test_subtile_tailloop_emit.py`: - `test_per_mmak_early_exit_inline_when_consumedK_fits`: bf16 fixture (consumedK=32) must use the inline form. - `test_per_mmak_early_exit_staged_when_consumedK_exceeds_inline`: FP4 fixture (consumedK >= 128) must stage every cmp. - `test_per_mmak_early_exit_boundary_consumedK_64`: custom bf16 DU=128 fixture pins the inclusive upper bound (32, 64 inline; 96 staged). - `test_per_mmak_early_exit_threshold_progression` updated to accept both inline and staged forms via a new helper `_collect_per_mmak_early_exit_thresholds` that pairs each cmp with its preceding `stage literal ...` s_mov_b32 by sgpr name. End-to-end verification on subtile_bf16.yaml shows the production emit using 0x20 / 0x40 inline and staging 0x60..0xe0 through sgprs; both subtile_bf16.yaml and subtile_bf16_anyk_largemt.yaml pass with the change. Unit suite: 871 passed / 5 skipped / 2 xfailed (baseline 868). Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich
added a commit
that referenced
this pull request
May 21, 2026
The byte-refine K-tail mask path previously emitted the full
per-(operand, ir) chain (`v_mov seed` + per-mod runtime-gated
`v_cmp + v_cndmask + v_and`) inside the per-mmak loop body. The
chain is independent of A/B tile data — it depends only on
`kPosBase + mmak*MIK + ir*elementsPerVgpr` and `LoopCounterL` —
so it can be hoisted out once before the loop and the per-mmak
step collapses to pure `v_and_b32 vIdx, vMask[..], vIdx`.
This commit lands that hoist on the byte-refine path:
* `_emitTailSubLaneMaskChainIntoVgpr`: extracted chain emitter
that writes its mask to a caller-owned target VGPR (mirrors
the body of `_emitTailSubLaneMaskRefineSubtile._emitChain`;
keeps the bpe-parametric mod chain, the static `ASEM*bpe %
bpr == 0` skip, and the runtime `LoopCounterL &
(elementsPerVgpr-1)` gate).
* `_emitTailSubLaneMaskPrecomputeSubtile`: allocates one scratch
VGPR per (operand, mmak, ir) and emits the chain into each.
When `bpeA == bpeB` (the common bf16/bf16 case) A and B share
the same mask VGPR per (mmak, ir), halving the persistent
VGPR cost. Returns the `(operand, mmak, ir) -> vgpr` lookup
plus the list of VGPRs to release after the loop.
* `_emitTailSubLaneMaskApplySubtile`: per-mmak v_and-only apply
step. Comment carries `apply precomputed mask to ...` so the
new shape is distinguishable from the legacy inline emit.
`_emitTailLoopScaffoldSubtile` now precomputes the mask map once
before the per-mmak loop and calls apply per mmak. Cleanup of the
precomputed VGPRs happens after the loop completes (the runtime
per-mmak early-exit branches past the cleanup so emit-time pool
accounting balances).
Gated by `kernel.get("SubtileTailMaskPrecompute", True)`. Setting
it False reverts to the legacy `_emitTailSubLaneMaskRefineSubtile`
per-mmak inline chain. The coarse path (ASEM>=numMIInUnroll, MX
scale) is untouched by this commit.
Scope:
* Byte-refine path ONLY (ASEM<numMIInUnroll, integer bpe, no MX).
The user-flagged risk of regressing the per-mmak LR + MFMA
interleave on MT320x288 is avoided by NOT extending the
precompute to the coarse path: that path's higher tile-grid
counts would inflate the persistent VGPR cost on the
high-water configs.
* MT320x288 (subtile_bf16_anyk_largemt): byte-refine kernels
stay at .vgpr_count=249 (same as the per-mmak-LR baseline);
the +numMmaks*vgprPerInUnroll persistent mask VGPRs (8 for
DU=64) do not push past the D-tile high-water.
Tests in `test_subtile_anyk_emit.py`:
* `TestAnyKEmit_Precompute`: new class pinning the hoist
(precompute block sits BEFORE the per-mmak ds_read wait;
apply block contains no cmps/cndmasks), the A/B mask VGPR
dedupe (bf16/bf16 → one VGPR per (mmak, ir) shared across
operands), the config switch behaviour (off reverts to
legacy inline shape), the NoTailLoop no-op, and the
coarse-path identity (ASEM=32 emit byte-identical with the
switch on vs off).
* `TestAnyKEmit_K2::test_k2_emits_per_operand_byte_refine`,
`TestAnyKEmit_K1::test_k1_no_narrow_load_and_emits_partial_chain`:
updated to recognise the new `apply precomputed mask to ...`
apply comment and to verify the chain/gate live in the
precompute prefix rather than per-mmak.
Idea ported from sebvince's #7683 commit a0fee26 (`Precompute
mask for all subIterK`) with subsequent refinements (e9f5f55,
d999a28, 4a4d6a5). Sebvince's framework computes per-iteration
masks (BF16 3-state blend + boundary precompute); this port keeps
our existing bpe-parametric mod chain and just moves the
emit-site from inside the loop to before it.
Unit suite: 876 passed / 5 skipped / 2 xfailed (baseline 868;
+3 from Ask 1, +5 from Ask 2 precompute pins).
Yaml smokes (all PASS):
- subtile_bf16
- subtile_bf16_tail
- subtile_bf16_anyk_k2
- subtile_bf16_anyk_odd
- subtile_bf16_anyk_k8
- subtile_bf16_anyk_largemt (MT320x288 byte-refine, vgpr=249)
- subtile_mxfp4_tail
Co-authored-by: Cursor <cursoragent@cursor.com>
Codecov Report❌ Patch coverage is ❌ Your project status has failed because the head coverage (27.58%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #7683 +/- ##
============================================
- Coverage 61.87% 30.35% -31.51%
============================================
Files 2086 161 -1925
Lines 357038 67735 -289303
Branches 53806 13045 -40761
============================================
- Hits 220892 20560 -200332
+ Misses 117348 45380 -71968
+ Partials 18798 1795 -17003
🚀 New features to boost your workflow:
|
bnemanich
added a commit
that referenced
this pull request
May 23, 2026
In the subtile tail scaffold, the PGR>0 SRD-advance / LWA-XOR
block (which undoes the mainloop-exit's last `gr_inc + lw_swap` so
the tail's GR+LR can re-issue at K_aligned with LWA/LR in sync)
sat AFTER `calculateLoopNumIter(-1)`, i.e. on the not-taken side of
the K%DU==0 early-exit cmp/branch to SkipTailLoopL.
Sebvince's #PR-7683 design (`build_tailloop_pgr0` + `Subtile/
Kernel.mainLoop`) emits the analogous mainloop-exit code as part
of `emitMainAndExitLoops`, BEFORE the orchestrator calls
`calculateLoopNumIter(-1)` and `emitTailLoop`. Nakajee's MT320x320
side-by-side flagged sebvince's placement as preferable for two
reasons:
- The SRD-add / LWA-xor chain has small dependency lengths; on
sebvince's side it co-issues with calculateLoopNumIter's
scalar divide-and-remainder, whereas our previous placement
serialized it after the cmp+branch (a hard dependency on the
scc / branch resolution).
- The OrigLoopCounter snapshot SGPR our scaffold held across
`calculateLoopNumIter` (because it resets OrigLoopCounter to 0
at the start of the tail-number-of-iters compute) becomes
unnecessary once the gate runs before that reset -- the gate
can read `sgprOrigLoopCounter` directly. One fewer SGPR live
across the tail-entry block.
This commit moves the PGR>0 gating block (origCounter == 0 c=0
branch + 0 < origCounter < PGR small-counter realign + origCounter
>= PGR large-counter SRD-advance) to ABOVE `calculateLoopNumIter`
and drops the `savedOrigKAlignDUs` snapshot SGPR. Safety: on the
early-exit path (K%DU == 0 -> SkipTailLoopL) the now-unconditional
SRD-advance / LWA-XOR is harmless -- the post-tail write-out reads
SrdC/SrdD, never SrdA/B/MXSA/MXSB or LocalWriteBaseAddr.
Test updates in `test_subtile_tailloop_emit.py`:
- `test_emits_origCounter_snapshot{PGR1,PGR2}` -> renamed and
flipped to negative pins
`test_omits_origCounter_snapshot_after_hoist` (the snapshot
s_mov_b32 must NOT appear).
- New `test_gating_reads_OrigLoopCounter_directly`: the c=0 cmp
src0 must spell out `s[sgprOrigLoopCounter]`.
- New `test_gating_block_precedes_TailLoopBeginL`: structural
pin that the c=0 origCounter cmp appears BEFORE both the
`skip to end of tail loop b/c numIter==0` early-exit cmp and
the `TailLoopBeginL` label.
- `test_emits_small_counter_compare`: updated to require the
`s_cmp_lt_u32 s[sgprOrigLoopCounter], 1` form (no anonymous
snapshot sgpr).
Idea ported from sebvince #7683 (`build_tailloop_pgr0` design /
`Subtile/Kernel.mainLoop` wrapper order). Nakajee's MT320x320
side-by-side comparison flagged this placement difference and
preferred sebvince's; this commit lands the move.
Unit suite: 879 passed / 5 skipped / 2 xfailed (baseline 877;
+2 from the new gating positive pins).
Yaml smokes (all PASS):
- subtile_bf16
- subtile_bf16_anyk_odd
- subtile_bf16_anyk_largemt (MT320x288 byte-refine)
Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich
added a commit
that referenced
this pull request
May 23, 2026
The byte-refine K-tail mask precompute previously emitted one bpe-parametric `byteRefine[<op> ir=N mmak=M]` chain per (operand, mmak, ir): seed `0xFFFFFFFF` mov + per-mod (mod chain length 2 for bf16, 4 for fp8) `v_add + v_cmp_ge_i32 + v_cndmask_b32 vSeed` + mod=0 close. For bf16 byte-refine that is 7 instructions per (mmak, ir) chain, multiplied by `numMmaks * vgprPerInUnroll` chains (e.g. 2*4 = 8 chains -> 56 instr on the DU=64 path, 4*4 = 16 chains -> 112 instr on the DU=128 path). Sebvince's #7683 design factors out the per-lane invariants once (`emit_mask_k_init`) and turns the per-subIterK chain into a pure 3-state diff/boundary cmp+cndmask (`emit_mask_k`). Nakajee's MT320x320 side-by-side review of #7661 vs #7683 flagged this as "much simpler" and asked to verify equivalence before adopting. Equivalence verification (sebvince's BF16 chain vs ours, with `numMIInUnroll = 8`, `kStride = 2`, `vgprPerInUnroll = 4`): laneK_0 := tidInK * numMIInUnroll (== our kPosBase) diff := LoopCounterL - laneK_0 (signed) effective_diff_n := diff - n*MIK (n = mmak / subIterK) d := LoopCounterL % numMIInUnroll (== effective_diff_n in the boundary range) For each (lane, mmak=n, vgpr i): sFull = diff > n*MIK + numMIInUnroll - 1 <=> effective_diff_n >= numMIInUnroll <=> ALL of this lane's K is in range -> mask = -1 sZero = diff <= n*MIK <=> effective_diff_n <= 0 <=> NONE of this lane's K is in range -> mask = 0 otherwise -> mask = boundary[i] where d <= i*2 -> 0 d == i*2+1 -> 0x0000FFFF (low bf16 in, high past) d >= i*2+2 -> -1 Truth-table check (tidInK=0, mmak=0, K_remain in {1..8}): K_rem=1: i=0 mask=0x0000FFFF (low K=0 in, high K=1 past) i=1..3 mask=0 K_rem=2: i=0 -1 ; i=1..3 mask=0 K_rem=3: i=0 -1 ; i=1 mask=0x0000FFFF ; i=2,3 mask=0 K_rem=4: i=0,1 -1 ; i=2,3 mask=0 K_rem=5: i=0,1 -1 ; i=2 mask=0x0000FFFF ; i=3 mask=0 K_rem=6: i=0,1,2 -1 ; i=3 mask=0 K_rem=7: i=0,1,2 -1 ; i=3 mask=0x0000FFFF K_rem=8: i=0..3 -1 (sFull fires for tidInK=0) Our existing chain (bpe-parametric mod chain) produces the same masks bit-for-bit for every entry above (and for tidInK > 0 the diff/sFull/sZero arithmetic and our `K_pos vs LoopCounterL` cmps algebraically coincide). Verified by hand on the full K_remain 1..8 x i 0..3 truth table. Caveats (all non-blocking for our gauntlet): * Sebvince's chain does NOT have our `(ASEM * bpe) % bpr == 0` static skip; the boundary[i] init runs unconditionally. For ASEM=2/4 (currently fast-path with static skip dropping the mod>0 chain) this emits a few more instructions, but the mask bit-pattern stays identical (d is even -> halfKeep branch never fires; only the full/zero outcomes are reachable). * Sebvince's chain assumes `bpeA == bpeB` (one shared boundary[i] per vgpr position, applied to both operands' tiles). All current gauntlet configs are symmetric bpe (`_subtileTailByteShiftApplies`'s integer-bpe gate plus `_emitTailSrdTightenSubtile`'s explicit `bpeA == bpeB` check). For asymmetric bpe (no current YAML) the dispatcher falls back to the legacy chain. * Storage: init holds `1 (diff) + vgprPerInUnroll (boundary)` = 5 persistent VGPRs for BF16 (numMIInUnroll=8 -> vgprPerInUnroll =4), on top of our existing `numMmaks * vgprPerInUnroll` precomputed masks. On MT320x288 byte-refine (`subtile_bf16_anyk_largemt`, baseline vgpr_count=249) the +5 fits well within the 256-VGPR budget; verified by re-running the yaml under the new emit (still PASS). This commit lands the adoption: * `_emitTailSubLaneMaskInitSebvince(kPosBaseVgpr, numMIInUnroll, bpe, vgprPerInUnroll)` -> emits `diff = sgprLoopCounterL - kPosBase` (signed v_sub_i32) + `d = sgprLoopCounterL & 7` + per-i `(d<hi) ? halfKeep : full ; (d<lo) ? 0 : prev` boundary cndmasks. Returns the persistent diff and boundary[i] vgprs. * `_emitTailSubLaneMaskChainIntoVgprSebvince(diffVgpr, boundaryMaskVgpr, ...)` -> per-(operand, mmak, ir): two cmps + two cndmasks. VOPC inline-range staging via `_subtileCmpSrc1FitsInline` for `mmak*MIK + ... > 64` (BF16 DU=128, mmak=2,3). * `_emitTailSubLaneMaskPrecomputeSubtile` dispatches: bpeA == bpeB == 2 && `SubtileTailMaskSebvinceForm=True` (default) -> sebvince form; otherwise -> legacy chain. Legacy chain retained for fp8 / int8 byte-refine (asymmetric bpe gates deferred) and as a reversibility escape hatch. Tests in `test_subtile_anyk_emit.py`: * `_emit_anyk_tail_asm` gains a `sebvinceForm` kwarg (default True) so individual tests can opt back into the legacy chain for regression coverage. * `TestAnyKEmit_K4::test_k4_sebvince_form_emits_init_and_per_mmak_chain` (replaces `test_k4_byte_refine_mod0_only`): pins sebvince init marker + per-(mmak, ir) sFull cndmask count + absence of the legacy mod>0 / mod=0 byteRefine chain seed. * `TestAnyKEmit_K4::test_k4_legacy_form_emits_mod0_only`: new regression pin for `SubtileTailMaskSebvinceForm=False` (legacy chain still emits the mod=0-only chain on ASEM=4). * `TestAnyKEmit_K2::test_k2_emits_sebvince_form_chain` (replaces `test_k2_emits_per_operand_byte_refine`): pins cmp count growth vs K%32 baseline + sebvince diff init + sFull/sZero pairing + `d = LoopCounterL % 8` init. * `TestAnyKEmit_K1::test_k1_no_narrow_load_and_emits_sebvince_chain` (replaces `test_k1_no_narrow_load_and_emits_partial_chain`): pins absence of narrow d16 load + sebvince init + boundary cndmask + per-(mmak, ir) sFull/sZero cndmasks; the chain sits in the precompute prefix (before per-mmak ds_read wait). * `TestAnyKEmit_Precompute::test_precompute_block_before_per_mmak_loop`: rewritten to pin sebvince diff init + per-mmak sFull cmps in the precompute section; apply section has no chain primitives. * `TestAnyKEmit_Precompute::test_precompute_hoisted_above_dtl_wait_and_barrier`: rewritten to pin sebvince `diff` + per-(mmak, ir) sFull/sZero markers above the DTL wait. Asm excerpt from `subtile_bf16_anyk_odd.yaml` MT128x128 DU=64 (`Cijk_Alik_Bljk_BSS_BH ... MT128x128x64`), the new precompute prefix (lines 1414-1453): ``` v_sub_i32 v12, s[sgprLoopCounterL], v11 // diff = LoopCounterL - kPosBase v_mov_b32 v13, 0x0000FFFF // halfKeep v_and_b32 v14, 7, s[sgprLoopCounterL] // d = LoopCounterL % 8 v_cmp_lt_i32 s[68:69], v14, 2 // boundary[0]: d < 2 ? v_cndmask_b32 v15, -1, v13, s[68:69] // boundary[0] = (d<2) ? halfKeep : full v_cmp_lt_i32 s[68:69], v14, 1 // boundary[0]: d < 1 ? v_cndmask_b32 v15, v15, 0, s[68:69] // boundary[0] = (d<1) ? 0 : prev ... boundary[1..3] (3 vgprs, same shape) ... v_cmp_gt_i32 s[68:69], v12, 7 // mmak=0 ir=0: sFull = diff > 7 v_cndmask_b32 v13, v15, -1, s[68:69] // mmak=0 ir=0 = sFull ? full : boundary[0] v_cmp_le_i32 s[68:69], v12, 0 // mmak=0 ir=0: sZero = diff <= 0 v_cndmask_b32 v13, v13, 0, s[68:69] // mmak=0 ir=0 = sZero ? 0 : prev ... mmak=0 ir=1..3 + mmak=1 ir=0..3 (8 per-(mmak, ir) chains) ... v_cmp_gt_i32 s[68:69], v12, 39 // mmak=1 ir=0: sFull = diff > 39 v_cmp_le_i32 s[68:69], v12, 32 // mmak=1 ir=0: sZero = diff <= 32 ... ``` The per-mmak apply step keeps the existing `apply precomputed mask to ValuA/B[idx]` v_and_b32 emit (one v_and per VGPR per operand, mask source-vgpr indexed into our per-(mmak, ir) precomputed pool). Unit suite: 880 passed / 5 skipped / 2 xfailed (baseline 877; +3 from Item 1 + Item 2 new pins). Yaml smokes (all PASS): - subtile_bf16 - subtile_bf16_tail - subtile_bf16_anyk_k2 - subtile_bf16_anyk_odd - subtile_bf16_anyk_k8 - subtile_bf16_anyk_largemt (MT320x288 byte-refine; +5 vgprs over baseline, well under 256-VGPR budget) The mxfp4 / mxfp4_tail / mxfp4_tail_smoke yamls go through the coarse cmp path (MX scales gate `_subtileTailByteShiftApplies` to False), so they are unchanged by this commit; verified by running them in-place and inspecting the post-tail emit. Idea ported from sebvince #7683 commits a0fee26 (`Precompute mask for all subIterK`), e9f5f55b (`Simplify emit_mask_k`), d999a288 (`Simplify emit_mask_k_init`), and 4a4d6a596 (`Remove hardcoded values`). Co-authored-by: Cursor <cursoragent@cursor.com>
This reverts commit 59230da.
92adc6a to
c7ccdb7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds tail-loop support to the gfx950 subtile path for BF16 and MXFP4, removing the prior K-must-be-multiple-of-DepthU restriction. Subtile kernels now accept any K and emit a per-PGR tail loop that handles the remainder.
What this enables
Before this PR, subtile kernels required K % DepthU == 0. Any K with a remainder fell off the subtile path (or was rejected at Solution time). After this PR:
Code change
Produce PGR0 styled schedule in the Logical scheduler for the Tail loop.
Tests
Test plan