Skip to content

[Hipblaslt] [Subtile] Tail-loop support for BF16 and MXFP4 #7683

Open
sebvince wants to merge 56 commits into
ROCm:developfrom
sebvince:tailloop_bf16_fp4
Open

[Hipblaslt] [Subtile] Tail-loop support for BF16 and MXFP4 #7683
sebvince wants to merge 56 commits into
ROCm:developfrom
sebvince:tailloop_bf16_fp4

Conversation

@sebvince
Copy link
Copy Markdown
Contributor

@sebvince sebvince commented May 21, 2026

Summary

Adds tail-loop support to the gfx950 subtile path for BF16 and MXFP4, removing the prior K-must-be-multiple-of-DepthU restriction. Subtile kernels now accept any K and emit a per-PGR tail loop that handles the remainder.

What this enables

Before this PR, subtile kernels required K % DepthU == 0. Any K with a remainder fell off the subtile path (or was rejected at Solution time). After this PR:

  • BF16 / BSS / BBS: any K accepted.
  • MXFP4 (F4 + MXBlock=32): any K that is a multiple of 32 (the MX block size).

Code change

Produce PGR0 styled schedule in the Logical scheduler for the Tail loop.

  • Mask emit: per-K-block lane masks precomputed once and reused across sub-K iterations
  • Sub-VGPR masking: K%8==0 fast path avoids per-lane select
  • M/N dedup: shared mask application across the M and N axes
  • Multi-partition VGPR reuse: tile VGPRs reused across partitions to keep register pressure flat
  • K<DU skip: jumps directly into the tail loop when no main iteration fits
  • Early exit on each subIterK
  • Used flat single partition for tailloop

Tests

  • subtile_bf16.yaml: TAIL-loop sweep (K=1..128 step 1, every ASEM bucket) on MT 128x128 / 320x320 / 96x128 (odd M wavetile) across DU=64/128, PGR=0/1/2, BBS+bias+hipblaslt_all+ScaleAlphaVec
  • subtile_mxfp4.yaml: TAIL-loop sweep (K=32..512 step 32) on MT 128x128 / 64x64 across DU=256/512, PGR=0/1/2, F4BS+bias+hipblaslt_all+ScaleAlphaVec

Test plan

  • subtile-test bf16: PASSED=15066, FAILED=0 Rejected: 294
  • subtile-test mxfp4: PASSED=2883, FAILED=0 Rejected: 531

bnemanich added a commit that referenced this pull request May 21, 2026
The K-tail scaffold's per-mmak early exit emits
`s_cmp_le_u32 LoopCounterL, MIK*(subIterK+1)` with the threshold as a
compile-time literal. gfx950's VOPC / SOPC inline-constant range is
[-16, 64]; literals past 64 land in the assembler's 32-bit literal
slot, which some opcodes either reject or silently mis-encode. The
threshold exceeds 64 for typical bf16 (MIK=32, subIterK >= 2) and
for FP4 (MIK=128) at every mmak >= 0, so the existing emit was
already at risk on those configs even though our test fixtures
happened to cover only the inline range.

Adds `_emitSubtileScalarCmpLitOrStaged` which, for a given cmp op /
src0 / literal, emits the cmp directly when the literal fits in the
inline range, and otherwise stages it through a scratch sgpr via
`s_mov_b32 sScratch, <lit>` first. The per-mmak early exit now goes
through this helper.

Idea ported from sebvince's #7683 commit 6d7e743 (`Dont use literal
for cmp when diff >64`).

Test additions in `test_subtile_tailloop_emit.py`:
  - `test_per_mmak_early_exit_inline_when_consumedK_fits`: bf16
    fixture (consumedK=32) must use the inline form.
  - `test_per_mmak_early_exit_staged_when_consumedK_exceeds_inline`:
    FP4 fixture (consumedK >= 128) must stage every cmp.
  - `test_per_mmak_early_exit_boundary_consumedK_64`: custom
    bf16 DU=128 fixture pins the inclusive upper bound (32, 64
    inline; 96 staged).
  - `test_per_mmak_early_exit_threshold_progression` updated to
    accept both inline and staged forms via a new helper
    `_collect_per_mmak_early_exit_thresholds` that pairs each cmp
    with its preceding `stage literal ...` s_mov_b32 by sgpr name.

End-to-end verification on subtile_bf16.yaml shows the production
emit using 0x20 / 0x40 inline and staging 0x60..0xe0 through sgprs;
both subtile_bf16.yaml and subtile_bf16_anyk_largemt.yaml pass with
the change.

Unit suite: 871 passed / 5 skipped / 2 xfailed (baseline 868).

Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich added a commit that referenced this pull request May 21, 2026
The byte-refine K-tail mask path previously emitted the full
per-(operand, ir) chain (`v_mov seed` + per-mod runtime-gated
`v_cmp + v_cndmask + v_and`) inside the per-mmak loop body. The
chain is independent of A/B tile data — it depends only on
`kPosBase + mmak*MIK + ir*elementsPerVgpr` and `LoopCounterL` —
so it can be hoisted out once before the loop and the per-mmak
step collapses to pure `v_and_b32 vIdx, vMask[..], vIdx`.

This commit lands that hoist on the byte-refine path:

  * `_emitTailSubLaneMaskChainIntoVgpr`: extracted chain emitter
    that writes its mask to a caller-owned target VGPR (mirrors
    the body of `_emitTailSubLaneMaskRefineSubtile._emitChain`;
    keeps the bpe-parametric mod chain, the static `ASEM*bpe %
    bpr == 0` skip, and the runtime `LoopCounterL &
    (elementsPerVgpr-1)` gate).
  * `_emitTailSubLaneMaskPrecomputeSubtile`: allocates one scratch
    VGPR per (operand, mmak, ir) and emits the chain into each.
    When `bpeA == bpeB` (the common bf16/bf16 case) A and B share
    the same mask VGPR per (mmak, ir), halving the persistent
    VGPR cost. Returns the `(operand, mmak, ir) -> vgpr` lookup
    plus the list of VGPRs to release after the loop.
  * `_emitTailSubLaneMaskApplySubtile`: per-mmak v_and-only apply
    step. Comment carries `apply precomputed mask to ...` so the
    new shape is distinguishable from the legacy inline emit.

`_emitTailLoopScaffoldSubtile` now precomputes the mask map once
before the per-mmak loop and calls apply per mmak. Cleanup of the
precomputed VGPRs happens after the loop completes (the runtime
per-mmak early-exit branches past the cleanup so emit-time pool
accounting balances).

Gated by `kernel.get("SubtileTailMaskPrecompute", True)`. Setting
it False reverts to the legacy `_emitTailSubLaneMaskRefineSubtile`
per-mmak inline chain. The coarse path (ASEM>=numMIInUnroll, MX
scale) is untouched by this commit.

Scope:
  * Byte-refine path ONLY (ASEM<numMIInUnroll, integer bpe, no MX).
    The user-flagged risk of regressing the per-mmak LR + MFMA
    interleave on MT320x288 is avoided by NOT extending the
    precompute to the coarse path: that path's higher tile-grid
    counts would inflate the persistent VGPR cost on the
    high-water configs.
  * MT320x288 (subtile_bf16_anyk_largemt): byte-refine kernels
    stay at .vgpr_count=249 (same as the per-mmak-LR baseline);
    the +numMmaks*vgprPerInUnroll persistent mask VGPRs (8 for
    DU=64) do not push past the D-tile high-water.

Tests in `test_subtile_anyk_emit.py`:
  * `TestAnyKEmit_Precompute`: new class pinning the hoist
    (precompute block sits BEFORE the per-mmak ds_read wait;
    apply block contains no cmps/cndmasks), the A/B mask VGPR
    dedupe (bf16/bf16 → one VGPR per (mmak, ir) shared across
    operands), the config switch behaviour (off reverts to
    legacy inline shape), the NoTailLoop no-op, and the
    coarse-path identity (ASEM=32 emit byte-identical with the
    switch on vs off).
  * `TestAnyKEmit_K2::test_k2_emits_per_operand_byte_refine`,
    `TestAnyKEmit_K1::test_k1_no_narrow_load_and_emits_partial_chain`:
    updated to recognise the new `apply precomputed mask to ...`
    apply comment and to verify the chain/gate live in the
    precompute prefix rather than per-mmak.

Idea ported from sebvince's #7683 commit a0fee26 (`Precompute
mask for all subIterK`) with subsequent refinements (e9f5f55,
d999a28, 4a4d6a5). Sebvince's framework computes per-iteration
masks (BF16 3-state blend + boundary precompute); this port keeps
our existing bpe-parametric mod chain and just moves the
emit-site from inside the loop to before it.

Unit suite: 876 passed / 5 skipped / 2 xfailed (baseline 868;
+3 from Ask 1, +5 from Ask 2 precompute pins).

Yaml smokes (all PASS):
  - subtile_bf16
  - subtile_bf16_tail
  - subtile_bf16_anyk_k2
  - subtile_bf16_anyk_odd
  - subtile_bf16_anyk_k8
  - subtile_bf16_anyk_largemt (MT320x288 byte-refine, vgpr=249)
  - subtile_mxfp4_tail

Co-authored-by: Cursor <cursoragent@cursor.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 21, 2026

Codecov Report

❌ Patch coverage is 56.13718% with 243 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 1.00% 198 Missing ⚠️
...ite/Tensile/Components/Subtile/LogicalScheduler.py 86.00% 23 Missing and 5 partials ⚠️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py 7.14% 13 Missing ⚠️
...e/Tensile/Components/Subtile/InstructionEmitter.py 97.84% 1 Missing and 2 partials ⚠️
...lt/tensilelite/Tensile/SolutionStructs/Solution.py 0.00% 1 Missing ⚠️

❌ Your project status has failed because the head coverage (27.58%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #7683       +/-   ##
============================================
- Coverage    61.87%   30.35%   -31.51%     
============================================
  Files         2086      161     -1925     
  Lines       357038    67735   -289303     
  Branches     53806    13045    -40761     
============================================
- Hits        220892    20560   -200332     
+ Misses      117348    45380    -71968     
+ Partials     18798     1795    -17003     
Flag Coverage Δ
TensileLite 27.58% <56.14%> (+1.64%) ⬆️
hipBLAS ?
hipBLASLt 41.27% <ø> (ø)
hipCUB ?
hipDNN ?
hipFFT ?
hipRAND ?
hipSOLVER ?
hipSPARSE ?
rocBLAS ?
rocFFT ?
rocRAND ?
rocSOLVER ?
rocSPARSE ?
Files with missing lines Coverage Δ
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py 9.43% <ø> (-0.15%) ⬇️
...lt/tensilelite/Tensile/SolutionStructs/Solution.py 7.19% <0.00%> (-0.12%) ⬇️
...e/Tensile/Components/Subtile/InstructionEmitter.py 96.74% <97.84%> (+3.83%) ⬆️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py 62.39% <7.14%> (-1.55%) ⬇️
...ite/Tensile/Components/Subtile/LogicalScheduler.py 87.68% <86.00%> (+0.47%) ⬆️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 7.38% <1.00%> (-0.14%) ⬇️

... and 1946 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

bnemanich added a commit that referenced this pull request May 23, 2026
In the subtile tail scaffold, the PGR>0 SRD-advance / LWA-XOR
block (which undoes the mainloop-exit's last `gr_inc + lw_swap` so
the tail's GR+LR can re-issue at K_aligned with LWA/LR in sync)
sat AFTER `calculateLoopNumIter(-1)`, i.e. on the not-taken side of
the K%DU==0 early-exit cmp/branch to SkipTailLoopL.

Sebvince's #PR-7683 design (`build_tailloop_pgr0` + `Subtile/
Kernel.mainLoop`) emits the analogous mainloop-exit code as part
of `emitMainAndExitLoops`, BEFORE the orchestrator calls
`calculateLoopNumIter(-1)` and `emitTailLoop`. Nakajee's MT320x320
side-by-side flagged sebvince's placement as preferable for two
reasons:

  - The SRD-add / LWA-xor chain has small dependency lengths; on
    sebvince's side it co-issues with calculateLoopNumIter's
    scalar divide-and-remainder, whereas our previous placement
    serialized it after the cmp+branch (a hard dependency on the
    scc / branch resolution).
  - The OrigLoopCounter snapshot SGPR our scaffold held across
    `calculateLoopNumIter` (because it resets OrigLoopCounter to 0
    at the start of the tail-number-of-iters compute) becomes
    unnecessary once the gate runs before that reset -- the gate
    can read `sgprOrigLoopCounter` directly. One fewer SGPR live
    across the tail-entry block.

This commit moves the PGR>0 gating block (origCounter == 0 c=0
branch + 0 < origCounter < PGR small-counter realign + origCounter
>= PGR large-counter SRD-advance) to ABOVE `calculateLoopNumIter`
and drops the `savedOrigKAlignDUs` snapshot SGPR. Safety: on the
early-exit path (K%DU == 0 -> SkipTailLoopL) the now-unconditional
SRD-advance / LWA-XOR is harmless -- the post-tail write-out reads
SrdC/SrdD, never SrdA/B/MXSA/MXSB or LocalWriteBaseAddr.

Test updates in `test_subtile_tailloop_emit.py`:
  - `test_emits_origCounter_snapshot{PGR1,PGR2}` -> renamed and
    flipped to negative pins
    `test_omits_origCounter_snapshot_after_hoist` (the snapshot
    s_mov_b32 must NOT appear).
  - New `test_gating_reads_OrigLoopCounter_directly`: the c=0 cmp
    src0 must spell out `s[sgprOrigLoopCounter]`.
  - New `test_gating_block_precedes_TailLoopBeginL`: structural
    pin that the c=0 origCounter cmp appears BEFORE both the
    `skip to end of tail loop b/c numIter==0` early-exit cmp and
    the `TailLoopBeginL` label.
  - `test_emits_small_counter_compare`: updated to require the
    `s_cmp_lt_u32 s[sgprOrigLoopCounter], 1` form (no anonymous
    snapshot sgpr).

Idea ported from sebvince #7683 (`build_tailloop_pgr0` design /
`Subtile/Kernel.mainLoop` wrapper order). Nakajee's MT320x320
side-by-side comparison flagged this placement difference and
preferred sebvince's; this commit lands the move.

Unit suite: 879 passed / 5 skipped / 2 xfailed (baseline 877;
+2 from the new gating positive pins).

Yaml smokes (all PASS):
  - subtile_bf16
  - subtile_bf16_anyk_odd
  - subtile_bf16_anyk_largemt (MT320x288 byte-refine)

Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich added a commit that referenced this pull request May 23, 2026
The byte-refine K-tail mask precompute previously emitted one
bpe-parametric `byteRefine[<op> ir=N mmak=M]` chain per
(operand, mmak, ir): seed `0xFFFFFFFF` mov + per-mod (mod chain
length 2 for bf16, 4 for fp8) `v_add + v_cmp_ge_i32 +
v_cndmask_b32 vSeed` + mod=0 close. For bf16 byte-refine that is
7 instructions per (mmak, ir) chain, multiplied by `numMmaks *
vgprPerInUnroll` chains (e.g. 2*4 = 8 chains -> 56 instr on the
DU=64 path, 4*4 = 16 chains -> 112 instr on the DU=128 path).

Sebvince's #7683 design factors out the per-lane invariants once
(`emit_mask_k_init`) and turns the per-subIterK chain into a pure
3-state diff/boundary cmp+cndmask (`emit_mask_k`). Nakajee's
MT320x320 side-by-side review of #7661 vs #7683 flagged this as
"much simpler" and asked to verify equivalence before adopting.

Equivalence verification (sebvince's BF16 chain vs ours, with
`numMIInUnroll = 8`, `kStride = 2`, `vgprPerInUnroll = 4`):

  laneK_0 := tidInK * numMIInUnroll        (== our kPosBase)
  diff    := LoopCounterL - laneK_0        (signed)
  effective_diff_n := diff - n*MIK         (n = mmak / subIterK)
  d       := LoopCounterL % numMIInUnroll  (== effective_diff_n
                                            in the boundary range)

For each (lane, mmak=n, vgpr i):

  sFull = diff > n*MIK + numMIInUnroll - 1
       <=> effective_diff_n >= numMIInUnroll
       <=> ALL of this lane's K is in range -> mask = -1
  sZero = diff <= n*MIK
       <=> effective_diff_n <= 0
       <=> NONE of this lane's K is in range -> mask = 0
  otherwise -> mask = boundary[i] where
       d <=  i*2   -> 0
       d ==  i*2+1 -> 0x0000FFFF   (low bf16 in, high past)
       d >=  i*2+2 -> -1

Truth-table check (tidInK=0, mmak=0, K_remain in {1..8}):

  K_rem=1: i=0 mask=0x0000FFFF (low K=0 in, high K=1 past)
           i=1..3 mask=0
  K_rem=2: i=0 -1   ; i=1..3 mask=0
  K_rem=3: i=0 -1   ; i=1 mask=0x0000FFFF ; i=2,3 mask=0
  K_rem=4: i=0,1 -1 ; i=2,3 mask=0
  K_rem=5: i=0,1 -1 ; i=2 mask=0x0000FFFF ; i=3 mask=0
  K_rem=6: i=0,1,2 -1 ; i=3 mask=0
  K_rem=7: i=0,1,2 -1 ; i=3 mask=0x0000FFFF
  K_rem=8: i=0..3 -1  (sFull fires for tidInK=0)

Our existing chain (bpe-parametric mod chain) produces the same
masks bit-for-bit for every entry above (and for tidInK > 0 the
diff/sFull/sZero arithmetic and our `K_pos vs LoopCounterL` cmps
algebraically coincide). Verified by hand on the full K_remain
1..8 x i 0..3 truth table.

Caveats (all non-blocking for our gauntlet):

  * Sebvince's chain does NOT have our `(ASEM * bpe) % bpr == 0`
    static skip; the boundary[i] init runs unconditionally. For
    ASEM=2/4 (currently fast-path with static skip dropping the
    mod>0 chain) this emits a few more instructions, but the
    mask bit-pattern stays identical (d is even -> halfKeep
    branch never fires; only the full/zero outcomes are reachable).
  * Sebvince's chain assumes `bpeA == bpeB` (one shared boundary[i]
    per vgpr position, applied to both operands' tiles). All
    current gauntlet configs are symmetric bpe (`_subtileTailByteShiftApplies`'s
    integer-bpe gate plus `_emitTailSrdTightenSubtile`'s explicit
    `bpeA == bpeB` check). For asymmetric bpe (no current YAML)
    the dispatcher falls back to the legacy chain.
  * Storage: init holds `1 (diff) + vgprPerInUnroll (boundary)` =
    5 persistent VGPRs for BF16 (numMIInUnroll=8 -> vgprPerInUnroll
    =4), on top of our existing `numMmaks * vgprPerInUnroll`
    precomputed masks. On MT320x288 byte-refine (`subtile_bf16_anyk_largemt`,
    baseline vgpr_count=249) the +5 fits well within the 256-VGPR
    budget; verified by re-running the yaml under the new emit
    (still PASS).

This commit lands the adoption:

  * `_emitTailSubLaneMaskInitSebvince(kPosBaseVgpr, numMIInUnroll,
    bpe, vgprPerInUnroll)` -> emits `diff = sgprLoopCounterL -
    kPosBase` (signed v_sub_i32) + `d = sgprLoopCounterL & 7` +
    per-i `(d<hi) ? halfKeep : full ; (d<lo) ? 0 : prev` boundary
    cndmasks. Returns the persistent diff and boundary[i] vgprs.
  * `_emitTailSubLaneMaskChainIntoVgprSebvince(diffVgpr,
    boundaryMaskVgpr, ...)` -> per-(operand, mmak, ir): two cmps
    + two cndmasks. VOPC inline-range staging via
    `_subtileCmpSrc1FitsInline` for `mmak*MIK + ... > 64` (BF16
    DU=128, mmak=2,3).
  * `_emitTailSubLaneMaskPrecomputeSubtile` dispatches: bpeA ==
    bpeB == 2 && `SubtileTailMaskSebvinceForm=True` (default) ->
    sebvince form; otherwise -> legacy chain. Legacy chain
    retained for fp8 / int8 byte-refine (asymmetric bpe gates
    deferred) and as a reversibility escape hatch.

Tests in `test_subtile_anyk_emit.py`:

  * `_emit_anyk_tail_asm` gains a `sebvinceForm` kwarg (default
    True) so individual tests can opt back into the legacy chain
    for regression coverage.
  * `TestAnyKEmit_K4::test_k4_sebvince_form_emits_init_and_per_mmak_chain`
    (replaces `test_k4_byte_refine_mod0_only`): pins sebvince
    init marker + per-(mmak, ir) sFull cndmask count + absence
    of the legacy mod>0 / mod=0 byteRefine chain seed.
  * `TestAnyKEmit_K4::test_k4_legacy_form_emits_mod0_only`: new
    regression pin for `SubtileTailMaskSebvinceForm=False`
    (legacy chain still emits the mod=0-only chain on ASEM=4).
  * `TestAnyKEmit_K2::test_k2_emits_sebvince_form_chain` (replaces
    `test_k2_emits_per_operand_byte_refine`): pins cmp count
    growth vs K%32 baseline + sebvince diff init + sFull/sZero
    pairing + `d = LoopCounterL % 8` init.
  * `TestAnyKEmit_K1::test_k1_no_narrow_load_and_emits_sebvince_chain`
    (replaces `test_k1_no_narrow_load_and_emits_partial_chain`):
    pins absence of narrow d16 load + sebvince init + boundary
    cndmask + per-(mmak, ir) sFull/sZero cndmasks; the chain
    sits in the precompute prefix (before per-mmak ds_read wait).
  * `TestAnyKEmit_Precompute::test_precompute_block_before_per_mmak_loop`:
    rewritten to pin sebvince diff init + per-mmak sFull cmps in
    the precompute section; apply section has no chain primitives.
  * `TestAnyKEmit_Precompute::test_precompute_hoisted_above_dtl_wait_and_barrier`:
    rewritten to pin sebvince `diff` + per-(mmak, ir) sFull/sZero
    markers above the DTL wait.

Asm excerpt from `subtile_bf16_anyk_odd.yaml` MT128x128 DU=64
(`Cijk_Alik_Bljk_BSS_BH ... MT128x128x64`), the new precompute
prefix (lines 1414-1453):

```
v_sub_i32 v12, s[sgprLoopCounterL], v11    // diff = LoopCounterL - kPosBase
v_mov_b32 v13, 0x0000FFFF                  // halfKeep
v_and_b32 v14, 7, s[sgprLoopCounterL]      // d = LoopCounterL % 8
v_cmp_lt_i32 s[68:69], v14, 2              // boundary[0]: d < 2 ?
v_cndmask_b32 v15, -1, v13, s[68:69]       // boundary[0] = (d<2) ? halfKeep : full
v_cmp_lt_i32 s[68:69], v14, 1              // boundary[0]: d < 1 ?
v_cndmask_b32 v15, v15, 0, s[68:69]        // boundary[0] = (d<1) ? 0 : prev
... boundary[1..3] (3 vgprs, same shape) ...
v_cmp_gt_i32 s[68:69], v12, 7              // mmak=0 ir=0: sFull = diff > 7
v_cndmask_b32 v13, v15, -1, s[68:69]       // mmak=0 ir=0 = sFull ? full : boundary[0]
v_cmp_le_i32 s[68:69], v12, 0              // mmak=0 ir=0: sZero = diff <= 0
v_cndmask_b32 v13, v13, 0, s[68:69]        // mmak=0 ir=0 = sZero ? 0 : prev
... mmak=0 ir=1..3 + mmak=1 ir=0..3 (8 per-(mmak, ir) chains) ...
v_cmp_gt_i32 s[68:69], v12, 39             // mmak=1 ir=0: sFull = diff > 39
v_cmp_le_i32 s[68:69], v12, 32             // mmak=1 ir=0: sZero = diff <= 32
...
```

The per-mmak apply step keeps the existing `apply precomputed
mask to ValuA/B[idx]` v_and_b32 emit (one v_and per VGPR per
operand, mask source-vgpr indexed into our per-(mmak, ir)
precomputed pool).

Unit suite: 880 passed / 5 skipped / 2 xfailed (baseline 877;
+3 from Item 1 + Item 2 new pins).

Yaml smokes (all PASS):
  - subtile_bf16
  - subtile_bf16_tail
  - subtile_bf16_anyk_k2
  - subtile_bf16_anyk_odd
  - subtile_bf16_anyk_k8
  - subtile_bf16_anyk_largemt (MT320x288 byte-refine; +5 vgprs
    over baseline, well under 256-VGPR budget)

The mxfp4 / mxfp4_tail / mxfp4_tail_smoke yamls go through the
coarse cmp path (MX scales gate `_subtileTailByteShiftApplies`
to False), so they are unchanged by this commit; verified by
running them in-place and inspecting the post-tail emit.

Idea ported from sebvince #7683 commits a0fee26
(`Precompute mask for all subIterK`), e9f5f55b (`Simplify
emit_mask_k`), d999a288 (`Simplify emit_mask_k_init`), and
4a4d6a596 (`Remove hardcoded values`).

Co-authored-by: Cursor <cursoragent@cursor.com>
@sebvince sebvince force-pushed the tailloop_bf16_fp4 branch from 92adc6a to c7ccdb7 Compare May 27, 2026 13:46
@sebvince sebvince marked this pull request as ready for review May 28, 2026 17:39
@sebvince sebvince requested a review from a team as a code owner May 28, 2026 17:39
@sebvince sebvince requested a review from nakajee May 28, 2026 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants