[hipBLASLt] [TensileLite] Add initial tail loop support for Subtile path#7636
[hipBLASLt] [TensileLite] Add initial tail loop support for Subtile path#7636bnemanich wants to merge 12 commits into
Conversation
Codecov Report❌ Patch coverage is ❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #7636 +/- ##
===========================================
+ Coverage 61.95% 61.96% +0.01%
===========================================
Files 2086 2086
Lines 357070 357436 +366
Branches 53779 53884 +105
===========================================
+ Hits 221221 221469 +248
- Misses 117055 117132 +77
- Partials 18794 18835 +41
*This pull request uses carry forward flags. Click here to find out more.
🚀 New features to boost your workflow:
|
|
It does not have code to skip first prefetch GR for sgprLoopCounterL=0 case. |
|
I think it might be good to have an option to put some non 0 value for MX swizzle padding. |
| if kernel["ProblemType"].get("MXBlockB", 0) > 0: | ||
| realignTcs.append('MXSB') | ||
|
|
||
| c0ResetLabel = Label("PGRTailC0Reset%s" % loopChar, "") |
There was a problem hiding this comment.
We should not need any special treatment for OrigCount==0 case in tailloop.
Init acc should be done in preloop.
| src0=sgpr(savedOrigCounterSgpr), | ||
| src1=pgr, | ||
| comment="origCounter < PGR?")) | ||
| module.add(SCBranchSCC1( |
There was a problem hiding this comment.
We should not need special treatment for LoopCounter < 2 case.
SrdA/B/MXSA/MXSB should be already adjusted in preloop.
|
|
||
| kPosBaseVgpr = self.vgprPool.checkOut(1, "kReg_first") | ||
| with self.allocTmpSgpr(1) as tmpSgprInfo: | ||
| module.add(vectorStaticRemainder(-1, kPosBaseVgpr, "Serial", |
There was a problem hiding this comment.
We can put this after GR and before wait.
| module = Module("tailLaneMaskApplySubtile") | ||
| laneSGPRCount = self.states.laneSGPRCount | ||
| for idx in aTile.regList.indices: | ||
| module.add(VCndMaskB32(dst=vgpr(idx), src0=vgpr(idx), src1=0, |
There was a problem hiding this comment.
We do not need to generate cndmask against same vgpr more than once.
Please add done check or something to avoid generating cndmask for same vgpr twice or more.
| for tailTile in tailAllocTiles: | ||
| tailTile.allocVgprTileRegisters_legacy(self, kernel) | ||
|
|
||
| # Re-issue one DepthU-shaped GR + LR. Byte-layout identical to a |
There was a problem hiding this comment.
We should adjust SrdA/B/MXSA/MXSB + 2 to avoid out of range access.
| # K_rem=32 (tail-only, K << DU) | ||
| - Exact: [128, 128, 1, 32] | ||
| # K_rem=256 (2 full last MFMAs, K_rem mod 128 == 0) | ||
| - Exact: [128, 128, 1, 256] |
There was a problem hiding this comment.
Would you please add more sizes for DU512?
(similar coverage as section 2 above)
| # Decorative: batched + partial-MFMA tail | ||
| - Exact: [128, 128, 2, 160] | ||
| # Decorative: odd M,N + smallest tail | ||
| - Exact: [ 63, 63, 1, 32] |
There was a problem hiding this comment.
Could we have some more iteration cases?
- 2 iteration + Tail
- 3 iteration + Tail
Followups on top of "subtile: address PR review for K-tail loop" addressing the remaining nakajee review comments on PR #7636 and fixing the unit-test failures introduced by the prior commit. 1. Comment 1 (no special OrigCounter==0 treatment in tail loop): move the c=0 skip gate fully upstream into `kernelBodySubtile`. The new `SkipSubtileMainLoop<L>` gate wraps the scheduler-emitted mainLoop (preLoop + mainloop + NGLL + NLL) and bypasses all of them when `OrigLoopCounter == 0`. With the gate in place, SRDs stay at K=0, LWA/LRA stay at buf 0, and accD stays at zero from `initVgprTilesToZero`, so the tail body just runs from setupNewTile's defaults -- no undo needed. The legacy `PGRTailC0Reset<L>` block in `_emitTailLoopScaffoldSubtile` (which used to zero accD, subtract one DU from each Srd<tc>, and XOR LWA back to buf 0 for PGR>=2) is removed entirely. The c=0 compare now branches straight into `PGRTailEntry<L>`. The legacy `SkipPreLoopGR<L>` gate that lived in the obsolete `_kernelBody` path is removed (subtile kernels never go through `kernelBody`; they use `kernelBodySubtile`). 2. Comment 3 (move kPosBase setup after GR, before wait): relocate the per-lane `kPosBase = tidInK * numMIInUnroll` vector remainder / divide / multiply chain to fire between the tail's global-read issue and the post-GR `s_waitcnt`. The vector math now overlaps with buffer-load memory latency instead of serializing in front of it. 3. Comment 4 (dedupe per-MFMA cndmask): the previous tail body emitted one `v_cndmask_b32` per (mmak, mma1, mma0) MFMA over each ValuA/B input vgpr. Because ValuA depends only on `(mma0, mmak)` and ValuB only on `(mma1, mmak)`, the same A vgprs were cndmasked `len(mma1)` times and the same B vgprs `len(mma0)` times per mmak. The cndmask chain is hoisted out of the inner (mma1, mma0) MFMA grid into the per-mmak block, with a `seenVgpr` set tracking which ValuA/B/MXSA/MXSB vgprs have already been masked. Each unique vgpr now gets exactly one `v_cndmask_b32` per mmak. The downstream MFMA grid runs cndmask-free. The now-redundant `_emitTailLaneMaskApplySubtile` helper is removed (its body is inlined into the scaffold). 4. Comment 5 (clamp Srd<tc>+2 at tail entry): documented as a follow-up TODO at the tail GR site. The buffer NumRecords field is a single linear-byte limit, so a tight per-row K clamp is only achievable for the last M-row of the tile. The per-MFMA lane mask remains the actual correctness mechanism for in-range M rows. Filed for a follow-up PR. 5. CI fix: update `test_subtile_tailloop_emit.py`: - `test_emits_c0_reset_compare_and_branch` (PGR=1, PGR=2): now asserts the c=0 compare branches to `PGRTailEntry<L>` (not `PGRTailC0Reset<L>`) and that the legacy reset block is absent. - Replace `test_emits_c0_srd_subtract_with_borrow`, `test_emits_c0_lwa_xor_realign`, `test_emits_c0_reset_label_and_accD_zero`, `test_omits_c0_srd_subtract_pgr1`, `test_emits_c0_accD_zero_pgr1`, `_extract_c0_block` with negative-assertion tests `test_omits_c0_reset_label`, `test_omits_c0_srd_subtract`, `test_omits_c0_lwa_xor_undo`, `test_omits_c0_reset_label_pgr1`, `test_omits_c0_undo_instructions_pgr1` that pin the legacy block / instructions as absent. Validation: full unit suite green (`pytest Tensile/Tests/unit/`: 821 passed, 5 skipped, 2 xfailed). Co-authored-by: Cursor <cursoragent@cursor.com>
CI hipBLASLt build on PR #7636 fails with: RuntimeError: Unknown assertion key: AssertFree1DivByMT1LowbitGT1 at `Contractions.py:ProblemPredicate.FromOriginalKeyPair` while `parseLibraryLogicData` walks a gfx950 logic YAML. Root cause: PR #7443 ("manual revert KRingShift") removed the parser handlers for `AssertFree1DivByMT1LowbitGT1` / `AssertKRingShiftTailWrapOnly`, and PR #7513 cleaned the matching keys out of the hipBLASLt / hipSPARSELt library logic YAML files. Both reverts are in our branch's history, but origin/develop has new gfx950 logic YAMLs (e.g. `gfx950/gfx950_128cu/Equality/ gfx950_Cijk_Alik_Bljk_HHS_BH_BiasSH_HAS_SAV_UserArgs.yaml`, added by #7125 "update cucount") that were generated *before* the revert pair and therefore still carry the deprecated keys. When CI merges our branch with origin/develop the merged tree contains a YAML the parser can no longer read. Fix: silently ignore the two deprecated keys in `ProblemPredicate.FromOriginalKeyPair`. Matches the spirit of the revert + cleanup pair without requiring follow-up YAML cleanup in develop, and is forward-compatible against any future stale YAML that slips through the same gap. Genuinely unknown `Assert*` keys still raise loudly. New unit tests in `test_Contractions_deprecated_asserts.py` pin the silent-ignore behavior for the two specific keys, verify that unrecognized `Assert*` keys still raise, and check that the existing known `AssertFree0/1ElementMultiple` + `AssertSummationElementMultiple` predicate path is unaffected. Validation: - new tests: 7 passed - full unit suite: 828 passed, 5 skipped, 2 xfailed (was 821 passed) - end-to-end smoke: drove `FromOriginalKeyPair` on a representative stale-key entry set; no `RuntimeError` raised; recognized keys still produce real predicates. Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses the latest nakajee review comment on PR #7636: > module.add(SMovB32(dst=sgpr("LoopCounterL"), src=0, > comment="single-iter tail: force closeLoop fall-through")) > module.add(self.closeLoop(kernel, tensorParametersA, tensorParametersB, > -1, finalLoop=True)) > > Seems like closeLoop here is not necessary. The subtile tail body processes the entire K_tail in a single pass via the `mmak` loop (every lane mask is emitted against the current `LoopCounterL = K mod DU` snapshot), so `closeLoop(... finalLoop=True)` would only contribute dead code: - per-iter `s_sub_i32 LoopCounterL, ..., MIK` decrement, - `s_cbranch_scc0 TailLoopBeginL` back-edge (no-op since we intentionally underflowed LoopCounterL to negative), - `OrigLoopCounter += MIK` increment (only consumed by the legacy LRO-damage recovery block, which is already bypassed for `UseSubtileImpl=1` kernels inside `closeLoop`), - `TailLoopEndL:` label (nothing branches to it — the only early-exit target is `SkipTailLoopL:`, emitted by the surviving `closeLoop(... emitEndLabelOnly=True)` call). Dropping the `closeLoop(... finalLoop=True)` call also removes the need for the `s_mov_b32 LoopCounterL, 0` workaround that was there purely to force the now-removed back-edge to fall through. Net asm delta per kernel (subtile tail body): - 1 `s_mov_b32 LoopCounterL, 0` - 1 `s_sub_i32 LoopCounterL, LoopCounterL, MIK` - 1 `s_add_u32 OrigLoopCounter, OrigLoopCounter, MIK` - 1 `s_cmp_le_i32 LoopCounterL, 0` - 1 `s_cbranch_scc0 label_TailLoopBeginL` - 1 `label_TailLoopEndL:` label All removed. Test updates: - `test_emits_loop_counter_zero_before_closeloop` (PGR=0) is inverted into `test_omits_closeloop_emit`: asserts that the `s_sub_i32 LoopCounterL ... dec counterL` decrement, the `s_mov_b32 LoopCounterL, 0` workaround, and `TailLoopEndL` are all absent from the tail block. - `test_pgr2_loop_counter_zero_before_closeloop` (PGR=2) is inverted into `test_pgr2_omits_closeloop_emit` with the same set of negative assertions. - `_emitTailLoopScaffoldSubtile` docstring updated to enumerate the three PGR>0 entry-gate paths (origCounter==0 / small / large) and to document that closeLoop is intentionally not emitted by the subtile scaffold. - `TestEmitAllLoopsTail_PGR0` class docstring and `test_omits_tail_when_NoTailLoop_true` docstring updated to note that `TailLoopEndL` is never emitted (was previously only asserted absent under NoTailLoop=True). Full unit suite: 828 passed, 5 skipped, 2 xfailed. Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses the second-to-last open nakajee review comment on PR #7636: > module.add(SCmpLtU32( > src0=sgpr(savedOrigCounterSgpr), > src1=pgr, > comment="origCounter < PGR?")) > module.add(SCBranchSCC1( > ... > > We should not need special treatment for LoopCounter < 2 case. > SrdA/B/MXSA/MXSB should be already adjusted in preloop. Background ---------- For PGR=2 origCounter==1 the previous tail scaffold emitted a "small-counter realign" block that XOR'd `LocalWriteBaseAddr{A,B, MXSA,MXSB}` back with their `Swap<tc>` masks to re-align LWA with the LDS buffer NLL drained from. The misalignment came from preLoop: GR(MT 0) GR_INC <-- SRD advance + LWA toggle to buf 1 Wait + Sync LR drains MT 0 from buf 0 SkipOp(LE 1, NLL) <-- for origCounter==1, branches here GR(MT 1) writes buf 1 SkipOp(LE 2, NGLL) The `GR_INC`'s LDS toggle still fired even when `SkipOp(LE 1, NLL)` bypassed `GR(MT 1)`, so on the c==1 path NLL ended with LWA at buf 1 while LR was still at buf 0 -- the tail had to XOR LWA back. Fix (per the review) -------------------- Split the compound `GRIncOp` into two ops just for preLoop's use: - `GRPtrIncOp`: SRD pointer advance only. - `GRLDSwapOp`: LDS double-buffer XOR only. The mainloop keeps the compound `GRIncOp` (its two halves are always paired one-to-one with an iteration). preLoop is restructured so the SRD advance fires before `SkipOp(LE 1, NLL)` (the tail still needs SRD at K=DU for c==1) and the LDS swap fires AFTER the SkipOp, right before `GR(MT 1)`: GR(MT 0) gr_ptr_inc (per tensor) <-- SRD advance, always fires Wait + Sync LR drains MT 0 from buf 0 SkipOp(LE 1, NLL) <-- branch over both lds_swap and GR(MT 1) for c==1 gr_lds_swap (per tensor) <-- LDS toggle, skipped for c==1 GR(MT 1) writes buf 1 SkipOp(LE 2, NGLL) With the swap gated by the same SkipOp that gates GR(MT 1), LWA and LR stay aligned for every origCounter: - c==0: upstream SkipSubtileMainLoop gate skips all of preLoop; LWA at initial buf 0, LR untouched. - c==1: SkipOp(LE 1) taken -> no LDS toggle; LWA at buf 0, LR at buf 0 (NLL drained MT 0 from buf 0). - c==2: SkipOp(LE 1) not taken, LDS toggle fires (LWA->buf 1), GR(MT 1) writes buf 1; SkipOp(LE 2, NGLL) taken; NGLL/NLL drain MT 1 from buf 1; LR ends at buf 1. - c>=3: SkipOp(LE 1)/SkipOp(LE 2) both not taken; mainloop runs (c-2) iters, each toggling LWA and LR; iter-pair cancellation keeps LWA and LR aligned at exit. Tail-scaffold simplification ---------------------------- With the preLoop fix in place, the tail's small-counter realign block (`SXorB32 LocalWriteBaseAddr<tc>, ..., Swap<tc>`) becomes unreachable / dead and is removed. The c==0 and small-counter compares also collapse into a single `s_cmp_lt_u32 origCounter, PGR` -- both cases need no SRD/LWA fix-up, only the large-counter case (`origCounter >= PGR`) still needs the existing +1 DU SRD advance to undo the per-iter GR_INC off-by-one at loop exit. Net asm delta per PGR=2 kernel (subtile tail body): - 1 `s_cmp_eq_u32 origCounter, 0` compare [removed] - 1 `s_cbranch_scc1 PGRTailEntry` (c=0 branch) [removed] - `PGRTailSmallCounterRealign<L>:` label [removed] - 4 `s_xor_b32 LocalWriteBaseAddr<tc>, ..., Swap<tc>` (per A/B/MXSA/MXSB) [removed] - 1 `s_branch PGRTailEntry` (large-counter fall-through bypass) [removed] Test updates ------------ - LogicalScheduler.py: new `GRPtrIncOp` and `GRLDSwapOp` dataclasses with `gr_ptr_inc` / `gr_lds_swap` kinds; `build_preloop` PGR=2 branch uses the split pair around `SkipOp(LE 1, NLL)`. - InstructionEmitter.py: new `emit_gr_ptr_inc` (ptr-only) and `emit_gr_lds_swap` (lds-only) handlers wired into the dispatch table. - KernelWriter.py `_emitTailLoopScaffoldSubtile`: small-counter realign block removed; c=0 and `<PGR` compares folded into one `s_cmp_lt_u32 origCounter, PGR`; docstring + inline comments updated. - test_subtile_tailloop_emit.py: invert `test_emits_small_counter_lwa_realign` to assert absence; replace `test_emits_c0_reset_compare_and_branch` (PGR=1 & PGR=2) with `test_emits_lt_pgr_compare_and_branch`; rename `test_emits_small_counter_compare` to `test_emits_lt_pgr_compare_immediate` and update its docstring; refresh class / module docstrings. - test_SubtileBasedLogicalScheduler.py: extend `TestBuildPreloop.test_256x256_fp4` to require both `gr_ptr_inc` and `gr_lds_swap` (and forbid the legacy `gr_inc`); add a strict ordering invariant pinning `gr_lds_swap` AFTER `skip(LE:1:NLL)`. - test_SubtileBasedSchedulerRef.py: refresh the three preloop golden references (256x256 FP4 1x1; 320x320 BF16 1x5 offset1 and offset_all) to show `gr_ptr_inc` ... skip ... `gr_lds_swap` in place of the old `gr_inc` ... skip ... shape. Validation: full unit suite green (828 passed, 5 skipped, 2 xfailed). Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses nakajee's open OOR review on PR #7661 (carried forward from PR #7636 "Comment 5" TODO at the tail GR site): > This still does not have any prevention for out of array access. > What we need is set SrdA/B/MXSA/MXSB + 2 to the exact end of array > (but 4 byte alignment). [...] > remainK = (k%DepthU); remainKalign = remainK & 0xfffffffe; > SrdA/B -= (DepthU - remainKalign) * bpe Before this commit, the subtile K-tail re-issued a DepthU-shaped GR against an SRD whose NumRecords (Srd<tc>+2) still spanned the full DepthU bytes after the last live K-element. For K_remain < DepthU, the last m-row's per-thread `buffer_load_b128` past `K_remain*bpe` could read past A/B's allocated end-of-array (buffer-OOB does NOT bail on past-allocation reads; only on past- NumRecords reads). The per-MFMA lane mask + sub-lane refine zeroed those VGPRs after the load, so values were correct, but the buffer engine could still touch unmapped pages and trigger an HSA fault on non-contiguous A/B allocations. New `_emitTailSrdTightenSubtile` runs at tail entry (right after the PGR>0 entry-gate `PGRTailEntry<L>` label, before `openLoop`) and shrinks each `Srd<tc>+2` by `DepthU*bpe - roundUp(K_remain*bpe, loadBytesGR)`. The `roundUp(..., loadBytesGR)` (vs nakajee's literal `& 0xfffffffe`) is what keeps our wide DTL load valid for the trailing odd-K element on the last m-row: the per-thread load is B128 (16 B) for bf16/fp16, so a single thread covers up to 8 K elements -- the load must succeed for the thread that holds `K_remain - 1`. nakajee's literal align-down assumed the narrow-trailing-element strategy (`buffer_load_d16_b16 ... lds` + lane-0-only); that path is rejected by the gfx950 assembler and was previously deleted in `7df7d24` ("remove dead bf16 narrow-load helper"). The align-up variant preserves nakajee's intent (clip past-K reads on the last m-row) without needing the narrow load. A single runtime `s_cmp_lt_u32 alignedBytes, DepthU*bpe` + `s_cbranch_scc0 TailSrdTightenSkip<L>` short-circuits the SSub chain when `alignedBytes >= DepthU*bpe` (K_remain close to DepthU with wide per-thread loads -- the natural SRD limit already covers every read). The skip label is the join target. Gating: - non-MX (`MXBlock{A,B} == 0`): MX scales have host re-scatter padding (`DataInitialization.rearrangePaddedMXScaleLayout`); MX data + MXSA/MXSB SRD tightening needs nakajee's swizzleBlock-aware formula (`remainK_MX = roundUp(remainK / 256) * 256`, `SrdMXSA/B -= (DepthU - remainK_MX) * swizzleSize0 = 32`) and is a separate follow-up. - non-swizzled A/B (`SwizzleTensor{A,B}=False`): subtile mxfp4 swizzled A/B need the same swizzleBlock formula; same follow-up. - symmetric per-tensor `bpe in {1, 2}` and matching `loadWidthGR`; bf16 / fp16 / int8 anyk paths are the immediate consumer. Per-kernel asm delta (bf16 fixture, B128 load, DU=64 -> depthUBytes=128): +5 s_* (lshl/add/and/cmp/cbranch) +1 s_sub_u32 (delta) +2 s_sub_u32 (Srd{A,B}+2 tighten) +1 label 8 instructions + 1 label, all inside the tail entry; no per-iter cost. Statically gated to no-op on MX / swizzled / non-bf16 paths. No SRD restore is needed: the tail body is the last GR site for A/B before the kernel epilogue (epilogue uses SrdC / SrdD). Stale TODO comment at the tail GR site (which previously said the tightening was filed as a follow-up) is updated to reflect the new state (A/B done here; MX + swizzled deferred with a pointer to nakajee's spec). Tests (`Tensile/Tests/unit/test_subtile_tailloop_emit.py::TestTailSrdTightenSubtile`, 10 new): - Pin the emit-time `s_lshl_b32 + s_add_u32 + s_and_b32` aligned-K chain, the runtime no-op `s_cmp_lt_u32 + s_cbranch_scc0` skip, the `s_sub_u32 SrdA+2 / SrdB+2` tightening (with delta precompute), and the `TailSrdTightenSkip<L>:` join label. - Strict order: branch < SrdA+2 sub < skip label (so the short-circuit actually short-circuits). - PGR>0 placement: tightening fires AFTER `PGRTailEntry<L>:` so c=0 / small-counter / large-counter paths have all converged onto the same SRD state, then BEFORE the tail GR. - Negative pins: MX fp4 emits no tightening (MXBlock>0 gate); NoTailLoop=True emits no tightening (scaffold early-returns). Validation: - Full unit suite: 887 passed, 5 skipped, 2 xfailed (was 887 / same). - gfx950 yaml gauntlet on MI355X: subtile_bf16.yaml : 7082 / 7082 PASS subtile_bf16_tail.yaml : 450 / 450 PASS subtile_bf16_anyk_k8.yaml : 183 / 183 PASS subtile_bf16_anyk_k2.yaml : 183 / 183 PASS subtile_bf16_anyk_odd.yaml : 117 / 117 PASS subtile_bf16_anyk_largemt.yaml: 4 / 4 PASS subtile_mxfp4.yaml : 2691 / 2691 PASS subtile_mxfp4_tail.yaml : 68 / 68 PASS subtile_mxfp4_tail_smoke.yaml: PASS Total ~10,778 problem runs, 0 failures. Co-authored-by: Cursor <cursoragent@cursor.com>
The subtile branch of minASEMforMX forced ASEM to 256 (= DepthU) for MX kernels, making `NoTailLoop = (ASEM % DepthU == 0)` always True and silently rounding K up to a multiple of DepthU on the host side. With the subtile K%32 tail loop now wired in (subsequent commit), the subtile MX path can use the same ASEM=32 minimum as the rest of the MX path. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 61e236a)
Lifts the subtile K constraint from `K % DepthU == 0` to `K % 32 == 0`
by adding a real K%32 tail loop to the subtile emit path. The tail
scaffold lives in `KernelWriter._emitTailLoopScaffoldSubtile`; it
emits the LoopCounterL early-exit, per-lane kPosBase, re-issued
DepthU-shaped GR + LR for A/B (and MXSA/MXSB), per-mmak
`v_cmp_ge_i32` + per-MFMA `v_cndmask_b32` lane-mask, and MFMAs into
the existing D accumulators. The body runs as a single pass; the
legacy closeLoop's per-iter decrement is short-circuited by zeroing
LoopCounterL beforehand.
PGR>0 is gated by three mutually-exclusive entry paths keyed off a
snapshot of OrigLoopCounter (= K // DU) taken before
`calculateLoopNumIter` resets it:
- origCounter == 0: c=0 reset (zero accD, undo preLoop
GR_INC SRD-advance + LWA-XOR for
PGR>=2), fall through to tail body.
- 0 < origCounter < PGR: small-counter LWA realign (XOR back to
buf 0 to match LR; no SRD advance).
- origCounter >= PGR: large-counter +1 DU SRD advance to land
at K_aligned.
The PGR=0 mainloop gets an underflow pre-guard
(`s_cmp_eq_u32 LoopCounterL, 0` + branch to SkipMainloop) so the
do-while body is bypassed when K < DepthU.
`KernelWriterAssembly.closeLoop`'s legacy LRO/LWA damage-recovery
references `vgprLocalReadAddrA/B` symbols subtile kernels never
define, so it is bypassed for `UseSubtileImpl`. The host-side MX
scale initializer re-scatters the canonical DGen layout into the
padded mxsa/mxsb tensor descriptor on gfx950 so the CPU reference
observes the same scale bytes the GPU consumes.
Other touch-ups:
- `Subtile/Kernel.py`: name the 4-tiles-per-VGPR scale packing
constant (`MX_SCALE_TILES_PER_VGPR`); split per-mmak setup from
per-MFMA cndmask via `_emitTailKPosCmpSubtile` /
`_emitTailLaneMaskApplySubtile`; drop the dead `preLoop`.
- `SubtileScaleEmit.localReadDoScaleSubtile`: use
`lrLocalSubtileGrid` (not the [1,1]-frozen `localSubtileGrid`)
so all scale VGPRs get a ds_read on MT >= 128x128.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit bd0c7f9)
End-to-end yaml tests in `Tensile/Tests/common/gemm/gfx950`:
- `subtile_bf16_tail.yaml`: BF16 tail across K_rem in {16, 32, 48, 64},
DepthU in {64, 128}, PGR in {0, 1, 2}, StreamK on/off, MT128x128
and MT256x256.
- `subtile_mxfp4_tail.yaml`: MXFP4 tail across the K_rem partitions
(partial-MFMA: 32, 64, 96, 160, ...; full-last-MFMA: 128, 256),
DU=256 and DU=512, PGR / StreamK / bias / dest-type cells (~14
kernels, ~80 problem runs).
- `subtile_mxfp4_tail_smoke.yaml`: 1-kernel smoke variant.
Unit tests in `Tensile/Tests/unit`:
- `_subtile_tailloop_fixtures.py`: shared kwa-builder, kernel-keys
helper, tile-info population, and skiptoend wrapper used by
`test_subtile_tailloop_emit.py` and the new tail-emit class in
`test_SubtileBasedLogicalScheduler.py`.
- `test_subtile_tailloop_emit.py`: tail-body content assertions
(no SRD rewind on PGR=0, single-iter forcing, kPosBase init,
lane mask on A/B/MXSA/MXSB, one ds_read per scale group, no
LDS pre-zero, PGR>0 entry-gate branches).
- `test_solution_subtile_tailloop.py`: solution-level gating
(ASEM=32 + NoTailLoop=False for subtile MX).
- `test_SubtileBasedLogicalScheduler.py`: PGR=0 mainloop underflow
pre-guard structural assertions + tail-loop scaffold placement.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 4fb7e10)
The transA=true (A) and transB=false (B) code paths each had a copy of the same loop that re-scatters DGen's canonical scale bytes into the padded mxsa/mxsb descriptor strides. Extract the shared body into a single static helper, `rearrangePaddedMXScaleLayout`, and call it from both sides. No behavioral change; the helper preserves the existing needsRearrange short-circuit and the per-batch byte layout. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 2a34490)
Solution.py rejects every subtile solution with `StreamK == 0` (`"UseSubtileImpl=1 supports StreamK only (no support for GSU)"`). Sections 2-6 of `subtile_mxfp4_tail.yaml` were pinned to `StreamK: [0]` and therefore produced 0 valid solutions at SolutionStructs, failing CI with `SystemExit: -1` after section 1 ran. Section 1 was hiding the bug behind `StreamK: [0, 3]`, which silently drops the SK=0 half. Change sections 2-6 to `StreamK: [3]` and update the header / per-section comments to reflect the SK=0 silent filtering in section 1 and the SK=3-only reality elsewhere. No other yaml content changes. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 1d68fc2)
Two reviewer-requested improvements on top of the K%32 tail-loop work:
1. Skip the preLoop GR block when origCounter == 0.
`_kernelBody` now wraps the `if PrefetchGlobalRead:` block with an
`OrigLoopCounter == 0` gate (Subtile + PGR>0 only) that branches
past the wait-for-PGR, preLoop local write, LWA swap, and PGR>=2
prefetch GR/GR_INC sequence to a fresh `SkipPreLoopGR<loopChar>`
label placed just before the unrolled loop opens.
With the gate, SRDs and LWAs stay at `setupNewTile`'s defaults when
K < DepthU, so the tail scaffold's c=0 reset path collapses to a
single `initVgprTilesToZero` -- the previous PGR>=2 "undo preLoop
GR_INC" SRD subtract + LWA XOR loops are removed. The tail body's
re-issued GR + LR runs from the same setupNewTile state on every
K_rem < DepthU problem regardless of PGR.
2. Configurable MX scale padding fill byte.
New `GlobalParameters["MXScalePadByte"]` (default 0, mirrors current
behaviour) plumbed through:
- `ClientWriter.py` -> `--mx-scale-pad-byte <N>` (only emitted
when non-zero, keeps run.sh and CI calls unchanged in default
configurations).
- `client/main.cpp` registers `--mx-scale-pad-byte` (0..255).
- `DataInitialization` stores `m_mxScalePadByte` and uses it for
the bulk `memset` of `pristineE8A/B.cpuInput.valid`, for the
inner `memset` in `rearrangePaddedMXScaleLayout`, and for the
initial fill of the preswizzled `gpuScaleBuf`.
Setting `MXScalePadByte: 255` (E8M0 NaN/Inf) seeds every byte of
MX scale padding with a poison value; any A/B lane-mask leak in
the GPU kernel into the padded region then propagates inf/NaN
through the MFMA and trips validation instead of silently being
multiplied by 0.
Validation: subtile_mxfp4_tail.yaml passes locally with the default
(MXScalePadByte=0, exercises Comment 1's gate across PGR=[0,1,2] and
K_rem in [32, 96, 128, 160, 224, 288, 384]) and with MXScalePadByte=255
(50/50 problem runs PASSED, confirming the kernel's per-MFMA lane mask
correctly suppresses padded scale lanes).
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 24369f3)
Set `MXScalePadByte: 255` (E8M0 0xFF = NaN/Inf) on the three subtile MX-format yamls so CI actively exercises the new poison-padding option: - subtile_mxfp4.yaml (aligned-K, M/N edge coverage) - subtile_mxfp4_tail.yaml (K%32 tail coverage, all PGR variants) - subtile_mxfp4_tail_smoke.yaml (fast tail smoke) Any A/B/MXSA/MXSB lane in the GPU kernel that reads a padded scale position will now pull an E8M0 NaN, propagate it through the MFMA into accD, and trip validation -- both for K-direction padding (subtile tail-loop mask) and for M/N-direction padding (wave-tile OOB lanes). Locally re-verified that all three yamls still pass with the poison padding active (no spurious failures introduced). Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit c7009b4)
Followups on top of "subtile: address PR review for K-tail loop" addressing the remaining nakajee review comments on PR #7636 and fixing the unit-test failures introduced by the prior commit. 1. Comment 1 (no special OrigCounter==0 treatment in tail loop): move the c=0 skip gate fully upstream into `kernelBodySubtile`. The new `SkipSubtileMainLoop<L>` gate wraps the scheduler-emitted mainLoop (preLoop + mainloop + NGLL + NLL) and bypasses all of them when `OrigLoopCounter == 0`. With the gate in place, SRDs stay at K=0, LWA/LRA stay at buf 0, and accD stays at zero from `initVgprTilesToZero`, so the tail body just runs from setupNewTile's defaults -- no undo needed. The legacy `PGRTailC0Reset<L>` block in `_emitTailLoopScaffoldSubtile` (which used to zero accD, subtract one DU from each Srd<tc>, and XOR LWA back to buf 0 for PGR>=2) is removed entirely. The c=0 compare now branches straight into `PGRTailEntry<L>`. The legacy `SkipPreLoopGR<L>` gate that lived in the obsolete `_kernelBody` path is removed (subtile kernels never go through `kernelBody`; they use `kernelBodySubtile`). 2. Comment 3 (move kPosBase setup after GR, before wait): relocate the per-lane `kPosBase = tidInK * numMIInUnroll` vector remainder / divide / multiply chain to fire between the tail's global-read issue and the post-GR `s_waitcnt`. The vector math now overlaps with buffer-load memory latency instead of serializing in front of it. 3. Comment 4 (dedupe per-MFMA cndmask): the previous tail body emitted one `v_cndmask_b32` per (mmak, mma1, mma0) MFMA over each ValuA/B input vgpr. Because ValuA depends only on `(mma0, mmak)` and ValuB only on `(mma1, mmak)`, the same A vgprs were cndmasked `len(mma1)` times and the same B vgprs `len(mma0)` times per mmak. The cndmask chain is hoisted out of the inner (mma1, mma0) MFMA grid into the per-mmak block, with a `seenVgpr` set tracking which ValuA/B/MXSA/MXSB vgprs have already been masked. Each unique vgpr now gets exactly one `v_cndmask_b32` per mmak. The downstream MFMA grid runs cndmask-free. The now-redundant `_emitTailLaneMaskApplySubtile` helper is removed (its body is inlined into the scaffold). 4. Comment 5 (clamp Srd<tc>+2 at tail entry): documented as a follow-up TODO at the tail GR site. The buffer NumRecords field is a single linear-byte limit, so a tight per-row K clamp is only achievable for the last M-row of the tile. The per-MFMA lane mask remains the actual correctness mechanism for in-range M rows. Filed for a follow-up PR. 5. CI fix: update `test_subtile_tailloop_emit.py`: - `test_emits_c0_reset_compare_and_branch` (PGR=1, PGR=2): now asserts the c=0 compare branches to `PGRTailEntry<L>` (not `PGRTailC0Reset<L>`) and that the legacy reset block is absent. - Replace `test_emits_c0_srd_subtract_with_borrow`, `test_emits_c0_lwa_xor_realign`, `test_emits_c0_reset_label_and_accD_zero`, `test_omits_c0_srd_subtract_pgr1`, `test_emits_c0_accD_zero_pgr1`, `_extract_c0_block` with negative-assertion tests `test_omits_c0_reset_label`, `test_omits_c0_srd_subtract`, `test_omits_c0_lwa_xor_undo`, `test_omits_c0_reset_label_pgr1`, `test_omits_c0_undo_instructions_pgr1` that pin the legacy block / instructions as absent. Validation: full unit suite green (`pytest Tensile/Tests/unit/`: 821 passed, 5 skipped, 2 xfailed). Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 664e471)
Two nakajee review comments on `subtile_mxfp4_tail.yaml`:
1. Section 3 (DU=512): "Would you please add more sizes for DU512?
(similar coverage as section 2 above)"
Section 3 previously had just K_rem in {32, 256} — a proof-of-life
only. Expand to mirror Section 2's K_rem coverage (which exercises
partial-first-MFMA, full-last-MFMA, partial-near-full, and the
N*DU + tail entry-gate paths). New K values for DU=512:
- K_rem=96, 128, 224, 384 (K_rem < DU coverage)
- K = 1*512 + 128 = 640 (1 main iter + full-last-MFMA tail;
PGR=2 small-counter realign path)
- K = 2*512 + 32 = 1056 (2 main iters + tail; PGR=2
large-counter SRD-advance path)
Section 3 now runs 8 K_rem × PGR={0,2} = 16 problem runs (was 4).
2. Section 1 decorative tail: "Could we have some more iteration cases?
2 iteration + Tail, 3 iteration + Tail"
Section 1 previously only had K = 1*DU + 32 = 288 (1 main iter +
tail; PGR=2 small-counter realign). Add the multi-iter analogues:
- K = 2*256 + 32 = 544 (PGR=2 large-counter SRD-advance)
- K = 3*256 + 32 = 800 (PGR>=2 large-counter, NLL drained)
Adds 2 K values × PGR={0,1,2} = 6 problem runs.
Header coverage summary and Section 3's banner updated to match the
new content. YAML parses cleanly; existing `Tensile/Tests/unit/`
subtile suite stays green (116 passed, 1 xfailed).
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 210a5bb)
CI hipBLASLt build on PR #7636 fails with: RuntimeError: Unknown assertion key: AssertFree1DivByMT1LowbitGT1 at `Contractions.py:ProblemPredicate.FromOriginalKeyPair` while `parseLibraryLogicData` walks a gfx950 logic YAML. Root cause: PR #7443 ("manual revert KRingShift") removed the parser handlers for `AssertFree1DivByMT1LowbitGT1` / `AssertKRingShiftTailWrapOnly`, and PR #7513 cleaned the matching keys out of the hipBLASLt / hipSPARSELt library logic YAML files. Both reverts are in our branch's history, but origin/develop has new gfx950 logic YAMLs (e.g. `gfx950/gfx950_128cu/Equality/ gfx950_Cijk_Alik_Bljk_HHS_BH_BiasSH_HAS_SAV_UserArgs.yaml`, added by #7125 "update cucount") that were generated *before* the revert pair and therefore still carry the deprecated keys. When CI merges our branch with origin/develop the merged tree contains a YAML the parser can no longer read. Fix: silently ignore the two deprecated keys in `ProblemPredicate.FromOriginalKeyPair`. Matches the spirit of the revert + cleanup pair without requiring follow-up YAML cleanup in develop, and is forward-compatible against any future stale YAML that slips through the same gap. Genuinely unknown `Assert*` keys still raise loudly. New unit tests in `test_Contractions_deprecated_asserts.py` pin the silent-ignore behavior for the two specific keys, verify that unrecognized `Assert*` keys still raise, and check that the existing known `AssertFree0/1ElementMultiple` + `AssertSummationElementMultiple` predicate path is unaffected. Validation: - new tests: 7 passed - full unit suite: 828 passed, 5 skipped, 2 xfailed (was 821 passed) - end-to-end smoke: drove `FromOriginalKeyPair` on a representative stale-key entry set; no `RuntimeError` raised; recognized keys still produce real predicates. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 9d034c9)
Addresses the latest nakajee review comment on PR #7636: > module.add(SMovB32(dst=sgpr("LoopCounterL"), src=0, > comment="single-iter tail: force closeLoop fall-through")) > module.add(self.closeLoop(kernel, tensorParametersA, tensorParametersB, > -1, finalLoop=True)) > > Seems like closeLoop here is not necessary. The subtile tail body processes the entire K_tail in a single pass via the `mmak` loop (every lane mask is emitted against the current `LoopCounterL = K mod DU` snapshot), so `closeLoop(... finalLoop=True)` would only contribute dead code: - per-iter `s_sub_i32 LoopCounterL, ..., MIK` decrement, - `s_cbranch_scc0 TailLoopBeginL` back-edge (no-op since we intentionally underflowed LoopCounterL to negative), - `OrigLoopCounter += MIK` increment (only consumed by the legacy LRO-damage recovery block, which is already bypassed for `UseSubtileImpl=1` kernels inside `closeLoop`), - `TailLoopEndL:` label (nothing branches to it — the only early-exit target is `SkipTailLoopL:`, emitted by the surviving `closeLoop(... emitEndLabelOnly=True)` call). Dropping the `closeLoop(... finalLoop=True)` call also removes the need for the `s_mov_b32 LoopCounterL, 0` workaround that was there purely to force the now-removed back-edge to fall through. Net asm delta per kernel (subtile tail body): - 1 `s_mov_b32 LoopCounterL, 0` - 1 `s_sub_i32 LoopCounterL, LoopCounterL, MIK` - 1 `s_add_u32 OrigLoopCounter, OrigLoopCounter, MIK` - 1 `s_cmp_le_i32 LoopCounterL, 0` - 1 `s_cbranch_scc0 label_TailLoopBeginL` - 1 `label_TailLoopEndL:` label All removed. Test updates: - `test_emits_loop_counter_zero_before_closeloop` (PGR=0) is inverted into `test_omits_closeloop_emit`: asserts that the `s_sub_i32 LoopCounterL ... dec counterL` decrement, the `s_mov_b32 LoopCounterL, 0` workaround, and `TailLoopEndL` are all absent from the tail block. - `test_pgr2_loop_counter_zero_before_closeloop` (PGR=2) is inverted into `test_pgr2_omits_closeloop_emit` with the same set of negative assertions. - `_emitTailLoopScaffoldSubtile` docstring updated to enumerate the three PGR>0 entry-gate paths (origCounter==0 / small / large) and to document that closeLoop is intentionally not emitted by the subtile scaffold. - `TestEmitAllLoopsTail_PGR0` class docstring and `test_omits_tail_when_NoTailLoop_true` docstring updated to note that `TailLoopEndL` is never emitted (was previously only asserted absent under NoTailLoop=True). Full unit suite: 828 passed, 5 skipped, 2 xfailed. Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 11173ca)
Addresses the second-to-last open nakajee review comment on PR #7636: > module.add(SCmpLtU32( > src0=sgpr(savedOrigCounterSgpr), > src1=pgr, > comment="origCounter < PGR?")) > module.add(SCBranchSCC1( > ... > > We should not need special treatment for LoopCounter < 2 case. > SrdA/B/MXSA/MXSB should be already adjusted in preloop. Background ---------- For PGR=2 origCounter==1 the previous tail scaffold emitted a "small-counter realign" block that XOR'd `LocalWriteBaseAddr{A,B, MXSA,MXSB}` back with their `Swap<tc>` masks to re-align LWA with the LDS buffer NLL drained from. The misalignment came from preLoop: GR(MT 0) GR_INC <-- SRD advance + LWA toggle to buf 1 Wait + Sync LR drains MT 0 from buf 0 SkipOp(LE 1, NLL) <-- for origCounter==1, branches here GR(MT 1) writes buf 1 SkipOp(LE 2, NGLL) The `GR_INC`'s LDS toggle still fired even when `SkipOp(LE 1, NLL)` bypassed `GR(MT 1)`, so on the c==1 path NLL ended with LWA at buf 1 while LR was still at buf 0 -- the tail had to XOR LWA back. Fix (per the review) -------------------- Split the compound `GRIncOp` into two ops just for preLoop's use: - `GRPtrIncOp`: SRD pointer advance only. - `GRLDSwapOp`: LDS double-buffer XOR only. The mainloop keeps the compound `GRIncOp` (its two halves are always paired one-to-one with an iteration). preLoop is restructured so the SRD advance fires before `SkipOp(LE 1, NLL)` (the tail still needs SRD at K=DU for c==1) and the LDS swap fires AFTER the SkipOp, right before `GR(MT 1)`: GR(MT 0) gr_ptr_inc (per tensor) <-- SRD advance, always fires Wait + Sync LR drains MT 0 from buf 0 SkipOp(LE 1, NLL) <-- branch over both lds_swap and GR(MT 1) for c==1 gr_lds_swap (per tensor) <-- LDS toggle, skipped for c==1 GR(MT 1) writes buf 1 SkipOp(LE 2, NGLL) With the swap gated by the same SkipOp that gates GR(MT 1), LWA and LR stay aligned for every origCounter: - c==0: upstream SkipSubtileMainLoop gate skips all of preLoop; LWA at initial buf 0, LR untouched. - c==1: SkipOp(LE 1) taken -> no LDS toggle; LWA at buf 0, LR at buf 0 (NLL drained MT 0 from buf 0). - c==2: SkipOp(LE 1) not taken, LDS toggle fires (LWA->buf 1), GR(MT 1) writes buf 1; SkipOp(LE 2, NGLL) taken; NGLL/NLL drain MT 1 from buf 1; LR ends at buf 1. - c>=3: SkipOp(LE 1)/SkipOp(LE 2) both not taken; mainloop runs (c-2) iters, each toggling LWA and LR; iter-pair cancellation keeps LWA and LR aligned at exit. Tail-scaffold simplification ---------------------------- With the preLoop fix in place, the tail's small-counter realign block (`SXorB32 LocalWriteBaseAddr<tc>, ..., Swap<tc>`) becomes unreachable / dead and is removed. The c==0 and small-counter compares also collapse into a single `s_cmp_lt_u32 origCounter, PGR` -- both cases need no SRD/LWA fix-up, only the large-counter case (`origCounter >= PGR`) still needs the existing +1 DU SRD advance to undo the per-iter GR_INC off-by-one at loop exit. Net asm delta per PGR=2 kernel (subtile tail body): - 1 `s_cmp_eq_u32 origCounter, 0` compare [removed] - 1 `s_cbranch_scc1 PGRTailEntry` (c=0 branch) [removed] - `PGRTailSmallCounterRealign<L>:` label [removed] - 4 `s_xor_b32 LocalWriteBaseAddr<tc>, ..., Swap<tc>` (per A/B/MXSA/MXSB) [removed] - 1 `s_branch PGRTailEntry` (large-counter fall-through bypass) [removed] Test updates ------------ - LogicalScheduler.py: new `GRPtrIncOp` and `GRLDSwapOp` dataclasses with `gr_ptr_inc` / `gr_lds_swap` kinds; `build_preloop` PGR=2 branch uses the split pair around `SkipOp(LE 1, NLL)`. - InstructionEmitter.py: new `emit_gr_ptr_inc` (ptr-only) and `emit_gr_lds_swap` (lds-only) handlers wired into the dispatch table. - KernelWriter.py `_emitTailLoopScaffoldSubtile`: small-counter realign block removed; c=0 and `<PGR` compares folded into one `s_cmp_lt_u32 origCounter, PGR`; docstring + inline comments updated. - test_subtile_tailloop_emit.py: invert `test_emits_small_counter_lwa_realign` to assert absence; replace `test_emits_c0_reset_compare_and_branch` (PGR=1 & PGR=2) with `test_emits_lt_pgr_compare_and_branch`; rename `test_emits_small_counter_compare` to `test_emits_lt_pgr_compare_immediate` and update its docstring; refresh class / module docstrings. - test_SubtileBasedLogicalScheduler.py: extend `TestBuildPreloop.test_256x256_fp4` to require both `gr_ptr_inc` and `gr_lds_swap` (and forbid the legacy `gr_inc`); add a strict ordering invariant pinning `gr_lds_swap` AFTER `skip(LE:1:NLL)`. - test_SubtileBasedSchedulerRef.py: refresh the three preloop golden references (256x256 FP4 1x1; 320x320 BF16 1x5 offset1 and offset_all) to show `gr_ptr_inc` ... skip ... `gr_lds_swap` in place of the old `gr_inc` ... skip ... shape. Validation: full unit suite green (828 passed, 5 skipped, 2 xfailed). Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 78e5093)
78e5093 to
c19651b
Compare
Summary
Lifts the subtile-impl K constraint from
K % DepthU == 0toK % 32 == 0bywiring a real K%32 tail loop into the BF16 and MXFP4 subtile emit path. Adds the
host-side scale-tensor fixup that the new tail kernels need on gfx950, plus end-to-end
yaml fixtures and unit tests that exercise the new tail behavior and pre-existing
non-tail behavior.
Before this PR, subtile MX kernels silently rounded K up to a multiple of DepthU
(256 / 512) on the host (
minASEMforMX=256+NoTailLoop=True), so any problemwith
K % DepthU != 0ran on padded data. After this PR, subtile kernels honorthe same ASEM=32 minimum as the rest of the MX path and produce correct results
for every
K % 32 == 0.What's in here
The branch is split into four logically-orderable commits:
subtile: drop minASEMforMX=256 conditional for MX path— removes theconditional that forced ASEM=256 for subtile MX kernels. On its own this
would expose the tail-emit gap; the next commit fills it.
subtile: emit K%32 tail loop for BF16 and MXFP4— the core change.KernelWriter._emitTailLoopScaffoldSubtileemits theLoopCounterLearly-exit, per-lanekPosBase, a re-issuedDepthU-shaped GR + LR for A/B (and MXSA/MXSB), the per-mmak
v_cmp_ge_i32+ per-MFMAv_cndmask_b32lane-mask, and MFMAs intothe existing D accumulators. The body runs as a single pass; the
legacy
closeLoopper-iter decrement is short-circuited by zeroingLoopCounterLbeforehand.snapshot of
OrigLoopCounter(=K // DU) taken beforecalculateLoopNumIterresets it:origCounter == 0: c=0 reset (zero accD, undo preLoopGR_INC SRD-advance + LWA-XOR for PGR>=2), fall through.
0 < origCounter < PGR: small-counter LWA realign (XOR backto buf 0 to match LR; no SRD advance).
origCounter >= PGR: large-counter +1 DU SRD advance to landat
K_aligned.(
s_cmp_eq_u32 LoopCounterL, 0+ branch toSkipMainloop) so thedo-while body is bypassed when
K < DepthU.KernelWriterAssembly.closeLoop's legacy LRO/LWA damage-recoveryreferences
vgprLocalReadAddrA/Bsymbols that subtile kernelsnever define, so it is bypassed for
UseSubtileImpl.DataInitializationre-scatters the canonical DGenscale layout into the padded mxsa/mxsb tensor descriptor (gfx950).
(
MX_SCALE_TILES_PER_VGPR); split per-mmak cmp from per-MFMAcndmask via
_emitTailKPosCmpSubtile/_emitTailLaneMaskApplySubtile; fixSubtileScaleEmit.localReadDoScaleSubtileto uselrLocalSubtileGrid(not the [1,1]-frozenlocalSubtileGrid) soall scale VGPRs get a ds_read on MT >= 128x128.
subtile: add K-tail yaml fixtures and unit-test coverage— covers thenew emit path:
Tensile/Tests/common/gemm/gfx950/subtile_bf16_tail.yaml: BF16 tailacross
K_rem ∈ {16, 32, 48, 64},DepthU ∈ {64, 128},PGR ∈ {0, 1, 2}, StreamK on/off, MT128x128 and MT256x256.subtile_mxfp4_tail.yaml: MXFP4 tail across the partial-MFMA(32, 64, 96, 160, …) and full-last-MFMA (128, 256)
K_rempartitions, DU=256 and DU=512, PGR / StreamK / bias / dest-type
cells (~14 kernels, ~80 problem runs).
subtile_mxfp4_tail_smoke.yaml: a 1-kernel smoke variant.test_subtile_tailloop_emit.py— tail-body content assertions(no SRD rewind on PGR=0, single-iter forcing,
kPosBaseinit,lane mask on A/B/MXSA/MXSB, one
ds_readper scale group, noLDS pre-zero, PGR>0 entry-gate branches).
test_solution_subtile_tailloop.py— solution-level gating(ASEM=32 +
NoTailLoop=Falsefor subtile MX).test_SubtileBasedLogicalScheduler.py— PGR=0 mainloopunderflow pre-guard structural assertions + tail-loop
scaffold placement.
_subtile_tailloop_fixtures.py— shared kwa-builder, kernel-keyshelper, and tile-info population used by the above.
subtile: dedupe scale re-scatter in DataInitialization— extracts theA-side and B-side scale-re-scatter loops in
client/src/DataInitialization.cppinto a single static helper,rearrangePaddedMXScaleLayout. No behavioral change.Scope of impact
K % DepthU != 0: previously broken (hostsilently rounded K up), now correct. New behavior.
K % DepthU == 0: assembled identicallyto before — the tail body is gated on
LoopCounterL != 0and thePGR>0 entry-path snapshot reduces to the existing fast path. Verified
by static assembly diff on representative kernels.
K % DepthU != 0: previously rejectedby gating, now emit and run a tail body.
padded mxsa/mxsb stride genuinely diverges from the canonical DGen
stride; otherwise it short-circuits. Unaffected on non-MX paths.
Test plan
pytest projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_subtile_tailloop_emit.py projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_solution_subtile_tailloop.py projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_SubtileBasedLogicalScheduler.pysubtile_bf16_tail.yaml,subtile_mxfp4_tail.yaml,subtile_mxfp4_tail_smoke.yamlK % DepthU == 0produce byte-identical kernel binaries (or only the entry-gate
snapshot diff).
Submission Checklist