[hipBLASLt] [TensileLite] Add initial tail loop support for Subtile path by bnemanich · Pull Request #7636 · ROCm/rocm-libraries

bnemanich · 2026-05-20T14:25:21Z

Summary

Lifts the subtile-impl K constraint from K % DepthU == 0 to K % 32 == 0 by
wiring a real K%32 tail loop into the BF16 and MXFP4 subtile emit path. Adds the
host-side scale-tensor fixup that the new tail kernels need on gfx950, plus end-to-end
yaml fixtures and unit tests that exercise the new tail behavior and pre-existing
non-tail behavior.
Before this PR, subtile MX kernels silently rounded K up to a multiple of DepthU
(256 / 512) on the host (minASEMforMX=256 + NoTailLoop=True), so any problem
with K % DepthU != 0 ran on padded data. After this PR, subtile kernels honor
the same ASEM=32 minimum as the rest of the MX path and produce correct results
for every K % 32 == 0.
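
The before/after K handling described above can be sketched as follows. This is a minimal host-side model, not TensileLite code: `legacy_host_k` and `split_k` are hypothetical names, and the only facts taken from the summary are the round-up to a multiple of DepthU (old behavior) and the `K % 32 == 0` requirement (new behavior).

```python
def legacy_host_k(k: int, depth_u: int) -> int:
    """Old behavior as described: K is rounded up to a multiple of DepthU."""
    return -(-k // depth_u) * depth_u  # ceiling division, then scale back up


def split_k(k: int, depth_u: int, granularity: int = 32) -> tuple:
    """New behavior as described: full DepthU iterations plus a K % DepthU tail.

    `granularity` models the K % 32 == 0 constraint (hypothetical parameter name).
    """
    if k % granularity != 0:
        raise ValueError(f"K={k} is not a multiple of {granularity}")
    return k // depth_u, k % depth_u  # (mainloop iterations, tail elements)


if __name__ == "__main__":
    for k in (256, 288, 480):
        print(k, legacy_host_k(k, 256), split_k(k, 256))
```

With DepthU = 256, K = 288 used to run as a padded K = 512 problem; in this model it becomes one mainloop iteration plus a 32-element tail.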

What's in here

The branch is split into four logically-orderable commits:

subtile: drop minASEMforMX=256 conditional for MX path — removes the
conditional that forced ASEM=256 for subtile MX kernels. On its own this
would expose the tail-emit gap; the next commit fills it.
subtile: emit K%32 tail loop for BF16 and MXFP4 — the core change.
- New KernelWriter._emitTailLoopScaffoldSubtile emits the
  LoopCounterL early-exit, per-lane kPosBase, a re-issued
  DepthU-shaped GR + LR for A/B (and MXSA/MXSB), the per-mmak
  v_cmp_ge_i32 + per-MFMA v_cndmask_b32 lane-mask, and MFMAs into
  the existing D accumulators. The body runs as a single pass; the
  legacy closeLoop per-iter decrement is short-circuited by zeroing
  LoopCounterL beforehand.
- PGR>0 is gated by three mutually-exclusive entry paths keyed off a
  snapshot of OrigLoopCounter (= K // DU) taken before
  calculateLoopNumIter resets it:
  - origCounter == 0: c=0 reset (zero accD, undo preLoop
    GR_INC SRD-advance + LWA-XOR for PGR>=2), fall through.
  - 0 < origCounter < PGR: small-counter LWA realign (XOR back
    to buf 0 to match LR; no SRD advance).
  - origCounter >= PGR: large-counter +1 DU SRD advance to land
    at K_aligned.
- PGR=0 mainloop gets an underflow pre-guard
  (s_cmp_eq_u32 LoopCounterL, 0 + branch to SkipMainloop) so the
  do-while body is bypassed when K < DepthU.
- KernelWriterAssembly.closeLoop's legacy LRO/LWA damage-recovery
  references vgprLocalReadAddrA/B symbols that subtile kernels
  never define, so it is bypassed for UseSubtileImpl.
- Host-side DataInitialization re-scatters the canonical DGen
  scale layout into the padded mxsa/mxsb tensor descriptor (gfx950).
- Minor touch-ups: name the 4-tiles-per-VGPR scale packing constant
  (MX_SCALE_TILES_PER_VGPR); split per-mmak cmp from per-MFMA
  cndmask via _emitTailKPosCmpSubtile /
  _emitTailLaneMaskApplySubtile; fix
  SubtileScaleEmit.localReadDoScaleSubtile to use
  lrLocalSubtileGrid (not the [1,1]-frozen localSubtileGrid) so
  all scale VGPRs get a ds_read on MT >= 128x128.
subtile: add K-tail yaml fixtures and unit-test coverage — covers the
new emit path:
- Tensile/Tests/common/gemm/gfx950/subtile_bf16_tail.yaml: BF16 tail
  across K_rem ∈ {16, 32, 48, 64}, DepthU ∈ {64, 128},
  PGR ∈ {0, 1, 2}, StreamK on/off, MT128x128 and MT256x256.
- subtile_mxfp4_tail.yaml: MXFP4 tail across the partial-MFMA
  (32, 64, 96, 160, …) and full-last-MFMA (128, 256) K_rem
  partitions, DU=256 and DU=512, PGR / StreamK / bias / dest-type
  cells (~14 kernels, ~80 problem runs).
- subtile_mxfp4_tail_smoke.yaml: a 1-kernel smoke variant.
- Unit tests:
  - test_subtile_tailloop_emit.py — tail-body content assertions
    (no SRD rewind on PGR=0, single-iter forcing, kPosBase init,
    lane mask on A/B/MXSA/MXSB, one ds_read per scale group, no
    LDS pre-zero, PGR>0 entry-gate branches).
  - test_solution_subtile_tailloop.py — solution-level gating
    (ASEM=32 + NoTailLoop=False for subtile MX).
  - test_SubtileBasedLogicalScheduler.py — PGR=0 mainloop
    underflow pre-guard structural assertions + tail-loop
    scaffold placement.
  - _subtile_tailloop_fixtures.py — shared kwa-builder, kernel-keys
    helper, and tile-info population used by the above.
subtile: dedupe scale re-scatter in DataInitialization — extracts the
A-side and B-side scale-re-scatter loops in
client/src/DataInitialization.cpp into a single static helper,
rearrangePaddedMXScaleLayout. No behavioral change.
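
The three PGR>0 entry paths and the PGR=0 pre-guard listed under the second commit can be summarized as a small decision function. This is only a sketch of the selection logic as the summary states it; the function name and the returned strings are hypothetical labels, not symbols emitted by the kernel writer.

```python
def classify_tail_entry(k: int, depth_u: int, pgr: int) -> str:
    """Pick the tail entry path from a snapshot of OrigLoopCounter (= K // DU)."""
    orig_counter = k // depth_u  # snapshot taken before the counter is reset

    if pgr == 0:
        # PGR=0: the underflow pre-guard bypasses the do-while body when K < DepthU.
        return "skip_mainloop" if orig_counter == 0 else "mainloop_then_tail"
    if orig_counter == 0:
        return "c0_reset"                    # zero accD, undo preLoop effects
    if orig_counter < pgr:
        return "small_counter_lwa_realign"   # only reachable for PGR >= 2
    return "large_counter_srd_advance"       # +1 DU SRD advance to K_aligned


if __name__ == "__main__":
    for k, pgr in [(32, 0), (32, 2), (288, 2), (544, 2), (288, 1)]:
        print(k, pgr, classify_tail_entry(k, 256, pgr))
```

Note that for PGR=1 the middle case (`0 < origCounter < PGR`) is empty, so the small-counter path only matters for PGR=2.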

Scope of impact

- Subtile MX kernels with K % DepthU != 0: previously broken (host
  silently rounded K up), now correct. New behavior.
- Subtile MX kernels with K % DepthU == 0: assembled identically
  to before: the tail body is gated on LoopCounterL != 0 and the
  PGR>0 entry-path snapshot reduces to the existing fast path. Verified
  by static assembly diff on representative kernels.
- Subtile BF16 kernels with K % DepthU != 0: previously rejected
  by gating, now emit and run a tail body.
- Non-subtile kernels: untouched.
- Host-side data init: the re-scatter only fires on gfx950 when the
  padded mxsa/mxsb stride genuinely diverges from the canonical DGen
  stride; otherwise it short-circuits. Unaffected on non-MX paths.

Test plan

- CI: existing unit + e2e gfx950 jobs stay green
- New unit tests pass:
  pytest projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_subtile_tailloop_emit.py projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_solution_subtile_tailloop.py projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_SubtileBasedLogicalScheduler.py
- New yaml fixtures run clean on gfx950:
  subtile_bf16_tail.yaml, subtile_mxfp4_tail.yaml,
  subtile_mxfp4_tail_smoke.yaml
- Spot-check: pre-existing subtile MX yamls with K % DepthU == 0
  produce byte-identical kernel binaries (or only the entry-gate
  snapshot diff).


codecov-commenter · 2026-05-20T16:31:53Z

Codecov Report

❌ Patch coverage is 91.20370% with 19 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py	94.01%	9 Missing and 1 partial ⚠️
...ects/hipblaslt/tensilelite/Tensile/ClientWriter.py	0.00%	6 Missing ⚠️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py	60.00%	2 Missing ⚠️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py	0.00%	1 Missing ⚠️

❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #7636      +/-   ##
===========================================
+ Coverage    61.95%   61.96%   +0.01%     
===========================================
  Files         2086     2086              
  Lines       357070   357436     +366     
  Branches     53779    53884     +105     
===========================================
+ Hits        221221   221469     +248     
- Misses      117055   117132      +77     
- Partials     18794    18835      +41

Flag	Coverage Δ		*Carryforward flag
TensileLite	`26.73% <91.20%> (+0.29%)`	⬆️
hipBLAS	`90.65% <ø> (ø)`		Carriedforward from e0ae7a1
hipBLASLt	`41.27% <ø> (ø)`		Carriedforward from e0ae7a1
hipCUB	`82.21% <ø> (ø)`		Carriedforward from e0ae7a1
hipDNN	`85.87% <ø> (ø)`		Carriedforward from e0ae7a1
hipFFT	`51.12% <ø> (ø)`		Carriedforward from e0ae7a1
hipRAND	`76.12% <ø> (ø)`		Carriedforward from e0ae7a1
hipSOLVER	`69.24% <ø> (ø)`		Carriedforward from e0ae7a1
hipSPARSE	`85.09% <ø> (ø)`		Carriedforward from e0ae7a1
rocBLAS	`48.11% <ø> (ø)`		Carriedforward from e0ae7a1
rocFFT	`52.07% <ø> (ø)`		Carriedforward from e0ae7a1
rocRAND	`57.03% <ø> (ø)`		Carriedforward from e0ae7a1
rocSOLVER	`77.83% <ø> (ø)`		Carriedforward from e0ae7a1
rocSPARSE	`72.68% <ø> (ø)`		Carriedforward from e0ae7a1

*This pull request uses carry forward flags.

Files with missing lines	Coverage Δ
...slt/tensilelite/Tensile/Common/GlobalParameters.py	`85.00% <100.00%> (+0.14%)`	⬆️
...e/Tensile/Components/Subtile/InstructionEmitter.py	`96.13% <100.00%> (+3.22%)`	⬆️
...ite/Tensile/Components/Subtile/LogicalScheduler.py	`88.27% <100.00%> (+0.72%)`	⬆️
...ite/Tensile/Components/Subtile/SubtileScaleEmit.py	`35.83% <100.00%> (+7.33%)`	⬆️
...ects/hipblaslt/tensilelite/Tensile/Contractions.py	`71.30% <100.00%> (+1.81%)`	⬆️
...lt/tensilelite/Tensile/SolutionStructs/Solution.py	`10.26% <100.00%> (+2.69%)`	⬆️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py	`7.89% <0.00%> (+0.34%)`	⬆️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py	`68.27% <60.00%> (+4.53%)`	⬆️
...ects/hipblaslt/tensilelite/Tensile/ClientWriter.py	`37.57% <0.00%> (-0.29%)`	⬇️
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py	`11.67% <94.01%> (+2.09%)`	⬆️

... and 10 files with indirect coverage changes


nakajee · 2026-05-20T16:36:23Z

It does not have code to skip the first prefetch GR for the sgprLoopCounterL=0 case.

nakajee · 2026-05-20T16:37:46Z

I think it might be good to have an option to put some non-zero value in the MX swizzle padding.
Otherwise, we might not be able to confirm whether the A/B masking is correct.

nakajee · 2026-05-20T17:00:51Z

+        if kernel["ProblemType"].get("MXBlockB", 0) > 0:
+          realignTcs.append('MXSB')
+
+        c0ResetLabel = Label("PGRTailC0Reset%s" % loopChar, "")


We should not need any special treatment for OrigCount==0 case in tailloop.
Init acc should be done in preloop.

nakajee · 2026-05-20T17:03:04Z

+          src0=sgpr(savedOrigCounterSgpr),
+          src1=pgr,
+          comment="origCounter < PGR?"))
+        module.add(SCBranchSCC1(


We should not need special treatment for LoopCounter < 2 case.
SrdA/B/MXSA/MXSB should be already adjusted in preloop.

nakajee · 2026-05-20T17:05:15Z

+
+      kPosBaseVgpr = self.vgprPool.checkOut(1, "kReg_first")
+      with self.allocTmpSgpr(1) as tmpSgprInfo:
+        module.add(vectorStaticRemainder(-1, kPosBaseVgpr, "Serial",


We can put this after GR and before wait.

nakajee · 2026-05-20T17:07:42Z

+    module = Module("tailLaneMaskApplySubtile")
+    laneSGPRCount = self.states.laneSGPRCount
+    for idx in aTile.regList.indices:
+      module.add(VCndMaskB32(dst=vgpr(idx), src0=vgpr(idx), src1=0,


We do not need to generate cndmask against the same vgpr more than once.
Please add a done check or something similar to avoid generating cndmask for the same vgpr twice or more.

nakajee · 2026-05-20T18:40:02Z

+      for tailTile in tailAllocTiles:
+        tailTile.allocVgprTileRegisters_legacy(self, kernel)
+
+      # Re-issue one DepthU-shaped GR + LR. Byte-layout identical to a


We should adjust SrdA/B/MXSA/MXSB + 2 to avoid out of range access.

nakajee · 2026-05-20T19:51:09Z

+          # K_rem=32 (tail-only, K << DU)
+          - Exact: [128, 128, 1,  32]
+          # K_rem=256 (2 full last MFMAs, K_rem mod 128 == 0)
+          - Exact: [128, 128, 1, 256]


Would you please add more sizes for DU512?
(similar coverage as section 2 above)

nakajee · 2026-05-20T19:51:55Z

+          # Decorative: batched + partial-MFMA tail
+          - Exact: [128, 128, 2, 160]
+          # Decorative: odd M,N + smallest tail
+          - Exact: [ 63,  63, 1,  32]


Could we have some more iteration cases?

- 2 iterations + Tail
- 3 iterations + Tail

Followups on top of "subtile: address PR review for K-tail loop" addressing the remaining nakajee review comments on PR #7636 and fixing the unit-test failures introduced by the prior commit.

1. Comment 1 (no special OrigCounter==0 treatment in tail loop): move the c=0 skip gate fully upstream into `kernelBodySubtile`. The new `SkipSubtileMainLoop<L>` gate wraps the scheduler-emitted mainLoop (preLoop + mainloop + NGLL + NLL) and bypasses all of them when `OrigLoopCounter == 0`. With the gate in place, SRDs stay at K=0, LWA/LRA stay at buf 0, and accD stays at zero from `initVgprTilesToZero`, so the tail body just runs from setupNewTile's defaults -- no undo needed.

   The legacy `PGRTailC0Reset<L>` block in `_emitTailLoopScaffoldSubtile` (which used to zero accD, subtract one DU from each Srd<tc>, and XOR LWA back to buf 0 for PGR>=2) is removed entirely. The c=0 compare now branches straight into `PGRTailEntry<L>`.

   The legacy `SkipPreLoopGR<L>` gate that lived in the obsolete `_kernelBody` path is removed (subtile kernels never go through `kernelBody`; they use `kernelBodySubtile`).

2. Comment 3 (move kPosBase setup after GR, before wait): relocate the per-lane `kPosBase = tidInK * numMIInUnroll` vector remainder / divide / multiply chain to fire between the tail's global-read issue and the post-GR `s_waitcnt`. The vector math now overlaps with buffer-load memory latency instead of serializing in front of it.

3. Comment 4 (dedupe per-MFMA cndmask): the previous tail body emitted one `v_cndmask_b32` per (mmak, mma1, mma0) MFMA over each ValuA/B input vgpr. Because ValuA depends only on `(mma0, mmak)` and ValuB only on `(mma1, mmak)`, the same A vgprs were cndmasked `len(mma1)` times and the same B vgprs `len(mma0)` times per mmak. The cndmask chain is hoisted out of the inner (mma1, mma0) MFMA grid into the per-mmak block, with a `seenVgpr` set tracking which ValuA/B/MXSA/MXSB vgprs have already been masked. Each unique vgpr now gets exactly one `v_cndmask_b32` per mmak. The downstream MFMA grid runs cndmask-free. The now-redundant `_emitTailLaneMaskApplySubtile` helper is removed (its body is inlined into the scaffold).

4. Comment 5 (clamp Srd<tc>+2 at tail entry): documented as a follow-up TODO at the tail GR site. The buffer NumRecords field is a single linear-byte limit, so a tight per-row K clamp is only achievable for the last M-row of the tile. The per-MFMA lane mask remains the actual correctness mechanism for in-range M rows. Filed for a follow-up PR.

5. CI fix: update `test_subtile_tailloop_emit.py`:
   - `test_emits_c0_reset_compare_and_branch` (PGR=1, PGR=2): now asserts the c=0 compare branches to `PGRTailEntry<L>` (not `PGRTailC0Reset<L>`) and that the legacy reset block is absent.
   - Replace `test_emits_c0_srd_subtract_with_borrow`, `test_emits_c0_lwa_xor_realign`, `test_emits_c0_reset_label_and_accD_zero`, `test_omits_c0_srd_subtract_pgr1`, `test_emits_c0_accD_zero_pgr1`, `_extract_c0_block` with negative-assertion tests `test_omits_c0_reset_label`, `test_omits_c0_srd_subtract`, `test_omits_c0_lwa_xor_undo`, `test_omits_c0_reset_label_pgr1`, `test_omits_c0_undo_instructions_pgr1` that pin the legacy block / instructions as absent.

Validation: full unit suite green (`pytest Tensile/Tests/unit/`: 821 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com>
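
The cndmask dedupe from Comment 4 can be illustrated with a small counting model. This is a sketch under simplifying assumptions: each ValuA/ValuB operand is treated as one register id keyed by the indices the commit message says it depends on, and `count_cndmasks` is a hypothetical helper, not part of the kernel writer.

```python
from itertools import product


def count_cndmasks(n_mmak: int, n_mma1: int, n_mma0: int, dedupe: bool) -> int:
    """Count v_cndmask_b32 emissions over the (mmak, mma1, mma0) MFMA grid."""
    emitted = 0
    for mmak in range(n_mmak):
        seen_vgpr = set()  # reset per mmak, mirroring the per-mmak block
        for mma1, mma0 in product(range(n_mma1), range(n_mma0)):
            # ValuA depends only on (mma0, mmak); ValuB only on (mma1, mmak).
            operands = (("ValuA", mma0, mmak), ("ValuB", mma1, mmak))
            for vgpr in operands:
                if dedupe and vgpr in seen_vgpr:
                    continue  # already masked for this mmak
                seen_vgpr.add(vgpr)
                emitted += 1
    return emitted


if __name__ == "__main__":
    print(count_cndmasks(2, 4, 4, dedupe=False))  # one mask per MFMA operand
    print(count_cndmasks(2, 4, 4, dedupe=True))   # one mask per unique operand
```

In this model a 2 x 4 x 4 grid drops from 64 masks to 16, that is from `n_mmak * n_mma1 * n_mma0 * 2` to `n_mmak * (n_mma0 + n_mma1)`.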

CI hipBLASLt build on PR #7636 fails with:

    RuntimeError: Unknown assertion key: AssertFree1DivByMT1LowbitGT1

at `Contractions.py:ProblemPredicate.FromOriginalKeyPair` while `parseLibraryLogicData` walks a gfx950 logic YAML.

Root cause: PR #7443 ("manual revert KRingShift") removed the parser handlers for `AssertFree1DivByMT1LowbitGT1` / `AssertKRingShiftTailWrapOnly`, and PR #7513 cleaned the matching keys out of the hipBLASLt / hipSPARSELt library logic YAML files. Both reverts are in our branch's history, but origin/develop has new gfx950 logic YAMLs (e.g. `gfx950/gfx950_128cu/Equality/gfx950_Cijk_Alik_Bljk_HHS_BH_BiasSH_HAS_SAV_UserArgs.yaml`, added by #7125 "update cucount") that were generated *before* the revert pair and therefore still carry the deprecated keys. When CI merges our branch with origin/develop the merged tree contains a YAML the parser can no longer read.

Fix: silently ignore the two deprecated keys in `ProblemPredicate.FromOriginalKeyPair`. Matches the spirit of the revert + cleanup pair without requiring follow-up YAML cleanup in develop, and is forward-compatible against any future stale YAML that slips through the same gap. Genuinely unknown `Assert*` keys still raise loudly.

New unit tests in `test_Contractions_deprecated_asserts.py` pin the silent-ignore behavior for the two specific keys, verify that unrecognized `Assert*` keys still raise, and check that the existing known `AssertFree0/1ElementMultiple` + `AssertSummationElementMultiple` predicate path is unaffected.

Validation:
- new tests: 7 passed
- full unit suite: 828 passed, 5 skipped, 2 xfailed (was 821 passed)
- end-to-end smoke: drove `FromOriginalKeyPair` on a representative stale-key entry set; no `RuntimeError` raised; recognized keys still produce real predicates.

Co-authored-by: Cursor <cursoragent@cursor.com>
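
The silent-ignore rule can be sketched as below. This is an illustration of the described policy only: `parse_assert_key` and the tuple it returns are hypothetical stand-ins, and the real `ProblemPredicate.FromOriginalKeyPair` signature and return type are not shown in this thread.

```python
# The two deprecated keys named in the commit message.
DEPRECATED_ASSERT_KEYS = frozenset({
    "AssertFree1DivByMT1LowbitGT1",
    "AssertKRingShiftTailWrapOnly",
})

# A few keys the commit message names as still recognized.
KNOWN_ASSERT_KEYS = frozenset({
    "AssertFree0ElementMultiple",
    "AssertFree1ElementMultiple",
    "AssertSummationElementMultiple",
})


def parse_assert_key(key: str, value):
    """Return a (key, value) predicate stand-in, or None when the key is ignored."""
    if key in DEPRECATED_ASSERT_KEYS:
        return None  # stale YAML entry: drop it silently
    if key in KNOWN_ASSERT_KEYS:
        return (key, value)
    if key.startswith("Assert"):
        # Genuinely unknown Assert* keys must still fail loudly.
        raise RuntimeError(f"Unknown assertion key: {key}")
    return None  # non-assert keys are outside the scope of this sketch


if __name__ == "__main__":
    print(parse_assert_key("AssertKRingShiftTailWrapOnly", True))
    print(parse_assert_key("AssertFree0ElementMultiple", 8))
```

Keeping the deprecated set explicit, rather than ignoring every unknown `Assert*` key, preserves the loud failure for typos and genuinely unsupported predicates.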

Addresses the latest nakajee review comment on PR #7636:

> module.add(SMovB32(dst=sgpr("LoopCounterL"), src=0,
> comment="single-iter tail: force closeLoop fall-through"))
> module.add(self.closeLoop(kernel, tensorParametersA, tensorParametersB,
> -1, finalLoop=True))
>
> Seems like closeLoop here is not necessary.

The subtile tail body processes the entire K_tail in a single pass via the `mmak` loop (every lane mask is emitted against the current `LoopCounterL = K mod DU` snapshot), so `closeLoop(... finalLoop=True)` would only contribute dead code:

- per-iter `s_sub_i32 LoopCounterL, ..., MIK` decrement,
- `s_cbranch_scc0 TailLoopBeginL` back-edge (no-op since we intentionally underflowed LoopCounterL to negative),
- `OrigLoopCounter += MIK` increment (only consumed by the legacy LRO-damage recovery block, which is already bypassed for `UseSubtileImpl=1` kernels inside `closeLoop`),
- `TailLoopEndL:` label (nothing branches to it; the only early-exit target is `SkipTailLoopL:`, emitted by the surviving `closeLoop(... emitEndLabelOnly=True)` call).

Dropping the `closeLoop(... finalLoop=True)` call also removes the need for the `s_mov_b32 LoopCounterL, 0` workaround that was there purely to force the now-removed back-edge to fall through.

Net asm delta per kernel (subtile tail body):

- 1 `s_mov_b32 LoopCounterL, 0`
- 1 `s_sub_i32 LoopCounterL, LoopCounterL, MIK`
- 1 `s_add_u32 OrigLoopCounter, OrigLoopCounter, MIK`
- 1 `s_cmp_le_i32 LoopCounterL, 0`
- 1 `s_cbranch_scc0 label_TailLoopBeginL`
- 1 `label_TailLoopEndL:` label

All removed.

Test updates:

- `test_emits_loop_counter_zero_before_closeloop` (PGR=0) is inverted into `test_omits_closeloop_emit`: asserts that the `s_sub_i32 LoopCounterL ... dec counterL` decrement, the `s_mov_b32 LoopCounterL, 0` workaround, and `TailLoopEndL` are all absent from the tail block.
- `test_pgr2_loop_counter_zero_before_closeloop` (PGR=2) is inverted into `test_pgr2_omits_closeloop_emit` with the same set of negative assertions.
- `_emitTailLoopScaffoldSubtile` docstring updated to enumerate the three PGR>0 entry-gate paths (origCounter==0 / small / large) and to document that closeLoop is intentionally not emitted by the subtile scaffold.
- `TestEmitAllLoopsTail_PGR0` class docstring and `test_omits_tail_when_NoTailLoop_true` docstring updated to note that `TailLoopEndL` is never emitted (was previously only asserted absent under NoTailLoop=True).

Full unit suite: 828 passed, 5 skipped, 2 xfailed.

Co-authored-by: Cursor <cursoragent@cursor.com>

Addresses the second-to-last open nakajee review comment on PR #7636:

> module.add(SCmpLtU32(
> src0=sgpr(savedOrigCounterSgpr),
> src1=pgr,
> comment="origCounter < PGR?"))
> module.add(SCBranchSCC1(
> ...
>
> We should not need special treatment for LoopCounter < 2 case.
> SrdA/B/MXSA/MXSB should be already adjusted in preloop.

Background
----------

For PGR=2 origCounter==1 the previous tail scaffold emitted a "small-counter realign" block that XOR'd `LocalWriteBaseAddr{A,B,MXSA,MXSB}` back with their `Swap<tc>` masks to re-align LWA with the LDS buffer NLL drained from. The misalignment came from preLoop:

    GR(MT 0)
    GR_INC                     <-- SRD advance + LWA toggle to buf 1
    Wait + Sync
    LR drains MT 0 from buf 0
    SkipOp(LE 1, NLL)          <-- for origCounter==1, branches here
    GR(MT 1) writes buf 1
    SkipOp(LE 2, NGLL)

The `GR_INC`'s LDS toggle still fired even when `SkipOp(LE 1, NLL)` bypassed `GR(MT 1)`, so on the c==1 path NLL ended with LWA at buf 1 while LR was still at buf 0 -- the tail had to XOR LWA back.

Fix (per the review)
--------------------

Split the compound `GRIncOp` into two ops just for preLoop's use:

- `GRPtrIncOp`: SRD pointer advance only.
- `GRLDSwapOp`: LDS double-buffer XOR only.

The mainloop keeps the compound `GRIncOp` (its two halves are always paired one-to-one with an iteration). preLoop is restructured so the SRD advance fires before `SkipOp(LE 1, NLL)` (the tail still needs SRD at K=DU for c==1) and the LDS swap fires AFTER the SkipOp, right before `GR(MT 1)`:

    GR(MT 0)
    gr_ptr_inc (per tensor)    <-- SRD advance, always fires
    Wait + Sync
    LR drains MT 0 from buf 0
    SkipOp(LE 1, NLL)          <-- branch over both lds_swap and GR(MT 1) for c==1
    gr_lds_swap (per tensor)   <-- LDS toggle, skipped for c==1
    GR(MT 1) writes buf 1
    SkipOp(LE 2, NGLL)

With the swap gated by the same SkipOp that gates GR(MT 1), LWA and LR stay aligned for every origCounter:

- c==0: upstream SkipSubtileMainLoop gate skips all of preLoop; LWA at initial buf 0, LR untouched.
- c==1: SkipOp(LE 1) taken -> no LDS toggle; LWA at buf 0, LR at buf 0 (NLL drained MT 0 from buf 0).
- c==2: SkipOp(LE 1) not taken, LDS toggle fires (LWA->buf 1), GR(MT 1) writes buf 1; SkipOp(LE 2, NGLL) taken; NGLL/NLL drain MT 1 from buf 1; LR ends at buf 1.
- c>=3: SkipOp(LE 1)/SkipOp(LE 2) both not taken; mainloop runs (c-2) iters, each toggling LWA and LR; iter-pair cancellation keeps LWA and LR aligned at exit.

Tail-scaffold simplification
----------------------------

With the preLoop fix in place, the tail's small-counter realign block (`SXorB32 LocalWriteBaseAddr<tc>, ..., Swap<tc>`) becomes unreachable / dead and is removed. The c==0 and small-counter compares also collapse into a single `s_cmp_lt_u32 origCounter, PGR` -- both cases need no SRD/LWA fix-up, only the large-counter case (`origCounter >= PGR`) still needs the existing +1 DU SRD advance to undo the per-iter GR_INC off-by-one at loop exit.

Net asm delta per PGR=2 kernel (subtile tail body):

- 1 `s_cmp_eq_u32 origCounter, 0` compare [removed]
- 1 `s_cbranch_scc1 PGRTailEntry` (c=0 branch) [removed]
- `PGRTailSmallCounterRealign<L>:` label [removed]
- 4 `s_xor_b32 LocalWriteBaseAddr<tc>, ..., Swap<tc>` (per A/B/MXSA/MXSB) [removed]
- 1 `s_branch PGRTailEntry` (large-counter fall-through bypass) [removed]

Test updates
------------

- LogicalScheduler.py: new `GRPtrIncOp` and `GRLDSwapOp` dataclasses with `gr_ptr_inc` / `gr_lds_swap` kinds; `build_preloop` PGR=2 branch uses the split pair around `SkipOp(LE 1, NLL)`.
- InstructionEmitter.py: new `emit_gr_ptr_inc` (ptr-only) and `emit_gr_lds_swap` (lds-only) handlers wired into the dispatch table.
- KernelWriter.py `_emitTailLoopScaffoldSubtile`: small-counter realign block removed; c=0 and `<PGR` compares folded into one `s_cmp_lt_u32 origCounter, PGR`; docstring + inline comments updated.
- test_subtile_tailloop_emit.py: invert `test_emits_small_counter_lwa_realign` to assert absence; replace `test_emits_c0_reset_compare_and_branch` (PGR=1 & PGR=2) with `test_emits_lt_pgr_compare_and_branch`; rename `test_emits_small_counter_compare` to `test_emits_lt_pgr_compare_immediate` and update its docstring; refresh class / module docstrings.
- test_SubtileBasedLogicalScheduler.py: extend `TestBuildPreloop.test_256x256_fp4` to require both `gr_ptr_inc` and `gr_lds_swap` (and forbid the legacy `gr_inc`); add a strict ordering invariant pinning `gr_lds_swap` AFTER `skip(LE:1:NLL)`.
- test_SubtileBasedSchedulerRef.py: refresh the three preloop golden references (256x256 FP4 1x1; 320x320 BF16 1x5 offset1 and offset_all) to show `gr_ptr_inc` ... skip ... `gr_lds_swap` in place of the old `gr_inc` ... skip ... shape.

Validation: full unit suite green (828 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com>
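
The per-origCounter alignment argument above can be checked with a toy model of the PGR=2 preLoop. This is a sketch, not the scheduler: buffers are reduced to a 0/1 index, `buffers_at_tail_entry` is a hypothetical name, and the c==0 case assumes the upstream `SkipSubtileMainLoop` gate is already in place for both variants.

```python
def buffers_at_tail_entry(c: int, split_swap: bool) -> tuple:
    """Return (LWA buffer, LR buffer) at tail entry for origCounter == c, PGR=2.

    split_swap=False models the compound GR_INC (SRD advance + LDS toggle
    together); split_swap=True models gr_ptr_inc before the SkipOp and
    gr_lds_swap after it.
    """
    lwa, lr = 0, 0
    if c == 0:
        return lwa, lr   # upstream gate skips all of preLoop
    if not split_swap:
        lwa ^= 1         # legacy GR_INC toggles LDS before SkipOp(LE 1, NLL)
    if c == 1:
        return lwa, lr   # SkipOp(LE 1, NLL) taken; NLL drains MT 0 from buf 0
    if split_swap:
        lwa ^= 1         # gr_lds_swap fires only when GR(MT 1) also fires
    lr ^= 1              # MT 1 is eventually drained from buf 1
    for _ in range(c - 2):
        lwa ^= 1         # each mainloop iteration toggles both sides
        lr ^= 1
    return lwa, lr


if __name__ == "__main__":
    for c in range(5):
        print(c, buffers_at_tail_entry(c, False), buffers_at_tail_entry(c, True))
```

In this model only the legacy c==1 case ends with LWA and LR on different buffers, which matches the case the removed realign block existed to repair.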

Addresses nakajee's open OOR review on PR #7661 (carried forward from PR #7636 "Comment 5" TODO at the tail GR site):

> This still does not have any prevention for out of array access.
> What we need is set SrdA/B/MXSA/MXSB + 2 to the exact end of array
> (but 4 byte alignment). [...]
> remainK = (k%DepthU); remainKalign = remainK & 0xfffffffe;
> SrdA/B -= (DepthU - remainKalign) * bpe

Before this commit, the subtile K-tail re-issued a DepthU-shaped GR against an SRD whose NumRecords (Srd<tc>+2) still spanned the full DepthU bytes after the last live K-element. For K_remain < DepthU, the last m-row's per-thread `buffer_load_b128` past `K_remain*bpe` could read past A/B's allocated end-of-array (buffer-OOB does NOT bail on past-allocation reads; only on past-NumRecords reads). The per-MFMA lane mask + sub-lane refine zeroed those VGPRs after the load, so values were correct, but the buffer engine could still touch unmapped pages and trigger an HSA fault on non-contiguous A/B allocations.

New `_emitTailSrdTightenSubtile` runs at tail entry (right after the PGR>0 entry-gate `PGRTailEntry<L>` label, before `openLoop`) and shrinks each `Srd<tc>+2` by `DepthU*bpe - roundUp(K_remain*bpe, loadBytesGR)`.

The `roundUp(..., loadBytesGR)` (vs nakajee's literal `& 0xfffffffe`) is what keeps our wide DTL load valid for the trailing odd-K element on the last m-row: the per-thread load is B128 (16 B) for bf16/fp16, so a single thread covers up to 8 K elements -- the load must succeed for the thread that holds `K_remain - 1`. nakajee's literal align-down assumed the narrow-trailing-element strategy (`buffer_load_d16_b16 ... lds` + lane-0-only); that path is rejected by the gfx950 assembler and was previously deleted in `7df7d24` ("remove dead bf16 narrow-load helper"). The align-up variant preserves nakajee's intent (clip past-K reads on the last m-row) without needing the narrow load.

A single runtime `s_cmp_lt_u32 alignedBytes, DepthU*bpe` + `s_cbranch_scc0 TailSrdTightenSkip<L>` short-circuits the SSub chain when `alignedBytes >= DepthU*bpe` (K_remain close to DepthU with wide per-thread loads -- the natural SRD limit already covers every read). The skip label is the join target.

Gating:

- non-MX (`MXBlock{A,B} == 0`): MX scales have host re-scatter padding (`DataInitialization.rearrangePaddedMXScaleLayout`); MX data + MXSA/MXSB SRD tightening needs nakajee's swizzleBlock-aware formula (`remainK_MX = roundUp(remainK / 256) * 256`, `SrdMXSA/B -= (DepthU - remainK_MX) * swizzleSize0 = 32`) and is a separate follow-up.
- non-swizzled A/B (`SwizzleTensor{A,B}=False`): subtile mxfp4 swizzled A/B need the same swizzleBlock formula; same follow-up.
- symmetric per-tensor `bpe in {1, 2}` and matching `loadWidthGR`; bf16 / fp16 / int8 anyk paths are the immediate consumer.

Per-kernel asm delta (bf16 fixture, B128 load, DU=64 -> depthUBytes=128):

    +5 s_* (lshl/add/and/cmp/cbranch)
    +1 s_sub_u32 (delta)
    +2 s_sub_u32 (Srd{A,B}+2 tighten)
    +1 label

8 instructions + 1 label, all inside the tail entry; no per-iter cost. Statically gated to no-op on MX / swizzled / non-bf16 paths.

No SRD restore is needed: the tail body is the last GR site for A/B before the kernel epilogue (epilogue uses SrdC / SrdD).

Stale TODO comment at the tail GR site (which previously said the tightening was filed as a follow-up) is updated to reflect the new state (A/B done here; MX + swizzled deferred with a pointer to nakajee's spec).

Tests (`Tensile/Tests/unit/test_subtile_tailloop_emit.py::TestTailSrdTightenSubtile`, 10 new):

- Pin the emit-time `s_lshl_b32 + s_add_u32 + s_and_b32` aligned-K chain, the runtime no-op `s_cmp_lt_u32 + s_cbranch_scc0` skip, the `s_sub_u32 SrdA+2 / SrdB+2` tightening (with delta precompute), and the `TailSrdTightenSkip<L>:` join label.
- Strict order: branch < SrdA+2 sub < skip label (so the short-circuit actually short-circuits).
- PGR>0 placement: tightening fires AFTER `PGRTailEntry<L>:` so c=0 / small-counter / large-counter paths have all converged onto the same SRD state, then BEFORE the tail GR.
- Negative pins: MX fp4 emits no tightening (MXBlock>0 gate); NoTailLoop=True emits no tightening (scaffold early-returns).

Validation:

- Full unit suite: 887 passed, 5 skipped, 2 xfailed (was 887 / same).
- gfx950 yaml gauntlet on MI355X:

      subtile_bf16.yaml               : 7082 / 7082 PASS
      subtile_bf16_tail.yaml          :  450 /  450 PASS
      subtile_bf16_anyk_k8.yaml       :  183 /  183 PASS
      subtile_bf16_anyk_k2.yaml       :  183 /  183 PASS
      subtile_bf16_anyk_odd.yaml      :  117 /  117 PASS
      subtile_bf16_anyk_largemt.yaml  :    4 /    4 PASS
      subtile_mxfp4.yaml              : 2691 / 2691 PASS
      subtile_mxfp4_tail.yaml         :   68 /   68 PASS
      subtile_mxfp4_tail_smoke.yaml   :               PASS

  Total ~10,778 problem runs, 0 failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
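
The tightening amount from this commit message, `DepthU*bpe - roundUp(K_remain*bpe, loadBytesGR)` with the no-op short-circuit, can be worked through in a few lines. This is arithmetic only, under the stated formula; `tail_srd_tighten_bytes` is a hypothetical helper name, and returning 0 for `K_remain == 0` reflects the assumption that the tail body is not entered in that case.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the next multiple of `multiple`."""
    return -(-x // multiple) * multiple


def tail_srd_tighten_bytes(k: int, depth_u: int, bpe: int, load_bytes: int) -> int:
    """Bytes to subtract from Srd<tc>+2 (NumRecords) at tail entry."""
    k_rem = k % depth_u
    if k_rem == 0:
        return 0  # assumption: no tail body, so nothing to tighten
    depth_u_bytes = depth_u * bpe
    aligned_bytes = round_up(k_rem * bpe, load_bytes)
    if aligned_bytes >= depth_u_bytes:
        return 0  # runtime skip: the natural SRD limit already covers every read
    return depth_u_bytes - aligned_bytes


if __name__ == "__main__":
    # bf16 fixture from the commit message: DU=64, bpe=2, B128 (16-byte) loads.
    for k in (80, 81, 127):
        print(k, tail_srd_tighten_bytes(k, depth_u=64, bpe=2, load_bytes=16))
```

For K = 81 (K_remain = 17) the live bytes are 34, rounded up to 48 so the 16-byte load holding the last element still succeeds, giving a 80-byte reduction; an align-down to 32 bytes would clip that load.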

The subtile branch of minASEMforMX forced ASEM to 256 (= DepthU) for MX kernels, making `NoTailLoop = (ASEM % DepthU == 0)` always True and silently rounding K up to a multiple of DepthU on the host side. With the subtile K%32 tail loop now wired in (subsequent commit), the subtile MX path can use the same ASEM=32 minimum as the rest of the MX path.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 61e236a)

Lifts the subtile K constraint from `K % DepthU == 0` to `K % 32 == 0` by adding a real K%32 tail loop to the subtile emit path.

The tail scaffold lives in `KernelWriter._emitTailLoopScaffoldSubtile`; it emits the LoopCounterL early-exit, per-lane kPosBase, re-issued DepthU-shaped GR + LR for A/B (and MXSA/MXSB), per-mmak `v_cmp_ge_i32` + per-MFMA `v_cndmask_b32` lane-mask, and MFMAs into the existing D accumulators. The body runs as a single pass; the legacy closeLoop's per-iter decrement is short-circuited by zeroing LoopCounterL beforehand.

PGR>0 is gated by three mutually-exclusive entry paths keyed off a snapshot of OrigLoopCounter (= K // DU) taken before `calculateLoopNumIter` resets it:

- origCounter == 0: c=0 reset (zero accD, undo preLoop GR_INC SRD-advance + LWA-XOR for PGR>=2), fall through to tail body.
- 0 < origCounter < PGR: small-counter LWA realign (XOR back to buf 0 to match LR; no SRD advance).
- origCounter >= PGR: large-counter +1 DU SRD advance to land at K_aligned.

The PGR=0 mainloop gets an underflow pre-guard (`s_cmp_eq_u32 LoopCounterL, 0` + branch to SkipMainloop) so the do-while body is bypassed when K < DepthU.

`KernelWriterAssembly.closeLoop`'s legacy LRO/LWA damage-recovery references `vgprLocalReadAddrA/B` symbols subtile kernels never define, so it is bypassed for `UseSubtileImpl`.

The host-side MX scale initializer re-scatters the canonical DGen layout into the padded mxsa/mxsb tensor descriptor on gfx950 so the CPU reference observes the same scale bytes the GPU consumes.

Other touch-ups:

- `Subtile/Kernel.py`: name the 4-tiles-per-VGPR scale packing constant (`MX_SCALE_TILES_PER_VGPR`); split per-mmak setup from per-MFMA cndmask via `_emitTailKPosCmpSubtile` / `_emitTailLaneMaskApplySubtile`; drop the dead `preLoop`.
- `SubtileScaleEmit.localReadDoScaleSubtile`: use `lrLocalSubtileGrid` (not the [1,1]-frozen `localSubtileGrid`) so all scale VGPRs get a ds_read on MT >= 128x128.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit bd0c7f9)
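
The effect of the lane mask described in this commit can be modeled on plain Python lists. This is a sketch under an assumption the commit message does not spell out: the compare is taken to be "this lane's K position >= K_rem", and lanes that satisfy it are replaced by zero before the multiply-accumulate. `mask_tail_lanes` and `tail_dot` are hypothetical names.

```python
def mask_tail_lanes(values, k_positions, k_rem):
    """Zero every lane whose K position falls at or past K_rem.

    Models v_cmp_ge_i32 (build the mask) followed by v_cndmask_b32 (select 0).
    """
    return [0 if k_pos >= k_rem else v for v, k_pos in zip(values, k_positions)]


def tail_dot(a, b, k_rem):
    """Accumulate a DepthU-shaped load where only the first k_rem entries are live."""
    k_positions = range(len(a))
    a_masked = mask_tail_lanes(a, k_positions, k_rem)
    b_masked = mask_tail_lanes(b, k_positions, k_rem)
    return sum(x * y for x, y in zip(a_masked, b_masked))


if __name__ == "__main__":
    # Entries past K_rem = 3 stand in for whatever the full-width load returned.
    a = [1, 2, 3, 99, 99, 99, 99, 99]
    b = [4, 5, 6, 77, 77, 77, 77, 77]
    print(tail_dot(a, b, k_rem=3))
```

Because the load stays DepthU-shaped, correctness for the tail rests on the mask rather than on the load width, which is why later review comments focus on masking coverage.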

End-to-end yaml tests in `Tensile/Tests/common/gemm/gfx950`:

- `subtile_bf16_tail.yaml`: BF16 tail across K_rem in {16, 32, 48, 64}, DepthU in {64, 128}, PGR in {0, 1, 2}, StreamK on/off, MT128x128 and MT256x256.
- `subtile_mxfp4_tail.yaml`: MXFP4 tail across the K_rem partitions (partial-MFMA: 32, 64, 96, 160, ...; full-last-MFMA: 128, 256), DU=256 and DU=512, PGR / StreamK / bias / dest-type cells (~14 kernels, ~80 problem runs).
- `subtile_mxfp4_tail_smoke.yaml`: 1-kernel smoke variant.

Unit tests in `Tensile/Tests/unit`:

- `_subtile_tailloop_fixtures.py`: shared kwa-builder, kernel-keys helper, tile-info population, and skiptoend wrapper used by `test_subtile_tailloop_emit.py` and the new tail-emit class in `test_SubtileBasedLogicalScheduler.py`.
- `test_subtile_tailloop_emit.py`: tail-body content assertions (no SRD rewind on PGR=0, single-iter forcing, kPosBase init, lane mask on A/B/MXSA/MXSB, one ds_read per scale group, no LDS pre-zero, PGR>0 entry-gate branches).
- `test_solution_subtile_tailloop.py`: solution-level gating (ASEM=32 + NoTailLoop=False for subtile MX).
- `test_SubtileBasedLogicalScheduler.py`: PGR=0 mainloop underflow pre-guard structural assertions + tail-loop scaffold placement.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 4fb7e10)

The transA=true (A) and transB=false (B) code paths each had a copy of the same loop that re-scatters DGen's canonical scale bytes into the padded mxsa/mxsb descriptor strides. Extract the shared body into a single static helper, `rearrangePaddedMXScaleLayout`, and call it from both sides. No behavioral change; the helper preserves the existing needsRearrange short-circuit and the per-batch byte layout.

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 2a34490)
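
As a rough illustration of what "re-scattering canonical scale bytes into padded strides" means, here is a minimal Python model. It is not the client's C++ helper: the function name, its parameters, and the row-major interpretation are all assumptions made for the sketch, and per-batch handling is left out.

```python
def rearrange_padded_scale_layout(dense, rows, row_len, padded_stride, pad_byte=0):
    """Copy `rows` dense rows of `row_len` bytes into rows of `padded_stride` bytes.

    Hypothetical model only; the real helper's signature is not shown in the PR text.
    """
    if padded_stride < row_len:
        raise ValueError("padded stride must be at least the dense row length")
    if padded_stride == row_len:
        # Analogue of the needsRearrange short-circuit: nothing to scatter.
        return bytes(dense)
    out = bytearray([pad_byte]) * (rows * padded_stride)  # pre-fill the padding
    for r in range(rows):
        src = r * row_len
        dst = r * padded_stride
        out[dst:dst + row_len] = dense[src:src + row_len]
    return bytes(out)


dense = bytes([1, 2, 3, 4, 5, 6])  # two dense rows of three scale bytes
padded = rearrange_padded_scale_layout(dense, rows=2, row_len=3,
                                       padded_stride=4, pad_byte=0xFF)
```

Sharing one such routine between the A and B sides keeps the two layouts from drifting apart, which is the stated motivation for the extraction.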

Solution.py rejects every subtile solution with `StreamK == 0` (`"UseSubtileImpl=1 supports StreamK only (no support for GSU)"`). Sections 2-6 of `subtile_mxfp4_tail.yaml` were pinned to `StreamK: [0]` and therefore produced 0 valid solutions at SolutionStructs, failing CI with `SystemExit: -1` after section 1 ran. Section 1 was hiding the bug behind `StreamK: [0, 3]`, which silently drops the SK=0 half.

Change sections 2-6 to `StreamK: [3]` and update the header / per-section comments to reflect the SK=0 silent filtering in section 1 and the SK=3-only reality elsewhere. No other yaml content changes.

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 1d68fc2)
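
The failure mode above is easy to reproduce in miniature. The sketch below is a hypothetical stand-in for the rejection in Solution.py (the function name and the list-of-values interface are assumptions, not Tensile's API); it shows why `StreamK: [0, 3]` appears to work while `StreamK: [0]` leaves nothing to run.

```python
def surviving_streamk_values(streamk_values, use_subtile_impl=True):
    """Drop StreamK == 0 entries for subtile solutions (hypothetical model)."""
    if not use_subtile_impl:
        return list(streamk_values)
    return [sk for sk in streamk_values if sk != 0]


section1 = surviving_streamk_values([0, 3])  # the SK=0 half is silently dropped
section2 = surviving_streamk_values([0])     # nothing survives -> CI failure
fixed = surviving_streamk_values([3])        # the corrected sections 2-6
```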

Two reviewer-requested improvements on top of the K%32 tail-loop work:

1. Skip the preLoop GR block when origCounter == 0. `_kernelBody` now wraps the `if PrefetchGlobalRead:` block with an `OrigLoopCounter == 0` gate (Subtile + PGR>0 only) that branches past the wait-for-PGR, preLoop local write, LWA swap, and PGR>=2 prefetch GR/GR_INC sequence to a fresh `SkipPreLoopGR<loopChar>` label placed just before the unrolled loop opens. With the gate, SRDs and LWAs stay at `setupNewTile`'s defaults when K < DepthU, so the tail scaffold's c=0 reset path collapses to a single `initVgprTilesToZero` -- the previous PGR>=2 "undo preLoop GR_INC" SRD subtract + LWA XOR loops are removed. The tail body's re-issued GR + LR runs from the same setupNewTile state on every K_rem < DepthU problem regardless of PGR.

2. Configurable MX scale padding fill byte. New `GlobalParameters["MXScalePadByte"]` (default 0, mirrors current behaviour) plumbed through:
   - `ClientWriter.py` -> `--mx-scale-pad-byte <N>` (only emitted when non-zero, keeps run.sh and CI calls unchanged in default configurations).
   - `client/main.cpp` registers `--mx-scale-pad-byte` (0..255).
   - `DataInitialization` stores `m_mxScalePadByte` and uses it for the bulk `memset` of `pristineE8A/B.cpuInput.valid`, for the inner `memset` in `rearrangePaddedMXScaleLayout`, and for the initial fill of the preswizzled `gpuScaleBuf`.

   Setting `MXScalePadByte: 255` (E8M0 NaN/Inf) seeds every byte of MX scale padding with a poison value; any A/B lane-mask leak in the GPU kernel into the padded region then propagates inf/NaN through the MFMA and trips validation instead of silently being multiplied by 0.

Validation: subtile_mxfp4_tail.yaml passes locally with the default (MXScalePadByte=0, exercises Comment 1's gate across PGR=[0,1,2] and K_rem in [32, 96, 128, 160, 224, 288, 384]) and with MXScalePadByte=255 (50/50 problem runs PASSED, confirming the kernel's per-MFMA lane mask correctly suppresses padded scale lanes).

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 24369f3)
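
To see why a pad byte of 255 is a useful poison value, it helps to decode the scale bytes. The sketch below assumes the usual E8M0 convention (an 8-bit biased exponent with bias 127 and 0xFF reserved for NaN); `decode_e8m0` and `scaled_product` are hypothetical helpers for illustration and are not part of the client. Under that assumption a pad byte of 0 decodes to a vanishingly small scale, so a leak contributes almost nothing, whereas 0xFF makes the leak unmistakable.

```python
import math


def decode_e8m0(byte):
    """Decode one E8M0 scale byte as 2**(byte - 127); 0xFF is treated as NaN."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("E8M0 scale must fit in one byte")
    if byte == 0xFF:
        return math.nan
    return 2.0 ** (byte - 127)


def scaled_product(a, a_scale_byte, b, b_scale_byte):
    # Hypothetical single-element stand-in for one scaled MFMA term.
    return (a * decode_e8m0(a_scale_byte)) * (b * decode_e8m0(b_scale_byte))


in_range = scaled_product(1.5, 127, 2.0, 128)      # scales 1.0 and 2.0
leak_pad0 = scaled_product(1.5, 0, 2.0, 127)       # leaked read of a 0 pad byte
leak_pad255 = scaled_product(1.5, 0xFF, 2.0, 127)  # leaked read of a 0xFF pad byte
```

In this model `leak_pad0` is tiny enough to hide inside a validation tolerance, while `leak_pad255` is NaN and would poison any accumulator it reaches.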

Set `MXScalePadByte: 255` (E8M0 0xFF = NaN/Inf) on the three subtile MX-format yamls so CI actively exercises the new poison-padding option:
- subtile_mxfp4.yaml (aligned-K, M/N edge coverage)
- subtile_mxfp4_tail.yaml (K%32 tail coverage, all PGR variants)
- subtile_mxfp4_tail_smoke.yaml (fast tail smoke)

Any A/B/MXSA/MXSB lane in the GPU kernel that reads a padded scale position will now pull an E8M0 NaN, propagate it through the MFMA into accD, and trip validation -- both for K-direction padding (subtile tail-loop mask) and for M/N-direction padding (wave-tile OOB lanes).

Locally re-verified that all three yamls still pass with the poison padding active (no spurious failures introduced).

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit c7009b4)

Followups on top of "subtile: address PR review for K-tail loop" addressing the remaining nakajee review comments on PR #7636 and fixing the unit-test failures introduced by the prior commit.

1. Comment 1 (no special OrigCounter==0 treatment in tail loop): move the c=0 skip gate fully upstream into `kernelBodySubtile`. The new `SkipSubtileMainLoop<L>` gate wraps the scheduler-emitted mainLoop (preLoop + mainloop + NGLL + NLL) and bypasses all of them when `OrigLoopCounter == 0`. With the gate in place, SRDs stay at K=0, LWA/LRA stay at buf 0, and accD stays at zero from `initVgprTilesToZero`, so the tail body just runs from setupNewTile's defaults -- no undo needed.

   The legacy `PGRTailC0Reset<L>` block in `_emitTailLoopScaffoldSubtile` (which used to zero accD, subtract one DU from each Srd<tc>, and XOR LWA back to buf 0 for PGR>=2) is removed entirely. The c=0 compare now branches straight into `PGRTailEntry<L>`.

   The legacy `SkipPreLoopGR<L>` gate that lived in the obsolete `_kernelBody` path is removed (subtile kernels never go through `kernelBody`; they use `kernelBodySubtile`).

2. Comment 3 (move kPosBase setup after GR, before wait): relocate the per-lane `kPosBase = tidInK * numMIInUnroll` vector remainder / divide / multiply chain to fire between the tail's global-read issue and the post-GR `s_waitcnt`. The vector math now overlaps with buffer-load memory latency instead of serializing in front of it.

3. Comment 4 (dedupe per-MFMA cndmask): the previous tail body emitted one `v_cndmask_b32` per (mmak, mma1, mma0) MFMA over each ValuA/B input vgpr. Because ValuA depends only on `(mma0, mmak)` and ValuB only on `(mma1, mmak)`, the same A vgprs were cndmasked `len(mma1)` times and the same B vgprs `len(mma0)` times per mmak. The cndmask chain is hoisted out of the inner (mma1, mma0) MFMA grid into the per-mmak block, with a `seenVgpr` set tracking which ValuA/B/MXSA/MXSB vgprs have already been masked. Each unique vgpr now gets exactly one `v_cndmask_b32` per mmak. The downstream MFMA grid runs cndmask-free. The now-redundant `_emitTailLaneMaskApplySubtile` helper is removed (its body is inlined into the scaffold).

4. Comment 5 (clamp Srd<tc>+2 at tail entry): documented as a follow-up TODO at the tail GR site. The buffer NumRecords field is a single linear-byte limit, so a tight per-row K clamp is only achievable for the last M-row of the tile. The per-MFMA lane mask remains the actual correctness mechanism for in-range M rows. Filed for a follow-up PR.

5. CI fix: update `test_subtile_tailloop_emit.py`:
   - `test_emits_c0_reset_compare_and_branch` (PGR=1, PGR=2): now asserts the c=0 compare branches to `PGRTailEntry<L>` (not `PGRTailC0Reset<L>`) and that the legacy reset block is absent.
   - Replace `test_emits_c0_srd_subtract_with_borrow`, `test_emits_c0_lwa_xor_realign`, `test_emits_c0_reset_label_and_accD_zero`, `test_omits_c0_srd_subtract_pgr1`, `test_emits_c0_accD_zero_pgr1`, `_extract_c0_block` with negative-assertion tests `test_omits_c0_reset_label`, `test_omits_c0_srd_subtract`, `test_omits_c0_lwa_xor_undo`, `test_omits_c0_reset_label_pgr1`, `test_omits_c0_undo_instructions_pgr1` that pin the legacy block / instructions as absent.

Validation: full unit suite green (`pytest Tensile/Tests/unit/`: 821 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 664e471)
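
The effect of the `seenVgpr` dedupe in Comment 4 can be counted with a small model. Everything here is an assumption made for illustration: `count_cndmasks` is a hypothetical helper, each operand is reduced to a single named entry rather than a group of vgprs, and MXSA/MXSB are omitted. It covers one `mmak` step.

```python
def count_cndmasks(num_mma0, num_mma1, dedupe):
    """Count mask instructions emitted for one mmak step (hypothetical model)."""
    seen = set()
    emitted = 0
    for mma1 in range(num_mma1):
        for mma0 in range(num_mma0):
            # ValuA depends only on mma0, ValuB only on mma1 (for a fixed mmak).
            for vgpr in (("ValuA", mma0), ("ValuB", mma1)):
                if dedupe:
                    if vgpr in seen:
                        continue  # already masked for this mmak
                    seen.add(vgpr)
                emitted += 1
    return emitted


before = count_cndmasks(num_mma0=4, num_mma1=2, dedupe=False)  # one per MFMA input
after = count_cndmasks(num_mma0=4, num_mma1=2, dedupe=True)    # one per unique operand
```

In this toy grid the count drops from the product of the grid dimensions (times two operands) to their sum, which is the redundancy the review comment pointed at.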

Two nakajee review comments on `subtile_mxfp4_tail.yaml`:

1. Section 3 (DU=512): "Would you please add more sizes for DU512? (similar coverage as section 2 above)"

   Section 3 previously had just K_rem in {32, 256}, a proof-of-life only. Expand to mirror Section 2's K_rem coverage (which exercises partial-first-MFMA, full-last-MFMA, partial-near-full, and the N*DU + tail entry-gate paths). New K values for DU=512:
   - K_rem=96, 128, 224, 384 (K_rem < DU coverage)
   - K = 1*512 + 128 = 640 (1 main iter + full-last-MFMA tail; PGR=2 small-counter realign path)
   - K = 2*512 + 32 = 1056 (2 main iters + tail; PGR=2 large-counter SRD-advance path)

   Section 3 now runs 8 K_rem × PGR={0,2} = 16 problem runs (was 4).

2. Section 1 decorative tail: "Could we have some more iteration cases? 2 iteration + Tail, 3 iteration + Tail"

   Section 1 previously only had K = 1*DU + 32 = 288 (1 main iter + tail; PGR=2 small-counter realign). Add the multi-iter analogues:
   - K = 2*256 + 32 = 544 (PGR=2 large-counter SRD-advance)
   - K = 3*256 + 32 = 800 (PGR>=2 large-counter, NLL drained)

   Adds 2 K values × PGR={0,1,2} = 6 problem runs.

Header coverage summary and Section 3's banner updated to match the new content. YAML parses cleanly; existing `Tensile/Tests/unit/` subtile suite stays green (116 passed, 1 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 210a5bb)
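
The K values listed above follow directly from splitting K into whole DepthU iterations plus a remainder. The helper names below are hypothetical, and the three-way path naming is taken from the commit messages at this point in the series (a later commit folds the first two cases together); this is a sketch for checking the arithmetic, not code from the test suite.

```python
def split_k(k, depth_u):
    """Return (main-loop iterations, K remainder) for a K % 32 == 0 problem."""
    if k % 32 != 0:
        raise ValueError("the MX tail path described here requires K % 32 == 0")
    return divmod(k, depth_u)


def entry_path(orig_counter, pgr):
    """Name the PGR>0 tail entry path as described in the commit messages."""
    if orig_counter == 0:
        return "c0"
    if orig_counter < pgr:
        return "small-counter"
    return "large-counter"


du512_one_iter = split_k(640, 512)    # 1 main iter + 128 tail
du512_two_iter = split_k(1056, 512)   # 2 main iters + 32 tail
du256_three_iter = split_k(800, 256)  # 3 main iters + 32 tail
path_640 = entry_path(du512_one_iter[0], pgr=2)
path_1056 = entry_path(du512_two_iter[0], pgr=2)
```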

CI hipBLASLt build on PR #7636 fails with:

    RuntimeError: Unknown assertion key: AssertFree1DivByMT1LowbitGT1

at `Contractions.py:ProblemPredicate.FromOriginalKeyPair` while `parseLibraryLogicData` walks a gfx950 logic YAML.

Root cause: PR #7443 ("manual revert KRingShift") removed the parser handlers for `AssertFree1DivByMT1LowbitGT1` / `AssertKRingShiftTailWrapOnly`, and PR #7513 cleaned the matching keys out of the hipBLASLt / hipSPARSELt library logic YAML files. Both reverts are in our branch's history, but origin/develop has new gfx950 logic YAMLs (e.g. `gfx950/gfx950_128cu/Equality/gfx950_Cijk_Alik_Bljk_HHS_BH_BiasSH_HAS_SAV_UserArgs.yaml`, added by #7125 "update cucount") that were generated *before* the revert pair and therefore still carry the deprecated keys. When CI merges our branch with origin/develop the merged tree contains a YAML the parser can no longer read.

Fix: silently ignore the two deprecated keys in `ProblemPredicate.FromOriginalKeyPair`. Matches the spirit of the revert + cleanup pair without requiring follow-up YAML cleanup in develop, and is forward-compatible against any future stale YAML that slips through the same gap. Genuinely unknown `Assert*` keys still raise loudly.

New unit tests in `test_Contractions_deprecated_asserts.py` pin the silent-ignore behavior for the two specific keys, verify that unrecognized `Assert*` keys still raise, and check that the existing known `AssertFree0/1ElementMultiple` + `AssertSummationElementMultiple` predicate path is unaffected.

Validation:
- new tests: 7 passed
- full unit suite: 828 passed, 5 skipped, 2 xfailed (was 821 passed)
- end-to-end smoke: drove `FromOriginalKeyPair` on a representative stale-key entry set; no `RuntimeError` raised; recognized keys still produce real predicates.

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 9d034c9)
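
The shape of the fix can be sketched as an allow-list of deprecated keys checked before the unknown-key error. This is a simplified stand-in, not the code in `Contractions.py`: the function name, the tuple return value, and the handling of non-`Assert*` keys are assumptions; only the two deprecated key names, the three known key names, and the error message come from the commit message.

```python
DEPRECATED_ASSERT_KEYS = frozenset({
    "AssertFree1DivByMT1LowbitGT1",
    "AssertKRingShiftTailWrapOnly",
})
KNOWN_ASSERT_KEYS = frozenset({
    "AssertFree0ElementMultiple",
    "AssertFree1ElementMultiple",
    "AssertSummationElementMultiple",
})


def predicate_from_key_pair(key, value):
    """Return a (key, value) predicate, None to skip, or raise on unknown keys."""
    if key in DEPRECATED_ASSERT_KEYS:
        return None  # stale YAML entry: ignore silently
    if key in KNOWN_ASSERT_KEYS:
        return (key, value)  # placeholder for a real predicate object
    if key.startswith("Assert"):
        raise RuntimeError(f"Unknown assertion key: {key}")
    return None  # non-assert keys are out of scope for this sketch


stale = predicate_from_key_pair("AssertFree1DivByMT1LowbitGT1", 1)
known = predicate_from_key_pair("AssertSummationElementMultiple", 32)
```

Keeping the deprecated set explicit, rather than ignoring every unrecognized key, preserves the loud failure for genuine typos.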

Addresses the latest nakajee review comment on PR #7636:

> module.add(SMovB32(dst=sgpr("LoopCounterL"), src=0,
>     comment="single-iter tail: force closeLoop fall-through"))
> module.add(self.closeLoop(kernel, tensorParametersA, tensorParametersB,
>     -1, finalLoop=True))
>
> Seems like closeLoop here is not necessary.

The subtile tail body processes the entire K_tail in a single pass via the `mmak` loop (every lane mask is emitted against the current `LoopCounterL = K mod DU` snapshot), so `closeLoop(... finalLoop=True)` would only contribute dead code:
- per-iter `s_sub_i32 LoopCounterL, ..., MIK` decrement,
- `s_cbranch_scc0 TailLoopBeginL` back-edge (no-op since we intentionally underflowed LoopCounterL to negative),
- `OrigLoopCounter += MIK` increment (only consumed by the legacy LRO-damage recovery block, which is already bypassed for `UseSubtileImpl=1` kernels inside `closeLoop`),
- `TailLoopEndL:` label (nothing branches to it; the only early-exit target is `SkipTailLoopL:`, emitted by the surviving `closeLoop(... emitEndLabelOnly=True)` call).

Dropping the `closeLoop(... finalLoop=True)` call also removes the need for the `s_mov_b32 LoopCounterL, 0` workaround that was there purely to force the now-removed back-edge to fall through.

Net asm delta per kernel (subtile tail body):
- 1 `s_mov_b32 LoopCounterL, 0`
- 1 `s_sub_i32 LoopCounterL, LoopCounterL, MIK`
- 1 `s_add_u32 OrigLoopCounter, OrigLoopCounter, MIK`
- 1 `s_cmp_le_i32 LoopCounterL, 0`
- 1 `s_cbranch_scc0 label_TailLoopBeginL`
- 1 `label_TailLoopEndL:` label

All removed.

Test updates:
- `test_emits_loop_counter_zero_before_closeloop` (PGR=0) is inverted into `test_omits_closeloop_emit`: asserts that the `s_sub_i32 LoopCounterL ... dec counterL` decrement, the `s_mov_b32 LoopCounterL, 0` workaround, and `TailLoopEndL` are all absent from the tail block.
- `test_pgr2_loop_counter_zero_before_closeloop` (PGR=2) is inverted into `test_pgr2_omits_closeloop_emit` with the same set of negative assertions.
- `_emitTailLoopScaffoldSubtile` docstring updated to enumerate the three PGR>0 entry-gate paths (origCounter==0 / small / large) and to document that closeLoop is intentionally not emitted by the subtile scaffold.
- `TestEmitAllLoopsTail_PGR0` class docstring and `test_omits_tail_when_NoTailLoop_true` docstring updated to note that `TailLoopEndL` is never emitted (was previously only asserted absent under NoTailLoop=True).

Full unit suite: 828 passed, 5 skipped, 2 xfailed.

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 11173ca)

Addresses the second-to-last open nakajee review comment on PR #7636:

> module.add(SCmpLtU32(
>     src0=sgpr(savedOrigCounterSgpr),
>     src1=pgr,
>     comment="origCounter < PGR?"))
> module.add(SCBranchSCC1(
>     ...
>
> We should not need special treatment for LoopCounter < 2 case.
> SrdA/B/MXSA/MXSB should be already adjusted in preloop.

Background
----------
For PGR=2 origCounter==1 the previous tail scaffold emitted a "small-counter realign" block that XOR'd `LocalWriteBaseAddr{A,B,MXSA,MXSB}` back with their `Swap<tc>` masks to re-align LWA with the LDS buffer NLL drained from. The misalignment came from preLoop:

    GR(MT 0)
    GR_INC              <-- SRD advance + LWA toggle to buf 1
    Wait + Sync
    LR                  drains MT 0 from buf 0
    SkipOp(LE 1, NLL)   <-- for origCounter==1, branches here
    GR(MT 1)            writes buf 1
    SkipOp(LE 2, NGLL)

The `GR_INC`'s LDS toggle still fired even when `SkipOp(LE 1, NLL)` bypassed `GR(MT 1)`, so on the c==1 path NLL ended with LWA at buf 1 while LR was still at buf 0 -- the tail had to XOR LWA back.

Fix (per the review)
--------------------
Split the compound `GRIncOp` into two ops just for preLoop's use:
- `GRPtrIncOp`: SRD pointer advance only.
- `GRLDSwapOp`: LDS double-buffer XOR only.

The mainloop keeps the compound `GRIncOp` (its two halves are always paired one-to-one with an iteration). preLoop is restructured so the SRD advance fires before `SkipOp(LE 1, NLL)` (the tail still needs SRD at K=DU for c==1) and the LDS swap fires AFTER the SkipOp, right before `GR(MT 1)`:

    GR(MT 0)
    gr_ptr_inc (per tensor)    <-- SRD advance, always fires
    Wait + Sync
    LR                         drains MT 0 from buf 0
    SkipOp(LE 1, NLL)          <-- branch over both lds_swap and GR(MT 1) for c==1
    gr_lds_swap (per tensor)   <-- LDS toggle, skipped for c==1
    GR(MT 1)                   writes buf 1
    SkipOp(LE 2, NGLL)

With the swap gated by the same SkipOp that gates GR(MT 1), LWA and LR stay aligned for every origCounter:
- c==0: upstream SkipSubtileMainLoop gate skips all of preLoop; LWA at initial buf 0, LR untouched.
- c==1: SkipOp(LE 1) taken -> no LDS toggle; LWA at buf 0, LR at buf 0 (NLL drained MT 0 from buf 0).
- c==2: SkipOp(LE 1) not taken, LDS toggle fires (LWA->buf 1), GR(MT 1) writes buf 1; SkipOp(LE 2, NGLL) taken; NGLL/NLL drain MT 1 from buf 1; LR ends at buf 1.
- c>=3: SkipOp(LE 1)/SkipOp(LE 2) both not taken; mainloop runs (c-2) iters, each toggling LWA and LR; iter-pair cancellation keeps LWA and LR aligned at exit.

Tail-scaffold simplification
----------------------------
With the preLoop fix in place, the tail's small-counter realign block (`SXorB32 LocalWriteBaseAddr<tc>, ..., Swap<tc>`) becomes unreachable / dead and is removed. The c==0 and small-counter compares also collapse into a single `s_cmp_lt_u32 origCounter, PGR` -- both cases need no SRD/LWA fix-up, only the large-counter case (`origCounter >= PGR`) still needs the existing +1 DU SRD advance to undo the per-iter GR_INC off-by-one at loop exit.

Net asm delta per PGR=2 kernel (subtile tail body):
- 1 `s_cmp_eq_u32 origCounter, 0` compare [removed]
- 1 `s_cbranch_scc1 PGRTailEntry` (c=0 branch) [removed]
- `PGRTailSmallCounterRealign<L>:` label [removed]
- 4 `s_xor_b32 LocalWriteBaseAddr<tc>, ..., Swap<tc>` (per A/B/MXSA/MXSB) [removed]
- 1 `s_branch PGRTailEntry` (large-counter fall-through bypass) [removed]

Test updates
------------
- LogicalScheduler.py: new `GRPtrIncOp` and `GRLDSwapOp` dataclasses with `gr_ptr_inc` / `gr_lds_swap` kinds; `build_preloop` PGR=2 branch uses the split pair around `SkipOp(LE 1, NLL)`.
- InstructionEmitter.py: new `emit_gr_ptr_inc` (ptr-only) and `emit_gr_lds_swap` (lds-only) handlers wired into the dispatch table.
- KernelWriter.py `_emitTailLoopScaffoldSubtile`: small-counter realign block removed; c=0 and `<PGR` compares folded into one `s_cmp_lt_u32 origCounter, PGR`; docstring + inline comments updated.
- test_subtile_tailloop_emit.py: invert `test_emits_small_counter_lwa_realign` to assert absence; replace `test_emits_c0_reset_compare_and_branch` (PGR=1 & PGR=2) with `test_emits_lt_pgr_compare_and_branch`; rename `test_emits_small_counter_compare` to `test_emits_lt_pgr_compare_immediate` and update its docstring; refresh class / module docstrings.
- test_SubtileBasedLogicalScheduler.py: extend `TestBuildPreloop.test_256x256_fp4` to require both `gr_ptr_inc` and `gr_lds_swap` (and forbid the legacy `gr_inc`); add a strict ordering invariant pinning `gr_lds_swap` AFTER `skip(LE:1:NLL)`.
- test_SubtileBasedSchedulerRef.py: refresh the three preloop golden references (256x256 FP4 1x1; 320x320 BF16 1x5 offset1 and offset_all) to show `gr_ptr_inc` ... skip ... `gr_lds_swap` in place of the old `gr_inc` ... skip ... shape.

Validation: full unit suite green (828 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com> (cherry picked from commit 78e5093)
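
The buffer-alignment argument in the commit message above can be checked with a toy state machine. This is a deliberately coarse model built only from the prose: `preloop_buffers` is a hypothetical function, each LDS buffer index is a single bit, PGR is fixed at 2, and SRD movement is ignored entirely. It tracks where LWA and LR point when the tail is entered.

```python
def preloop_buffers(orig_counter, split_swap):
    """Return (lwa_buf, lr_buf) at tail entry for PGR=2 (toy model).

    split_swap=False models the compound GR_INC before SkipOp(LE 1, NLL);
    split_swap=True models gr_ptr_inc before and gr_lds_swap after it.
    """
    lwa = lr = 0
    if orig_counter == 0:
        return lwa, lr  # upstream gate skips all of preLoop
    if not split_swap:
        lwa ^= 1  # compound GR_INC toggles LWA before the skip is evaluated
    if orig_counter == 1:
        return lwa, lr  # SkipOp(LE 1, NLL) taken; NLL drained buf 0
    if split_swap:
        lwa ^= 1  # gr_lds_swap fires only when GR(MT 1) will also run
    lr ^= 1  # NGLL/NLL drain MT 1 from buf 1
    for _ in range(orig_counter - 2):
        lwa ^= 1  # each mainloop iteration toggles both pointers
        lr ^= 1
    return lwa, lr


old_c1 = preloop_buffers(1, split_swap=False)  # the misaligned case
new_c1 = preloop_buffers(1, split_swap=True)
new_all_aligned = all(
    lwa == lr for lwa, lr in (preloop_buffers(c, True) for c in range(8))
)
```

In this model only origCounter==1 is misaligned with the compound op, which is consistent with the realign block having been needed for exactly that case.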

bnemanich requested a review from a team as a code owner May 20, 2026 14:25

github-actions Bot added project: hipblaslt project: hipsparselt ci:hipsparselt-fast labels May 20, 2026

assistant-librarian Bot added the organization: ROCm label May 20, 2026

nakajee reviewed May 20, 2026


nakajee reviewed May 21, 2026


Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriter.py Outdated

bnemanich and others added 12 commits May 25, 2026 07:44

bnemanich force-pushed the users/bnemanich/subtile-tailloop-k32-rebased branch from 78e5093 to c19651b Compare May 25, 2026 19:49

bnemanich mentioned this pull request May 27, 2026

[Hipblaslt] Allow Subtile path to use BF16 any-K and MX K%32 tail loop #7782

Open


[hipBLASLt] [TensileLite] Add initial tail loop support for Subtile path #7636
bnemanich wants to merge 12 commits into develop from users/bnemanich/subtile-tailloop-k32-rebased
