Skip to content

[hipBLASLt] [TensileLite] Add initial tail loop support for Subtile path#7636

Open
bnemanich wants to merge 12 commits into
developfrom
users/bnemanich/subtile-tailloop-k32-rebased
Open

[hipBLASLt] [TensileLite] Add initial tail loop support for Subtile path#7636
bnemanich wants to merge 12 commits into
developfrom
users/bnemanich/subtile-tailloop-k32-rebased

Conversation

@bnemanich
Copy link
Copy Markdown
Contributor

Summary

Lifts the subtile-impl K constraint from K % DepthU == 0 to K % 32 == 0 by
wiring a real K%32 tail loop into the BF16 and MXFP4 subtile emit path. Adds the
host-side scale-tensor fixup that the new tail kernels need on gfx950, plus end-to-end
yaml fixtures and unit tests that exercise the new tail behavior and pre-existing
non-tail behavior.
Before this PR, subtile MX kernels silently rounded K up to a multiple of DepthU
(256 / 512) on the host (minASEMforMX=256 + NoTailLoop=True), so any problem
with K % DepthU != 0 ran on padded data. After this PR, subtile kernels honor
the same ASEM=32 minimum as the rest of the MX path and produce correct results
for every K % 32 == 0.

What's in here

The branch is split into four logically-orderable commits:

  1. subtile: drop minASEMforMX=256 conditional for MX path — removes the
    conditional that forced ASEM=256 for subtile MX kernels. On its own this
    would expose the tail-emit gap; the next commit fills it.
  2. subtile: emit K%32 tail loop for BF16 and MXFP4 — the core change.
    • New KernelWriter._emitTailLoopScaffoldSubtile emits the
      LoopCounterL early-exit, per-lane kPosBase, a re-issued
      DepthU-shaped GR + LR for A/B (and MXSA/MXSB), the per-mmak
      v_cmp_ge_i32 + per-MFMA v_cndmask_b32 lane-mask, and MFMAs into
      the existing D accumulators. The body runs as a single pass; the
      legacy closeLoop per-iter decrement is short-circuited by zeroing
      LoopCounterL beforehand.
    • PGR>0 is gated by three mutually-exclusive entry paths keyed off a
      snapshot of OrigLoopCounter (= K // DU) taken before
      calculateLoopNumIter resets it:
      • origCounter == 0: c=0 reset (zero accD, undo preLoop
        GR_INC SRD-advance + LWA-XOR for PGR>=2), fall through.
      • 0 < origCounter < PGR: small-counter LWA realign (XOR back
        to buf 0 to match LR; no SRD advance).
      • origCounter >= PGR: large-counter +1 DU SRD advance to land
        at K_aligned.
    • PGR=0 mainloop gets an underflow pre-guard
      (s_cmp_eq_u32 LoopCounterL, 0 + branch to SkipMainloop) so the
      do-while body is bypassed when K < DepthU.
    • KernelWriterAssembly.closeLoop's legacy LRO/LWA damage-recovery
      references vgprLocalReadAddrA/B symbols that subtile kernels
      never define, so it is bypassed for UseSubtileImpl.
    • Host-side DataInitialization re-scatters the canonical DGen
      scale layout into the padded mxsa/mxsb tensor descriptor (gfx950).
    • Minor touch-ups: name the 4-tiles-per-VGPR scale packing constant
      (MX_SCALE_TILES_PER_VGPR); split per-mmak cmp from per-MFMA
      cndmask via _emitTailKPosCmpSubtile /
      _emitTailLaneMaskApplySubtile; fix
      SubtileScaleEmit.localReadDoScaleSubtile to use
      lrLocalSubtileGrid (not the [1,1]-frozen localSubtileGrid) so
      all scale VGPRs get a ds_read on MT >= 128x128.
  3. subtile: add K-tail yaml fixtures and unit-test coverage — covers the
    new emit path:
    • Tensile/Tests/common/gemm/gfx950/subtile_bf16_tail.yaml: BF16 tail
      across K_rem ∈ {16, 32, 48, 64}, DepthU ∈ {64, 128},
      PGR ∈ {0, 1, 2}, StreamK on/off, MT128x128 and MT256x256.
    • subtile_mxfp4_tail.yaml: MXFP4 tail across the partial-MFMA
      (32, 64, 96, 160, …) and full-last-MFMA (128, 256) K_rem
      partitions, DU=256 and DU=512, PGR / StreamK / bias / dest-type
      cells (~14 kernels, ~80 problem runs).
    • subtile_mxfp4_tail_smoke.yaml: a 1-kernel smoke variant.
    • Unit tests:
      • test_subtile_tailloop_emit.py — tail-body content assertions
        (no SRD rewind on PGR=0, single-iter forcing, kPosBase init,
        lane mask on A/B/MXSA/MXSB, one ds_read per scale group, no
        LDS pre-zero, PGR>0 entry-gate branches).
      • test_solution_subtile_tailloop.py — solution-level gating
        (ASEM=32 + NoTailLoop=False for subtile MX).
      • test_SubtileBasedLogicalScheduler.py — PGR=0 mainloop
        underflow pre-guard structural assertions + tail-loop
        scaffold placement.
      • _subtile_tailloop_fixtures.py — shared kwa-builder, kernel-keys
        helper, and tile-info population used by the above.
  4. subtile: dedupe scale re-scatter in DataInitialization — extracts the
    A-side and B-side scale-re-scatter loops in
    client/src/DataInitialization.cpp into a single static helper,
    rearrangePaddedMXScaleLayout. No behavioral change.

Scope of impact

  • Subtile MX kernels with K % DepthU != 0: previously broken (host
    silently rounded K up), now correct. New behavior.
  • Subtile MX kernels with K % DepthU == 0: assembled identically
    to before — the tail body is gated on LoopCounterL != 0 and the
    PGR>0 entry-path snapshot reduces to the existing fast path. Verified
    by static assembly diff on representative kernels.
  • Subtile BF16 kernels with K % DepthU != 0: previously rejected
    by gating, now emit and run a tail body.
  • Non-subtile kernels: untouched.
  • Host-side data init: the re-scatter only fires on gfx950 when the
    padded mxsa/mxsb stride genuinely diverges from the canonical DGen
    stride; otherwise it short-circuits. Unaffected on non-MX paths.

Test plan

  • CI: existing unit + e2e gfx950 jobs stay green
  • New unit tests pass:
    pytest projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_subtile_tailloop_emit.py projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_solution_subtile_tailloop.py projects/hipblaslt/tensilelite/Tensile/Tests/unit/test_SubtileBasedLogicalScheduler.py
  • New yaml fixtures run clean on gfx950:
    subtile_bf16_tail.yaml, subtile_mxfp4_tail.yaml,
    subtile_mxfp4_tail_smoke.yaml
  • Spot-check: pre-existing subtile MX yamls with K % DepthU == 0
    produce byte-identical kernel binaries (or only the entry-gate
    snapshot diff).

Submission Checklist

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 20, 2026

Codecov Report

❌ Patch coverage is 91.20370% with 19 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py 94.01% 9 Missing and 1 partial ⚠️
...ects/hipblaslt/tensilelite/Tensile/ClientWriter.py 0.00% 6 Missing ⚠️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py 60.00% 2 Missing ⚠️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 0.00% 1 Missing ⚠️

❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7636      +/-   ##
===========================================
+ Coverage    61.95%   61.96%   +0.01%     
===========================================
  Files         2086     2086              
  Lines       357070   357436     +366     
  Branches     53779    53884     +105     
===========================================
+ Hits        221221   221469     +248     
- Misses      117055   117132      +77     
- Partials     18794    18835      +41     
Flag Coverage Δ *Carryforward flag
TensileLite 26.73% <91.20%> (+0.29%) ⬆️
hipBLAS 90.65% <ø> (ø) Carriedforward from e0ae7a1
hipBLASLt 41.27% <ø> (ø) Carriedforward from e0ae7a1
hipCUB 82.21% <ø> (ø) Carriedforward from e0ae7a1
hipDNN 85.87% <ø> (ø) Carriedforward from e0ae7a1
hipFFT 51.12% <ø> (ø) Carriedforward from e0ae7a1
hipRAND 76.12% <ø> (ø) Carriedforward from e0ae7a1
hipSOLVER 69.24% <ø> (ø) Carriedforward from e0ae7a1
hipSPARSE 85.09% <ø> (ø) Carriedforward from e0ae7a1
rocBLAS 48.11% <ø> (ø) Carriedforward from e0ae7a1
rocFFT 52.07% <ø> (ø) Carriedforward from e0ae7a1
rocRAND 57.03% <ø> (ø) Carriedforward from e0ae7a1
rocSOLVER 77.83% <ø> (ø) Carriedforward from e0ae7a1
rocSPARSE 72.68% <ø> (ø) Carriedforward from e0ae7a1

*This pull request uses carry forward flags. Click here to find out more.

Files with missing lines Coverage Δ
...slt/tensilelite/Tensile/Common/GlobalParameters.py 85.00% <100.00%> (+0.14%) ⬆️
...e/Tensile/Components/Subtile/InstructionEmitter.py 96.13% <100.00%> (+3.22%) ⬆️
...ite/Tensile/Components/Subtile/LogicalScheduler.py 88.27% <100.00%> (+0.72%) ⬆️
...ite/Tensile/Components/Subtile/SubtileScaleEmit.py 35.83% <100.00%> (+7.33%) ⬆️
...ects/hipblaslt/tensilelite/Tensile/Contractions.py 71.30% <100.00%> (+1.81%) ⬆️
...lt/tensilelite/Tensile/SolutionStructs/Solution.py 10.26% <100.00%> (+2.69%) ⬆️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py 7.89% <0.00%> (+0.34%) ⬆️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py 68.27% <60.00%> (+4.53%) ⬆️
...ects/hipblaslt/tensilelite/Tensile/ClientWriter.py 37.57% <0.00%> (-0.29%) ⬇️
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py 11.67% <94.01%> (+2.09%) ⬆️

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented May 20, 2026

It does not have code to skip first prefetch GR for sgprLoopCounterL=0 case.

@nakajee
Copy link
Copy Markdown
Contributor

nakajee commented May 20, 2026

I think it might be good to have an option to put some non 0 value for MX swizzle padding.
Otherwise, we might not be able to confirm if A,B masking is correct or not.

if kernel["ProblemType"].get("MXBlockB", 0) > 0:
realignTcs.append('MXSB')

c0ResetLabel = Label("PGRTailC0Reset%s" % loopChar, "")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need any special treatment for OrigCount==0 case in tailloop.
Init acc should be done in preloop.

src0=sgpr(savedOrigCounterSgpr),
src1=pgr,
comment="origCounter < PGR?"))
module.add(SCBranchSCC1(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need special treatment for LoopCounter < 2 case.
SrdA/B/MXSA/MXSB should be already adjusted in preloop.


kPosBaseVgpr = self.vgprPool.checkOut(1, "kReg_first")
with self.allocTmpSgpr(1) as tmpSgprInfo:
module.add(vectorStaticRemainder(-1, kPosBaseVgpr, "Serial",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can put this after GR and before wait.

module = Module("tailLaneMaskApplySubtile")
laneSGPRCount = self.states.laneSGPRCount
for idx in aTile.regList.indices:
module.add(VCndMaskB32(dst=vgpr(idx), src0=vgpr(idx), src1=0,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do not need to generate cndmask against same vgpr more than once.
Please add done check or something to avoid generating cndmask for same vgpr twice or more.

for tailTile in tailAllocTiles:
tailTile.allocVgprTileRegisters_legacy(self, kernel)

# Re-issue one DepthU-shaped GR + LR. Byte-layout identical to a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should adjust SrdA/B/MXSA/MXSB + 2 to avoid out of range access.

# K_rem=32 (tail-only, K << DU)
- Exact: [128, 128, 1, 32]
# K_rem=256 (2 full last MFMAs, K_rem mod 128 == 0)
- Exact: [128, 128, 1, 256]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add more sizes for DU512?
(similar coverage as section 2 above)

# Decorative: batched + partial-MFMA tail
- Exact: [128, 128, 2, 160]
# Decorative: odd M,N + smallest tail
- Exact: [ 63, 63, 1, 32]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have some more iteration cases?

  • 2 iteration + Tail
  • 3 iteration + Tail

bnemanich added a commit that referenced this pull request May 20, 2026
Followups on top of "subtile: address PR review for K-tail loop"
addressing the remaining nakajee review comments on PR #7636 and
fixing the unit-test failures introduced by the prior commit.

1. Comment 1 (no special OrigCounter==0 treatment in tail loop):
   move the c=0 skip gate fully upstream into `kernelBodySubtile`.
   The new `SkipSubtileMainLoop<L>` gate wraps the scheduler-emitted
   mainLoop (preLoop + mainloop + NGLL + NLL) and bypasses all of
   them when `OrigLoopCounter == 0`. With the gate in place, SRDs
   stay at K=0, LWA/LRA stay at buf 0, and accD stays at zero from
   `initVgprTilesToZero`, so the tail body just runs from
   setupNewTile's defaults -- no undo needed.

   The legacy `PGRTailC0Reset<L>` block in `_emitTailLoopScaffoldSubtile`
   (which used to zero accD, subtract one DU from each Srd<tc>, and
   XOR LWA back to buf 0 for PGR>=2) is removed entirely. The c=0
   compare now branches straight into `PGRTailEntry<L>`.

   The legacy `SkipPreLoopGR<L>` gate that lived in the obsolete
   `_kernelBody` path is removed (subtile kernels never go through
   `kernelBody`; they use `kernelBodySubtile`).

2. Comment 3 (move kPosBase setup after GR, before wait):
   relocate the per-lane `kPosBase = tidInK * numMIInUnroll` vector
   remainder / divide / multiply chain to fire between the tail's
   global-read issue and the post-GR `s_waitcnt`. The vector math
   now overlaps with buffer-load memory latency instead of
   serializing in front of it.

3. Comment 4 (dedupe per-MFMA cndmask):
   the previous tail body emitted one `v_cndmask_b32` per
   (mmak, mma1, mma0) MFMA over each ValuA/B input vgpr. Because
   ValuA depends only on `(mma0, mmak)` and ValuB only on
   `(mma1, mmak)`, the same A vgprs were cndmasked `len(mma1)`
   times and the same B vgprs `len(mma0)` times per mmak.

   The cndmask chain is hoisted out of the inner (mma1, mma0) MFMA
   grid into the per-mmak block, with a `seenVgpr` set tracking
   which ValuA/B/MXSA/MXSB vgprs have already been masked. Each
   unique vgpr now gets exactly one `v_cndmask_b32` per mmak. The
   downstream MFMA grid runs cndmask-free.

   The now-redundant `_emitTailLaneMaskApplySubtile` helper is
   removed (its body is inlined into the scaffold).

4. Comment 5 (clamp Srd<tc>+2 at tail entry):
   documented as a follow-up TODO at the tail GR site. The buffer
   NumRecords field is a single linear-byte limit, so a tight
   per-row K clamp is only achievable for the last M-row of the
   tile. The per-MFMA lane mask remains the actual correctness
   mechanism for in-range M rows. Filed for a follow-up PR.

5. CI fix:
   update `test_subtile_tailloop_emit.py`:
     - `test_emits_c0_reset_compare_and_branch` (PGR=1, PGR=2): now
       asserts the c=0 compare branches to `PGRTailEntry<L>`
       (not `PGRTailC0Reset<L>`) and that the legacy reset block is
       absent.
     - Replace `test_emits_c0_srd_subtract_with_borrow`,
       `test_emits_c0_lwa_xor_realign`,
       `test_emits_c0_reset_label_and_accD_zero`,
       `test_omits_c0_srd_subtract_pgr1`,
       `test_emits_c0_accD_zero_pgr1`,
       `_extract_c0_block`
       with negative-assertion tests `test_omits_c0_reset_label`,
       `test_omits_c0_srd_subtract`, `test_omits_c0_lwa_xor_undo`,
       `test_omits_c0_reset_label_pgr1`,
       `test_omits_c0_undo_instructions_pgr1` that pin the legacy
       block / instructions as absent.

Validation: full unit suite green (`pytest Tensile/Tests/unit/`:
821 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich added a commit that referenced this pull request May 20, 2026
CI hipBLASLt build on PR #7636 fails with:

  RuntimeError: Unknown assertion key: AssertFree1DivByMT1LowbitGT1

at `Contractions.py:ProblemPredicate.FromOriginalKeyPair` while
`parseLibraryLogicData` walks a gfx950 logic YAML.

Root cause: PR #7443 ("manual revert KRingShift") removed the parser
handlers for `AssertFree1DivByMT1LowbitGT1` /
`AssertKRingShiftTailWrapOnly`, and PR #7513 cleaned the matching
keys out of the hipBLASLt / hipSPARSELt library logic YAML files.
Both reverts are in our branch's history, but origin/develop has new
gfx950 logic YAMLs (e.g. `gfx950/gfx950_128cu/Equality/
gfx950_Cijk_Alik_Bljk_HHS_BH_BiasSH_HAS_SAV_UserArgs.yaml`, added by
#7125 "update cucount") that were generated *before* the revert pair
and therefore still carry the deprecated keys. When CI merges our
branch with origin/develop the merged tree contains a YAML the
parser can no longer read.

Fix: silently ignore the two deprecated keys in
`ProblemPredicate.FromOriginalKeyPair`. Matches the spirit of the
revert + cleanup pair without requiring follow-up YAML cleanup in
develop, and is forward-compatible against any future stale YAML
that slips through the same gap. Genuinely unknown `Assert*` keys
still raise loudly.

New unit tests in `test_Contractions_deprecated_asserts.py` pin the
silent-ignore behavior for the two specific keys, verify that
unrecognized `Assert*` keys still raise, and check that the existing
known `AssertFree0/1ElementMultiple` + `AssertSummationElementMultiple`
predicate path is unaffected.

Validation: - new tests: 7 passed
  - full unit suite: 828 passed, 5 skipped, 2 xfailed (was 821 passed)
  - end-to-end smoke: drove `FromOriginalKeyPair` on a representative
    stale-key entry set; no `RuntimeError` raised; recognized keys
    still produce real predicates.
Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread projects/hipblaslt/tensilelite/Tensile/KernelWriter.py Outdated
bnemanich added a commit that referenced this pull request May 21, 2026
Addresses the latest nakajee review comment on PR #7636:

> module.add(SMovB32(dst=sgpr("LoopCounterL"), src=0,
>     comment="single-iter tail: force closeLoop fall-through"))
> module.add(self.closeLoop(kernel, tensorParametersA, tensorParametersB,
>     -1, finalLoop=True))
>
> Seems like closeLoop here is not necessary.

The subtile tail body processes the entire K_tail in a single pass
via the `mmak` loop (every lane mask is emitted against the current
`LoopCounterL = K mod DU` snapshot), so `closeLoop(... finalLoop=True)`
would only contribute dead code:

  - per-iter `s_sub_i32 LoopCounterL, ..., MIK` decrement,
  - `s_cbranch_scc0 TailLoopBeginL` back-edge (no-op since we
    intentionally underflowed LoopCounterL to negative),
  - `OrigLoopCounter += MIK` increment (only consumed by the legacy
    LRO-damage recovery block, which is already bypassed for
    `UseSubtileImpl=1` kernels inside `closeLoop`),
  - `TailLoopEndL:` label (nothing branches to it — the only
    early-exit target is `SkipTailLoopL:`, emitted by the surviving
    `closeLoop(... emitEndLabelOnly=True)` call).

Dropping the `closeLoop(... finalLoop=True)` call also removes the
need for the `s_mov_b32 LoopCounterL, 0` workaround that was there
purely to force the now-removed back-edge to fall through.

Net asm delta per kernel (subtile tail body):
  - 1 `s_mov_b32 LoopCounterL, 0`
  - 1 `s_sub_i32 LoopCounterL, LoopCounterL, MIK`
  - 1 `s_add_u32 OrigLoopCounter, OrigLoopCounter, MIK`
  - 1 `s_cmp_le_i32 LoopCounterL, 0`
  - 1 `s_cbranch_scc0 label_TailLoopBeginL`
  - 1 `label_TailLoopEndL:` label
All removed.

Test updates:
  - `test_emits_loop_counter_zero_before_closeloop` (PGR=0) is
    inverted into `test_omits_closeloop_emit`: asserts that the
    `s_sub_i32 LoopCounterL ... dec counterL` decrement, the
    `s_mov_b32 LoopCounterL, 0` workaround, and `TailLoopEndL` are
    all absent from the tail block.
  - `test_pgr2_loop_counter_zero_before_closeloop` (PGR=2) is
    inverted into `test_pgr2_omits_closeloop_emit` with the same
    set of negative assertions.
  - `_emitTailLoopScaffoldSubtile` docstring updated to enumerate
    the three PGR>0 entry-gate paths (origCounter==0 / small /
    large) and to document that closeLoop is intentionally not
    emitted by the subtile scaffold.
  - `TestEmitAllLoopsTail_PGR0` class docstring and
    `test_omits_tail_when_NoTailLoop_true` docstring updated to
    note that `TailLoopEndL` is never emitted (was previously only
    asserted absent under NoTailLoop=True).

Full unit suite: 828 passed, 5 skipped, 2 xfailed.

Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich added a commit that referenced this pull request May 21, 2026
Addresses the second-to-last open nakajee review comment on PR #7636:

> module.add(SCmpLtU32(
>     src0=sgpr(savedOrigCounterSgpr),
>     src1=pgr,
>     comment="origCounter < PGR?"))
> module.add(SCBranchSCC1(
>     ...
>
> We should not need special treatment for LoopCounter < 2 case.
> SrdA/B/MXSA/MXSB should be already adjusted in preloop.

Background
----------

For PGR=2 origCounter==1 the previous tail scaffold emitted a
"small-counter realign" block that XOR'd `LocalWriteBaseAddr{A,B,
MXSA,MXSB}` back with their `Swap<tc>` masks to re-align LWA with
the LDS buffer NLL drained from. The misalignment came from preLoop:

    GR(MT 0)
    GR_INC                <-- SRD advance + LWA toggle to buf 1
    Wait + Sync
    LR drains MT 0 from buf 0
    SkipOp(LE 1, NLL)     <-- for origCounter==1, branches here
    GR(MT 1) writes buf 1
    SkipOp(LE 2, NGLL)

The `GR_INC`'s LDS toggle still fired even when `SkipOp(LE 1, NLL)`
bypassed `GR(MT 1)`, so on the c==1 path NLL ended with LWA at buf 1
while LR was still at buf 0 -- the tail had to XOR LWA back.

Fix (per the review)
--------------------

Split the compound `GRIncOp` into two ops just for preLoop's use:

  - `GRPtrIncOp`:  SRD pointer advance only.
  - `GRLDSwapOp`:  LDS double-buffer XOR only.

The mainloop keeps the compound `GRIncOp` (its two halves are always
paired one-to-one with an iteration). preLoop is restructured so the
SRD advance fires before `SkipOp(LE 1, NLL)` (the tail still needs
SRD at K=DU for c==1) and the LDS swap fires AFTER the SkipOp, right
before `GR(MT 1)`:

    GR(MT 0)
    gr_ptr_inc  (per tensor)        <-- SRD advance, always fires
    Wait + Sync
    LR drains MT 0 from buf 0
    SkipOp(LE 1, NLL)               <-- branch over both lds_swap
                                       and GR(MT 1) for c==1
    gr_lds_swap (per tensor)        <-- LDS toggle, skipped for c==1
    GR(MT 1) writes buf 1
    SkipOp(LE 2, NGLL)

With the swap gated by the same SkipOp that gates GR(MT 1), LWA and
LR stay aligned for every origCounter:

  - c==0:    upstream SkipSubtileMainLoop gate skips all of preLoop;
             LWA at initial buf 0, LR untouched.
  - c==1:    SkipOp(LE 1) taken -> no LDS toggle; LWA at buf 0,
             LR at buf 0 (NLL drained MT 0 from buf 0).
  - c==2:    SkipOp(LE 1) not taken, LDS toggle fires (LWA->buf 1),
             GR(MT 1) writes buf 1; SkipOp(LE 2, NGLL) taken;
             NGLL/NLL drain MT 1 from buf 1; LR ends at buf 1.
  - c>=3:    SkipOp(LE 1)/SkipOp(LE 2) both not taken; mainloop
             runs (c-2) iters, each toggling LWA and LR; iter-pair
             cancellation keeps LWA and LR aligned at exit.

Tail-scaffold simplification
----------------------------

With the preLoop fix in place, the tail's small-counter realign
block (`SXorB32 LocalWriteBaseAddr<tc>, ..., Swap<tc>`) becomes
unreachable / dead and is removed. The c==0 and small-counter
compares also collapse into a single `s_cmp_lt_u32 origCounter,
PGR` -- both cases need no SRD/LWA fix-up, only the large-counter
case (`origCounter >= PGR`) still needs the existing +1 DU SRD
advance to undo the per-iter GR_INC off-by-one at loop exit.

Net asm delta per PGR=2 kernel (subtile tail body):
  - 1 `s_cmp_eq_u32 origCounter, 0` compare      [removed]
  - 1 `s_cbranch_scc1 PGRTailEntry` (c=0 branch)  [removed]
  - `PGRTailSmallCounterRealign<L>:` label       [removed]
  - 4 `s_xor_b32 LocalWriteBaseAddr<tc>, ...,
    Swap<tc>` (per A/B/MXSA/MXSB)                [removed]
  - 1 `s_branch PGRTailEntry` (large-counter
    fall-through bypass)                          [removed]

Test updates
------------

  - LogicalScheduler.py: new `GRPtrIncOp` and `GRLDSwapOp` dataclasses
    with `gr_ptr_inc` / `gr_lds_swap` kinds; `build_preloop` PGR=2
    branch uses the split pair around `SkipOp(LE 1, NLL)`.
  - InstructionEmitter.py: new `emit_gr_ptr_inc` (ptr-only) and
    `emit_gr_lds_swap` (lds-only) handlers wired into the dispatch
    table.
  - KernelWriter.py `_emitTailLoopScaffoldSubtile`: small-counter
    realign block removed; c=0 and `<PGR` compares folded into one
    `s_cmp_lt_u32 origCounter, PGR`; docstring + inline comments
    updated.
  - test_subtile_tailloop_emit.py: invert
    `test_emits_small_counter_lwa_realign` to assert absence;
    replace `test_emits_c0_reset_compare_and_branch` (PGR=1 & PGR=2)
    with `test_emits_lt_pgr_compare_and_branch`; rename
    `test_emits_small_counter_compare` to
    `test_emits_lt_pgr_compare_immediate` and update its docstring;
    refresh class / module docstrings.
  - test_SubtileBasedLogicalScheduler.py: extend
    `TestBuildPreloop.test_256x256_fp4` to require both
    `gr_ptr_inc` and `gr_lds_swap` (and forbid the legacy
    `gr_inc`); add a strict ordering invariant pinning
    `gr_lds_swap` AFTER `skip(LE:1:NLL)`.
  - test_SubtileBasedSchedulerRef.py: refresh the three preloop
    golden references (256x256 FP4 1x1; 320x320 BF16 1x5 offset1
    and offset_all) to show `gr_ptr_inc` ... skip ... `gr_lds_swap`
    in place of the old `gr_inc` ... skip ... shape.

Validation: full unit suite green (828 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich added a commit that referenced this pull request May 21, 2026
Addresses nakajee's open OOR review on PR #7661 (carried forward from
PR #7636 "Comment 5" TODO at the tail GR site):

> This still does not have any prevention for out of array access.
> What we need is set SrdA/B/MXSA/MXSB + 2 to the exact end of array
> (but 4 byte alignment). [...]
> remainK = (k%DepthU); remainKalign = remainK & 0xfffffffe;
> SrdA/B -= (DepthU - remainKalign) * bpe

Before this commit, the subtile K-tail re-issued a DepthU-shaped GR
against an SRD whose NumRecords (Srd<tc>+2) still spanned the full
DepthU bytes after the last live K-element. For K_remain < DepthU,
the last m-row's per-thread `buffer_load_b128` past
`K_remain*bpe` could read past A/B's allocated end-of-array
(buffer-OOB does NOT bail on past-allocation reads; only on past-
NumRecords reads). The per-MFMA lane mask + sub-lane refine zeroed
those VGPRs after the load, so values were correct, but the buffer
engine could still touch unmapped pages and trigger an HSA fault on
non-contiguous A/B allocations.

New `_emitTailSrdTightenSubtile` runs at tail entry (right after the
PGR>0 entry-gate `PGRTailEntry<L>` label, before `openLoop`) and
shrinks each `Srd<tc>+2` by `DepthU*bpe - roundUp(K_remain*bpe,
loadBytesGR)`. The `roundUp(..., loadBytesGR)` (vs nakajee's literal
`& 0xfffffffe`) is what keeps our wide DTL load valid for the
trailing odd-K element on the last m-row: the per-thread load is
B128 (16 B) for bf16/fp16, so a single thread covers up to 8 K
elements -- the load must succeed for the thread that holds
`K_remain - 1`. nakajee's literal align-down assumed the
narrow-trailing-element strategy (`buffer_load_d16_b16 ... lds`
+ lane-0-only); that path is rejected by the gfx950 assembler and
was previously deleted in `7df7d24` ("remove dead bf16 narrow-load
helper"). The align-up variant preserves nakajee's intent (clip
past-K reads on the last m-row) without needing the narrow load.

A single runtime `s_cmp_lt_u32 alignedBytes, DepthU*bpe`
+ `s_cbranch_scc0 TailSrdTightenSkip<L>` short-circuits the SSub
chain when `alignedBytes >= DepthU*bpe` (K_remain close to DepthU
with wide per-thread loads -- the natural SRD limit already covers
every read). The skip label is the join target.

Gating:
  - non-MX (`MXBlock{A,B} == 0`): MX scales have host re-scatter
    padding (`DataInitialization.rearrangePaddedMXScaleLayout`);
    MX data + MXSA/MXSB SRD tightening needs nakajee's
    swizzleBlock-aware formula (`remainK_MX = roundUp(remainK / 256)
    * 256`, `SrdMXSA/B -= (DepthU - remainK_MX) * swizzleSize0 = 32`)
    and is a separate follow-up.
  - non-swizzled A/B (`SwizzleTensor{A,B}=False`): subtile mxfp4
    swizzled A/B need the same swizzleBlock formula; same follow-up.
  - symmetric per-tensor `bpe in {1, 2}` and matching
    `loadWidthGR`; bf16 / fp16 / int8 anyk paths are the immediate
    consumer.

Per-kernel asm delta (bf16 fixture, B128 load, DU=64 -> depthUBytes=128):
  +5 s_* (lshl/add/and/cmp/cbranch)
  +1 s_sub_u32 (delta)
  +2 s_sub_u32 (Srd{A,B}+2 tighten)
  +1 label
8 instructions + 1 label, all inside the tail entry; no per-iter
cost. Statically gated to no-op on MX / swizzled / non-bf16 paths.

No SRD restore is needed: the tail body is the last GR site for
A/B before the kernel epilogue (epilogue uses SrdC / SrdD).

Stale TODO comment at the tail GR site (which previously said the
tightening was filed as a follow-up) is updated to reflect the new
state (A/B done here; MX + swizzled deferred with a pointer to
nakajee's spec).

Tests (`Tensile/Tests/unit/test_subtile_tailloop_emit.py::TestTailSrdTightenSubtile`,
10 new):
  - Pin the emit-time `s_lshl_b32 + s_add_u32 + s_and_b32` aligned-K
    chain, the runtime no-op `s_cmp_lt_u32 + s_cbranch_scc0` skip,
    the `s_sub_u32 SrdA+2 / SrdB+2` tightening (with delta
    precompute), and the `TailSrdTightenSkip<L>:` join label.
  - Strict order: branch < SrdA+2 sub < skip label (so the
    short-circuit actually short-circuits).
  - PGR>0 placement: tightening fires AFTER `PGRTailEntry<L>:` so
    c=0 / small-counter / large-counter paths have all converged
    onto the same SRD state, then BEFORE the tail GR.
  - Negative pins: MX fp4 emits no tightening (MXBlock>0 gate);
    NoTailLoop=True emits no tightening (scaffold early-returns).

Validation: - Full unit suite: 887 passed, 5 skipped, 2 xfailed (was 887 /
    same).
  - gfx950 yaml gauntlet on MI355X:
      subtile_bf16.yaml           : 7082 / 7082 PASS
      subtile_bf16_tail.yaml      :  450 /  450 PASS
      subtile_bf16_anyk_k8.yaml   :  183 /  183 PASS
      subtile_bf16_anyk_k2.yaml   :  183 /  183 PASS
      subtile_bf16_anyk_odd.yaml  :  117 /  117 PASS
      subtile_bf16_anyk_largemt.yaml:  4 /    4 PASS
      subtile_mxfp4.yaml          : 2691 / 2691 PASS
      subtile_mxfp4_tail.yaml     :   68 /   68 PASS
      subtile_mxfp4_tail_smoke.yaml: PASS
    Total ~10,778 problem runs, 0 failures.
Co-authored-by: Cursor <cursoragent@cursor.com>
bnemanich and others added 12 commits May 25, 2026 07:44
The subtile branch of minASEMforMX forced ASEM to 256 (= DepthU) for
MX kernels, making `NoTailLoop = (ASEM % DepthU == 0)` always True and
silently rounding K up to a multiple of DepthU on the host side. With
the subtile K%32 tail loop now wired in (subsequent commit), the subtile
MX path can use the same ASEM=32 minimum as the rest of the MX path.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 61e236a)
Lifts the subtile K constraint from `K % DepthU == 0` to `K % 32 == 0`
by adding a real K%32 tail loop to the subtile emit path. The tail
scaffold lives in `KernelWriter._emitTailLoopScaffoldSubtile`; it
emits the LoopCounterL early-exit, per-lane kPosBase, re-issued
DepthU-shaped GR + LR for A/B (and MXSA/MXSB), per-mmak
`v_cmp_ge_i32` + per-MFMA `v_cndmask_b32` lane-mask, and MFMAs into
the existing D accumulators. The body runs as a single pass; the
legacy closeLoop's per-iter decrement is short-circuited by zeroing
LoopCounterL beforehand.

PGR>0 is gated by three mutually-exclusive entry paths keyed off a
snapshot of OrigLoopCounter (= K // DU) taken before
`calculateLoopNumIter` resets it:
  - origCounter == 0:        c=0 reset (zero accD, undo preLoop
                             GR_INC SRD-advance + LWA-XOR for
                             PGR>=2), fall through to tail body.
  - 0 < origCounter < PGR:   small-counter LWA realign (XOR back to
                             buf 0 to match LR; no SRD advance).
  - origCounter >= PGR:      large-counter +1 DU SRD advance to land
                             at K_aligned.
The PGR=0 mainloop gets an underflow pre-guard
(`s_cmp_eq_u32 LoopCounterL, 0` + branch to SkipMainloop) so the
do-while body is bypassed when K < DepthU.

`KernelWriterAssembly.closeLoop`'s legacy LRO/LWA damage-recovery
references `vgprLocalReadAddrA/B` symbols subtile kernels never
define, so it is bypassed for `UseSubtileImpl`. The host-side MX
scale initializer re-scatters the canonical DGen layout into the
padded mxsa/mxsb tensor descriptor on gfx950 so the CPU reference
observes the same scale bytes the GPU consumes.

Other touch-ups:
  - `Subtile/Kernel.py`: name the 4-tiles-per-VGPR scale packing
    constant (`MX_SCALE_TILES_PER_VGPR`); split per-mmak setup from
    per-MFMA cndmask via `_emitTailKPosCmpSubtile` /
    `_emitTailLaneMaskApplySubtile`; drop the dead `preLoop`.
  - `SubtileScaleEmit.localReadDoScaleSubtile`: use
    `lrLocalSubtileGrid` (not the [1,1]-frozen `localSubtileGrid`)
    so all scale VGPRs get a ds_read on MT >= 128x128.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit bd0c7f9)
End-to-end yaml tests in `Tensile/Tests/common/gemm/gfx950`:
  - `subtile_bf16_tail.yaml`: BF16 tail across K_rem in {16, 32, 48, 64},
    DepthU in {64, 128}, PGR in {0, 1, 2}, StreamK on/off, MT128x128
    and MT256x256.
  - `subtile_mxfp4_tail.yaml`: MXFP4 tail across the K_rem partitions
    (partial-MFMA: 32, 64, 96, 160, ...; full-last-MFMA: 128, 256),
    DU=256 and DU=512, PGR / StreamK / bias / dest-type cells (~14
    kernels, ~80 problem runs).
  - `subtile_mxfp4_tail_smoke.yaml`: 1-kernel smoke variant.

Unit tests in `Tensile/Tests/unit`:
  - `_subtile_tailloop_fixtures.py`: shared kwa-builder, kernel-keys
    helper, tile-info population, and skiptoend wrapper used by
    `test_subtile_tailloop_emit.py` and the new tail-emit class in
    `test_SubtileBasedLogicalScheduler.py`.
  - `test_subtile_tailloop_emit.py`: tail-body content assertions
    (no SRD rewind on PGR=0, single-iter forcing, kPosBase init,
    lane mask on A/B/MXSA/MXSB, one ds_read per scale group, no
    LDS pre-zero, PGR>0 entry-gate branches).
  - `test_solution_subtile_tailloop.py`: solution-level gating
    (ASEM=32 + NoTailLoop=False for subtile MX).
  - `test_SubtileBasedLogicalScheduler.py`: PGR=0 mainloop underflow
    pre-guard structural assertions + tail-loop scaffold placement.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 4fb7e10)
The transA=true (A) and transB=false (B) code paths each had a copy of
the same loop that re-scatters DGen's canonical scale bytes into the
padded mxsa/mxsb descriptor strides. Extract the shared body into a
single static helper, `rearrangePaddedMXScaleLayout`, and call it from
both sides. No behavioral change; the helper preserves the existing
needsRearrange short-circuit and the per-batch byte layout.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 2a34490)
Solution.py rejects every subtile solution with `StreamK == 0`
(`"UseSubtileImpl=1 supports StreamK only (no support for GSU)"`).
Sections 2-6 of `subtile_mxfp4_tail.yaml` were pinned to
`StreamK: [0]` and therefore produced 0 valid solutions at
SolutionStructs, failing CI with `SystemExit: -1` after section 1
ran. Section 1 was hiding the bug behind `StreamK: [0, 3]`, which
silently drops the SK=0 half.

Change sections 2-6 to `StreamK: [3]` and update the
header / per-section comments to reflect the SK=0 silent filtering
in section 1 and the SK=3-only reality elsewhere. No other yaml
content changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 1d68fc2)
Two reviewer-requested improvements on top of the K%32 tail-loop work:

1. Skip the preLoop GR block when origCounter == 0.

   `_kernelBody` now wraps the `if PrefetchGlobalRead:` block with an
   `OrigLoopCounter == 0` gate (Subtile + PGR>0 only) that branches
   past the wait-for-PGR, preLoop local write, LWA swap, and PGR>=2
   prefetch GR/GR_INC sequence to a fresh `SkipPreLoopGR<loopChar>`
   label placed just before the unrolled loop opens.

   With the gate, SRDs and LWAs stay at `setupNewTile`'s defaults when
   K < DepthU, so the tail scaffold's c=0 reset path collapses to a
   single `initVgprTilesToZero` -- the previous PGR>=2 "undo preLoop
   GR_INC" SRD subtract + LWA XOR loops are removed. The tail body's
   re-issued GR + LR runs from the same setupNewTile state on every
   K_rem < DepthU problem regardless of PGR.

2. Configurable MX scale padding fill byte.

   New `GlobalParameters["MXScalePadByte"]` (default 0, mirrors current
   behaviour) plumbed through:
     - `ClientWriter.py` -> `--mx-scale-pad-byte <N>` (only emitted
       when non-zero, keeps run.sh and CI calls unchanged in default
       configurations).
     - `client/main.cpp` registers `--mx-scale-pad-byte` (0..255).
     - `DataInitialization` stores `m_mxScalePadByte` and uses it for
       the bulk `memset` of `pristineE8A/B.cpuInput.valid`, for the
       inner `memset` in `rearrangePaddedMXScaleLayout`, and for the
       initial fill of the preswizzled `gpuScaleBuf`.

   Setting `MXScalePadByte: 255` (E8M0 NaN/Inf) seeds every byte of
   MX scale padding with a poison value; any A/B lane-mask leak in
   the GPU kernel into the padded region then propagates inf/NaN
   through the MFMA and trips validation instead of silently being
   multiplied by 0.

Validation: subtile_mxfp4_tail.yaml passes locally with the default
(MXScalePadByte=0, exercises Comment 1's gate across PGR=[0,1,2] and
K_rem in [32, 96, 128, 160, 224, 288, 384]) and with MXScalePadByte=255
(50/50 problem runs PASSED, confirming the kernel's per-MFMA lane mask
correctly suppresses padded scale lanes).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 24369f3)
Set `MXScalePadByte: 255` (E8M0 0xFF = NaN/Inf) on the three subtile
MX-format yamls so CI actively exercises the new poison-padding option:

  - subtile_mxfp4.yaml (aligned-K, M/N edge coverage)
  - subtile_mxfp4_tail.yaml (K%32 tail coverage, all PGR variants)
  - subtile_mxfp4_tail_smoke.yaml (fast tail smoke)

Any A/B/MXSA/MXSB lane in the GPU kernel that reads a padded scale
position will now pull an E8M0 NaN, propagate it through the MFMA into
accD, and trip validation -- both for K-direction padding (subtile
tail-loop mask) and for M/N-direction padding (wave-tile OOB lanes).

Locally re-verified that all three yamls still pass with the poison
padding active (no spurious failures introduced).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit c7009b4)
Followups on top of "subtile: address PR review for K-tail loop"
addressing the remaining nakajee review comments on PR #7636 and
fixing the unit-test failures introduced by the prior commit.

1. Comment 1 (no special OrigCounter==0 treatment in tail loop):
   move the c=0 skip gate fully upstream into `kernelBodySubtile`.
   The new `SkipSubtileMainLoop<L>` gate wraps the scheduler-emitted
   mainLoop (preLoop + mainloop + NGLL + NLL) and bypasses all of
   them when `OrigLoopCounter == 0`. With the gate in place, SRDs
   stay at K=0, LWA/LRA stay at buf 0, and accD stays at zero from
   `initVgprTilesToZero`, so the tail body just runs from
   setupNewTile's defaults -- no undo needed.

   The legacy `PGRTailC0Reset<L>` block in `_emitTailLoopScaffoldSubtile`
   (which used to zero accD, subtract one DU from each Srd<tc>, and
   XOR LWA back to buf 0 for PGR>=2) is removed entirely. The c=0
   compare now branches straight into `PGRTailEntry<L>`.

   The legacy `SkipPreLoopGR<L>` gate that lived in the obsolete
   `_kernelBody` path is removed (subtile kernels never go through
   `kernelBody`; they use `kernelBodySubtile`).

2. Comment 3 (move kPosBase setup after GR, before wait):
   relocate the per-lane `kPosBase = tidInK * numMIInUnroll` vector
   remainder / divide / multiply chain to fire between the tail's
   global-read issue and the post-GR `s_waitcnt`. The vector math
   now overlaps with buffer-load memory latency instead of
   serializing in front of it.

3. Comment 4 (dedupe per-MFMA cndmask):
   the previous tail body emitted one `v_cndmask_b32` per
   (mmak, mma1, mma0) MFMA over each ValuA/B input vgpr. Because
   ValuA depends only on `(mma0, mmak)` and ValuB only on
   `(mma1, mmak)`, the same A vgprs were cndmasked `len(mma1)`
   times and the same B vgprs `len(mma0)` times per mmak.

   The cndmask chain is hoisted out of the inner (mma1, mma0) MFMA
   grid into the per-mmak block, with a `seenVgpr` set tracking
   which ValuA/B/MXSA/MXSB vgprs have already been masked. Each
   unique vgpr now gets exactly one `v_cndmask_b32` per mmak. The
   downstream MFMA grid runs cndmask-free.

   The now-redundant `_emitTailLaneMaskApplySubtile` helper is
   removed (its body is inlined into the scaffold).

4. Comment 5 (clamp Srd<tc>+2 at tail entry):
   documented as a follow-up TODO at the tail GR site. The buffer
   NumRecords field is a single linear-byte limit, so a tight
   per-row K clamp is only achievable for the last M-row of the
   tile. The per-MFMA lane mask remains the actual correctness
   mechanism for in-range M rows. Filed for a follow-up PR.

5. CI fix:
   update `test_subtile_tailloop_emit.py`:
     - `test_emits_c0_reset_compare_and_branch` (PGR=1, PGR=2): now
       asserts the c=0 compare branches to `PGRTailEntry<L>`
       (not `PGRTailC0Reset<L>`) and that the legacy reset block is
       absent.
     - Replace `test_emits_c0_srd_subtract_with_borrow`,
       `test_emits_c0_lwa_xor_realign`,
       `test_emits_c0_reset_label_and_accD_zero`,
       `test_omits_c0_srd_subtract_pgr1`,
       `test_emits_c0_accD_zero_pgr1`,
       `_extract_c0_block`
       with negative-assertion tests `test_omits_c0_reset_label`,
       `test_omits_c0_srd_subtract`, `test_omits_c0_lwa_xor_undo`,
       `test_omits_c0_reset_label_pgr1`,
       `test_omits_c0_undo_instructions_pgr1` that pin the legacy
       block / instructions as absent.

Validation: full unit suite green (`pytest Tensile/Tests/unit/`:
821 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 664e471)
Two nakajee review comments on `subtile_mxfp4_tail.yaml`:

1. Section 3 (DU=512): "Would you please add more sizes for DU512?
   (similar coverage as section 2 above)"

   Section 3 previously had just K_rem in {32, 256} — a proof-of-life
   only. Expand to mirror Section 2's K_rem coverage (which exercises
   partial-first-MFMA, full-last-MFMA, partial-near-full, and the
   N*DU + tail entry-gate paths). New K values for DU=512:
     - K_rem=96, 128, 224, 384 (K_rem < DU coverage)
     - K = 1*512 + 128 = 640 (1 main iter + full-last-MFMA tail;
                              PGR=2 small-counter realign path)
     - K = 2*512 + 32 = 1056 (2 main iters + tail; PGR=2
                              large-counter SRD-advance path)

   Section 3 now runs 8 K_rem × PGR={0,2} = 16 problem runs (was 4).

2. Section 1 decorative tail: "Could we have some more iteration cases?
   2 iteration + Tail, 3 iteration + Tail"

   Section 1 previously only had K = 1*DU + 32 = 288 (1 main iter +
   tail; PGR=2 small-counter realign). Add the multi-iter analogues:
     - K = 2*256 + 32 = 544 (PGR=2 large-counter SRD-advance)
     - K = 3*256 + 32 = 800 (PGR>=2 large-counter, NLL drained)

   Adds 2 K values × PGR={0,1,2} = 6 problem runs.

Header coverage summary and Section 3's banner updated to match the
new content. YAML parses cleanly; existing `Tensile/Tests/unit/`
subtile suite stays green (116 passed, 1 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 210a5bb)
CI hipBLASLt build on PR #7636 fails with:

  RuntimeError: Unknown assertion key: AssertFree1DivByMT1LowbitGT1

at `Contractions.py:ProblemPredicate.FromOriginalKeyPair` while
`parseLibraryLogicData` walks a gfx950 logic YAML.

Root cause: PR #7443 ("manual revert KRingShift") removed the parser
handlers for `AssertFree1DivByMT1LowbitGT1` /
`AssertKRingShiftTailWrapOnly`, and PR #7513 cleaned the matching
keys out of the hipBLASLt / hipSPARSELt library logic YAML files.
Both reverts are in our branch's history, but origin/develop has new
gfx950 logic YAMLs (e.g. `gfx950/gfx950_128cu/Equality/
gfx950_Cijk_Alik_Bljk_HHS_BH_BiasSH_HAS_SAV_UserArgs.yaml`, added by
#7125 "update cucount") that were generated *before* the revert pair
and therefore still carry the deprecated keys. When CI merges our
branch with origin/develop the merged tree contains a YAML the
parser can no longer read.

Fix: silently ignore the two deprecated keys in
`ProblemPredicate.FromOriginalKeyPair`. Matches the spirit of the
revert + cleanup pair without requiring follow-up YAML cleanup in
develop, and is forward-compatible against any future stale YAML
that slips through the same gap. Genuinely unknown `Assert*` keys
still raise loudly.

New unit tests in `test_Contractions_deprecated_asserts.py` pin the
silent-ignore behavior for the two specific keys, verify that
unrecognized `Assert*` keys still raise, and check that the existing
known `AssertFree0/1ElementMultiple` + `AssertSummationElementMultiple`
predicate path is unaffected.

Validation: - new tests: 7 passed
  - full unit suite: 828 passed, 5 skipped, 2 xfailed (was 821 passed)
  - end-to-end smoke: drove `FromOriginalKeyPair` on a representative
    stale-key entry set; no `RuntimeError` raised; recognized keys
    still produce real predicates.
Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 9d034c9)
Addresses the latest nakajee review comment on PR #7636:

> module.add(SMovB32(dst=sgpr("LoopCounterL"), src=0,
>     comment="single-iter tail: force closeLoop fall-through"))
> module.add(self.closeLoop(kernel, tensorParametersA, tensorParametersB,
>     -1, finalLoop=True))
>
> Seems like closeLoop here is not necessary.

The subtile tail body processes the entire K_tail in a single pass
via the `mmak` loop (every lane mask is emitted against the current
`LoopCounterL = K mod DU` snapshot), so `closeLoop(... finalLoop=True)`
would only contribute dead code:

  - per-iter `s_sub_i32 LoopCounterL, ..., MIK` decrement,
  - `s_cbranch_scc0 TailLoopBeginL` back-edge (no-op since we
    intentionally underflowed LoopCounterL to negative),
  - `OrigLoopCounter += MIK` increment (only consumed by the legacy
    LRO-damage recovery block, which is already bypassed for
    `UseSubtileImpl=1` kernels inside `closeLoop`),
  - `TailLoopEndL:` label (nothing branches to it — the only
    early-exit target is `SkipTailLoopL:`, emitted by the surviving
    `closeLoop(... emitEndLabelOnly=True)` call).

Dropping the `closeLoop(... finalLoop=True)` call also removes the
need for the `s_mov_b32 LoopCounterL, 0` workaround that was there
purely to force the now-removed back-edge to fall through.

Net asm delta per kernel (subtile tail body):
  - 1 `s_mov_b32 LoopCounterL, 0`
  - 1 `s_sub_i32 LoopCounterL, LoopCounterL, MIK`
  - 1 `s_add_u32 OrigLoopCounter, OrigLoopCounter, MIK`
  - 1 `s_cmp_le_i32 LoopCounterL, 0`
  - 1 `s_cbranch_scc0 label_TailLoopBeginL`
  - 1 `label_TailLoopEndL:` label
All removed.

Test updates:
  - `test_emits_loop_counter_zero_before_closeloop` (PGR=0) is
    inverted into `test_omits_closeloop_emit`: asserts that the
    `s_sub_i32 LoopCounterL ... dec counterL` decrement, the
    `s_mov_b32 LoopCounterL, 0` workaround, and `TailLoopEndL` are
    all absent from the tail block.
  - `test_pgr2_loop_counter_zero_before_closeloop` (PGR=2) is
    inverted into `test_pgr2_omits_closeloop_emit` with the same
    set of negative assertions.
  - `_emitTailLoopScaffoldSubtile` docstring updated to enumerate
    the three PGR>0 entry-gate paths (origCounter==0 / small /
    large) and to document that closeLoop is intentionally not
    emitted by the subtile scaffold.
  - `TestEmitAllLoopsTail_PGR0` class docstring and
    `test_omits_tail_when_NoTailLoop_true` docstring updated to
    note that `TailLoopEndL` is never emitted (was previously only
    asserted absent under NoTailLoop=True).

Full unit suite: 828 passed, 5 skipped, 2 xfailed.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 11173ca)
Addresses the second-to-last open nakajee review comment on PR #7636:

> module.add(SCmpLtU32(
>     src0=sgpr(savedOrigCounterSgpr),
>     src1=pgr,
>     comment="origCounter < PGR?"))
> module.add(SCBranchSCC1(
>     ...
>
> We should not need special treatment for LoopCounter < 2 case.
> SrdA/B/MXSA/MXSB should be already adjusted in preloop.

Background
----------

For PGR=2 origCounter==1 the previous tail scaffold emitted a
"small-counter realign" block that XOR'd `LocalWriteBaseAddr{A,B,
MXSA,MXSB}` back with their `Swap<tc>` masks to re-align LWA with
the LDS buffer NLL drained from. The misalignment came from preLoop:

    GR(MT 0)
    GR_INC                <-- SRD advance + LWA toggle to buf 1
    Wait + Sync
    LR drains MT 0 from buf 0
    SkipOp(LE 1, NLL)     <-- for origCounter==1, branches here
    GR(MT 1) writes buf 1
    SkipOp(LE 2, NGLL)

The `GR_INC`'s LDS toggle still fired even when `SkipOp(LE 1, NLL)`
bypassed `GR(MT 1)`, so on the c==1 path NLL ended with LWA at buf 1
while LR was still at buf 0 -- the tail had to XOR LWA back.

Fix (per the review)
--------------------

Split the compound `GRIncOp` into two ops just for preLoop's use:

  - `GRPtrIncOp`:  SRD pointer advance only.
  - `GRLDSwapOp`:  LDS double-buffer XOR only.

The mainloop keeps the compound `GRIncOp` (its two halves are always
paired one-to-one with an iteration). preLoop is restructured so the
SRD advance fires before `SkipOp(LE 1, NLL)` (the tail still needs
SRD at K=DU for c==1) and the LDS swap fires AFTER the SkipOp, right
before `GR(MT 1)`:

    GR(MT 0)
    gr_ptr_inc  (per tensor)        <-- SRD advance, always fires
    Wait + Sync
    LR drains MT 0 from buf 0
    SkipOp(LE 1, NLL)               <-- branch over both lds_swap
                                       and GR(MT 1) for c==1
    gr_lds_swap (per tensor)        <-- LDS toggle, skipped for c==1
    GR(MT 1) writes buf 1
    SkipOp(LE 2, NGLL)

With the swap gated by the same SkipOp that gates GR(MT 1), LWA and
LR stay aligned for every origCounter:

  - c==0:    upstream SkipSubtileMainLoop gate skips all of preLoop;
             LWA at initial buf 0, LR untouched.
  - c==1:    SkipOp(LE 1) taken -> no LDS toggle; LWA at buf 0,
             LR at buf 0 (NLL drained MT 0 from buf 0).
  - c==2:    SkipOp(LE 1) not taken, LDS toggle fires (LWA->buf 1),
             GR(MT 1) writes buf 1; SkipOp(LE 2, NGLL) taken;
             NGLL/NLL drain MT 1 from buf 1; LR ends at buf 1.
  - c>=3:    SkipOp(LE 1)/SkipOp(LE 2) both not taken; mainloop
             runs (c-2) iters, each toggling LWA and LR; iter-pair
             cancellation keeps LWA and LR aligned at exit.

Tail-scaffold simplification
----------------------------

With the preLoop fix in place, the tail's small-counter realign
block (`SXorB32 LocalWriteBaseAddr<tc>, ..., Swap<tc>`) becomes
unreachable / dead and is removed. The c==0 and small-counter
compares also collapse into a single `s_cmp_lt_u32 origCounter,
PGR` -- both cases need no SRD/LWA fix-up, only the large-counter
case (`origCounter >= PGR`) still needs the existing +1 DU SRD
advance to undo the per-iter GR_INC off-by-one at loop exit.

Net asm delta per PGR=2 kernel (subtile tail body):
  - 1 `s_cmp_eq_u32 origCounter, 0` compare      [removed]
  - 1 `s_cbranch_scc1 PGRTailEntry` (c=0 branch)  [removed]
  - `PGRTailSmallCounterRealign<L>:` label       [removed]
  - 4 `s_xor_b32 LocalWriteBaseAddr<tc>, ...,
    Swap<tc>` (per A/B/MXSA/MXSB)                [removed]
  - 1 `s_branch PGRTailEntry` (large-counter
    fall-through bypass)                          [removed]

Test updates
------------

  - LogicalScheduler.py: new `GRPtrIncOp` and `GRLDSwapOp` dataclasses
    with `gr_ptr_inc` / `gr_lds_swap` kinds; `build_preloop` PGR=2
    branch uses the split pair around `SkipOp(LE 1, NLL)`.
  - InstructionEmitter.py: new `emit_gr_ptr_inc` (ptr-only) and
    `emit_gr_lds_swap` (lds-only) handlers wired into the dispatch
    table.
  - KernelWriter.py `_emitTailLoopScaffoldSubtile`: small-counter
    realign block removed; c=0 and `<PGR` compares folded into one
    `s_cmp_lt_u32 origCounter, PGR`; docstring + inline comments
    updated.
  - test_subtile_tailloop_emit.py: invert
    `test_emits_small_counter_lwa_realign` to assert absence;
    replace `test_emits_c0_reset_compare_and_branch` (PGR=1 & PGR=2)
    with `test_emits_lt_pgr_compare_and_branch`; rename
    `test_emits_small_counter_compare` to
    `test_emits_lt_pgr_compare_immediate` and update its docstring;
    refresh class / module docstrings.
  - test_SubtileBasedLogicalScheduler.py: extend
    `TestBuildPreloop.test_256x256_fp4` to require both
    `gr_ptr_inc` and `gr_lds_swap` (and forbid the legacy
    `gr_inc`); add a strict ordering invariant pinning
    `gr_lds_swap` AFTER `skip(LE:1:NLL)`.
  - test_SubtileBasedSchedulerRef.py: refresh the three preloop
    golden references (256x256 FP4 1x1; 320x320 BF16 1x5 offset1
    and offset_all) to show `gr_ptr_inc` ... skip ... `gr_lds_swap`
    in place of the old `gr_inc` ... skip ... shape.

Validation: full unit suite green (828 passed, 5 skipped, 2 xfailed).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 78e5093)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants