[Hipblaslt] [Subtile] Tail-loop support for BF16 and MXFP4 by sebvince · Pull Request #7683 · ROCm/rocm-libraries

sebvince · 2026-05-21T14:21:55Z

Summary

Adds tail-loop support to the gfx950 subtile path for BF16 and MXFP4, removing the prior K-must-be-multiple-of-DepthU restriction. Subtile kernels now accept any K and emit a per-PGR tail loop that handles the remainder.

What this enables

Before this PR, subtile kernels required K % DepthU == 0. Any K with a remainder fell off the subtile path (or was rejected at Solution time). After this PR:

BF16 / BSS / BBS: any K accepted.
MXFP4 (F4 + MXBlock=32): any K that is a multiple of 32 (the MX block size).

Code change

Produce PGR0 styled schedule in the Logical scheduler for the Tail loop.

Mask emit: per-K-block lane masks precomputed once and reused across sub-K iterations
Sub-VGPR masking: K%8==0 fast path avoids per-lane select
M/N dedup: shared mask application across the M and N axes
Multi-partition VGPR reuse: tile VGPRs reused across partitions to keep register pressure flat
K<DU skip: jumps directly into the tail loop when no main iteration fits
Early exit on each subIterK
Used flat single partition for tailloop

Tests

subtile_bf16.yaml: TAIL-loop sweep (K=1..128 step 1, every ASEM bucket) on MT 128x128 / 320x320 / 96x128 (odd M wavetile) across DU=64/128, PGR=0/1/2, BBS+bias+hipblaslt_all+ScaleAlphaVec
subtile_mxfp4.yaml: TAIL-loop sweep (K=32..512 step 32) on MT 128x128 / 64x64 across DU=256/512, PGR=0/1/2, F4BS+bias+hipblaslt_all+ScaleAlphaVec

Test plan

subtile-test bf16: PASSED=15066, FAILED=0 Rejected: 294
subtile-test mxfp4: PASSED=2883, FAILED=0 Rejected: 531

The K-tail scaffold's per-mmak early exit emits `s_cmp_le_u32 LoopCounterL, MIK*(subIterK+1)` with the threshold as a compile-time literal. gfx950's VOPC / SOPC inline-constant range is [-16, 64]; literals past 64 land in the assembler's 32-bit literal slot, which some opcodes either reject or silently mis-encode. The threshold exceeds 64 for typical bf16 (MIK=32, subIterK >= 2) and for FP4 (MIK=128) at every mmak >= 0, so the existing emit was already at risk on those configs even though our test fixtures happened to cover only the inline range. Adds `_emitSubtileScalarCmpLitOrStaged` which, for a given cmp op / src0 / literal, emits the cmp directly when the literal fits in the inline range, and otherwise stages it through a scratch sgpr via `s_mov_b32 sScratch, <lit>` first. The per-mmak early exit now goes through this helper. Idea ported from sebvince's #7683 commit 6d7e743 (`Dont use literal for cmp when diff >64`). Test additions in `test_subtile_tailloop_emit.py`: - `test_per_mmak_early_exit_inline_when_consumedK_fits`: bf16 fixture (consumedK=32) must use the inline form. - `test_per_mmak_early_exit_staged_when_consumedK_exceeds_inline`: FP4 fixture (consumedK >= 128) must stage every cmp. - `test_per_mmak_early_exit_boundary_consumedK_64`: custom bf16 DU=128 fixture pins the inclusive upper bound (32, 64 inline; 96 staged). - `test_per_mmak_early_exit_threshold_progression` updated to accept both inline and staged forms via a new helper `_collect_per_mmak_early_exit_thresholds` that pairs each cmp with its preceding `stage literal ...` s_mov_b32 by sgpr name. End-to-end verification on subtile_bf16.yaml shows the production emit using 0x20 / 0x40 inline and staging 0x60..0xe0 through sgprs; both subtile_bf16.yaml and subtile_bf16_anyk_largemt.yaml pass with the change. Unit suite: 871 passed / 5 skipped / 2 xfailed (baseline 868). Co-authored-by: Cursor <cursoragent@cursor.com>

The byte-refine K-tail mask path previously emitted the full per-(operand, ir) chain (`v_mov seed` + per-mod runtime-gated `v_cmp + v_cndmask + v_and`) inside the per-mmak loop body. The chain is independent of A/B tile data — it depends only on `kPosBase + mmak*MIK + ir*elementsPerVgpr` and `LoopCounterL` — so it can be hoisted out once before the loop and the per-mmak step collapses to pure `v_and_b32 vIdx, vMask[..], vIdx`. This commit lands that hoist on the byte-refine path: * `_emitTailSubLaneMaskChainIntoVgpr`: extracted chain emitter that writes its mask to a caller-owned target VGPR (mirrors the body of `_emitTailSubLaneMaskRefineSubtile._emitChain`; keeps the bpe-parametric mod chain, the static `ASEM*bpe % bpr == 0` skip, and the runtime `LoopCounterL & (elementsPerVgpr-1)` gate). * `_emitTailSubLaneMaskPrecomputeSubtile`: allocates one scratch VGPR per (operand, mmak, ir) and emits the chain into each. When `bpeA == bpeB` (the common bf16/bf16 case) A and B share the same mask VGPR per (mmak, ir), halving the persistent VGPR cost. Returns the `(operand, mmak, ir) -> vgpr` lookup plus the list of VGPRs to release after the loop. * `_emitTailSubLaneMaskApplySubtile`: per-mmak v_and-only apply step. Comment carries `apply precomputed mask to ...` so the new shape is distinguishable from the legacy inline emit. `_emitTailLoopScaffoldSubtile` now precomputes the mask map once before the per-mmak loop and calls apply per mmak. Cleanup of the precomputed VGPRs happens after the loop completes (the runtime per-mmak early-exit branches past the cleanup so emit-time pool accounting balances). Gated by `kernel.get("SubtileTailMaskPrecompute", True)`. Setting it False reverts to the legacy `_emitTailSubLaneMaskRefineSubtile` per-mmak inline chain. The coarse path (ASEM>=numMIInUnroll, MX scale) is untouched by this commit. Scope: * Byte-refine path ONLY (ASEM<numMIInUnroll, integer bpe, no MX). The user-flagged risk of regressing the per-mmak LR + MFMA interleave on MT320x288 is avoided by NOT extending the precompute to the coarse path: that path's higher tile-grid counts would inflate the persistent VGPR cost on the high-water configs. * MT320x288 (subtile_bf16_anyk_largemt): byte-refine kernels stay at .vgpr_count=249 (same as the per-mmak-LR baseline); the +numMmaks*vgprPerInUnroll persistent mask VGPRs (8 for DU=64) do not push past the D-tile high-water. Tests in `test_subtile_anyk_emit.py`: * `TestAnyKEmit_Precompute`: new class pinning the hoist (precompute block sits BEFORE the per-mmak ds_read wait; apply block contains no cmps/cndmasks), the A/B mask VGPR dedupe (bf16/bf16 → one VGPR per (mmak, ir) shared across operands), the config switch behaviour (off reverts to legacy inline shape), the NoTailLoop no-op, and the coarse-path identity (ASEM=32 emit byte-identical with the switch on vs off). * `TestAnyKEmit_K2::test_k2_emits_per_operand_byte_refine`, `TestAnyKEmit_K1::test_k1_no_narrow_load_and_emits_partial_chain`: updated to recognise the new `apply precomputed mask to ...` apply comment and to verify the chain/gate live in the precompute prefix rather than per-mmak. Idea ported from sebvince's #7683 commit a0fee26 (`Precompute mask for all subIterK`) with subsequent refinements (e9f5f55, d999a28, 4a4d6a5). Sebvince's framework computes per-iteration masks (BF16 3-state blend + boundary precompute); this port keeps our existing bpe-parametric mod chain and just moves the emit-site from inside the loop to before it. Unit suite: 876 passed / 5 skipped / 2 xfailed (baseline 868; +3 from Ask 1, +5 from Ask 2 precompute pins). Yaml smokes (all PASS): - subtile_bf16 - subtile_bf16_tail - subtile_bf16_anyk_k2 - subtile_bf16_anyk_odd - subtile_bf16_anyk_k8 - subtile_bf16_anyk_largemt (MT320x288 byte-refine, vgpr=249) - subtile_mxfp4_tail Co-authored-by: Cursor <cursoragent@cursor.com>

codecov-commenter · 2026-05-21T16:42:35Z

Codecov Report

❌ Patch coverage is 56.13718% with 243 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py	1.00%	198 Missing ⚠️
...ite/Tensile/Components/Subtile/LogicalScheduler.py	86.00%	23 Missing and 5 partials ⚠️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py	7.14%	13 Missing ⚠️
...e/Tensile/Components/Subtile/InstructionEmitter.py	97.84%	1 Missing and 2 partials ⚠️
...lt/tensilelite/Tensile/SolutionStructs/Solution.py	0.00%	1 Missing ⚠️

❌ Your project status has failed because the head coverage (27.58%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff              @@
##           develop    #7683       +/-   ##
============================================
- Coverage    61.87%   30.35%   -31.51%     
============================================
  Files         2086      161     -1925     
  Lines       357038    67735   -289303     
  Branches     53806    13045    -40761     
============================================
- Hits        220892    20560   -200332     
+ Misses      117348    45380    -71968     
+ Partials     18798     1795    -17003

Flag	Coverage Δ
TensileLite	`27.58% <56.14%> (+1.64%)`	⬆️
hipBLAS	`?`
hipBLASLt	`41.27% <ø> (ø)`
hipCUB	`?`
hipDNN	`?`
hipFFT	`?`
hipRAND	`?`
hipSOLVER	`?`
hipSPARSE	`?`
rocBLAS	`?`
rocFFT	`?`
rocRAND	`?`
rocSOLVER	`?`
rocSPARSE	`?`

Files with missing lines	Coverage Δ
...ects/hipblaslt/tensilelite/Tensile/KernelWriter.py	`9.43% <ø> (-0.15%)`	⬇️
...lt/tensilelite/Tensile/SolutionStructs/Solution.py	`7.19% <0.00%> (-0.12%)`	⬇️
...e/Tensile/Components/Subtile/InstructionEmitter.py	`96.74% <97.84%> (+3.83%)`	⬆️
...t/tensilelite/Tensile/Components/Subtile/Kernel.py	`62.39% <7.14%> (-1.55%)`	⬇️
...ite/Tensile/Components/Subtile/LogicalScheduler.py	`87.68% <86.00%> (+0.47%)`	⬆️
...blaslt/tensilelite/Tensile/KernelWriterAssembly.py	`7.38% <1.00%> (-0.14%)`	⬇️

... and 1946 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

In the subtile tail scaffold, the PGR>0 SRD-advance / LWA-XOR block (which undoes the mainloop-exit's last `gr_inc + lw_swap` so the tail's GR+LR can re-issue at K_aligned with LWA/LR in sync) sat AFTER `calculateLoopNumIter(-1)`, i.e. on the not-taken side of the K%DU==0 early-exit cmp/branch to SkipTailLoopL. Sebvince's #PR-7683 design (`build_tailloop_pgr0` + `Subtile/ Kernel.mainLoop`) emits the analogous mainloop-exit code as part of `emitMainAndExitLoops`, BEFORE the orchestrator calls `calculateLoopNumIter(-1)` and `emitTailLoop`. Nakajee's MT320x320 side-by-side flagged sebvince's placement as preferable for two reasons: - The SRD-add / LWA-xor chain has small dependency lengths; on sebvince's side it co-issues with calculateLoopNumIter's scalar divide-and-remainder, whereas our previous placement serialized it after the cmp+branch (a hard dependency on the scc / branch resolution). - The OrigLoopCounter snapshot SGPR our scaffold held across `calculateLoopNumIter` (because it resets OrigLoopCounter to 0 at the start of the tail-number-of-iters compute) becomes unnecessary once the gate runs before that reset -- the gate can read `sgprOrigLoopCounter` directly. One fewer SGPR live across the tail-entry block. This commit moves the PGR>0 gating block (origCounter == 0 c=0 branch + 0 < origCounter < PGR small-counter realign + origCounter >= PGR large-counter SRD-advance) to ABOVE `calculateLoopNumIter` and drops the `savedOrigKAlignDUs` snapshot SGPR. Safety: on the early-exit path (K%DU == 0 -> SkipTailLoopL) the now-unconditional SRD-advance / LWA-XOR is harmless -- the post-tail write-out reads SrdC/SrdD, never SrdA/B/MXSA/MXSB or LocalWriteBaseAddr. Test updates in `test_subtile_tailloop_emit.py`: - `test_emits_origCounter_snapshot{PGR1,PGR2}` -> renamed and flipped to negative pins `test_omits_origCounter_snapshot_after_hoist` (the snapshot s_mov_b32 must NOT appear). - New `test_gating_reads_OrigLoopCounter_directly`: the c=0 cmp src0 must spell out `s[sgprOrigLoopCounter]`. - New `test_gating_block_precedes_TailLoopBeginL`: structural pin that the c=0 origCounter cmp appears BEFORE both the `skip to end of tail loop b/c numIter==0` early-exit cmp and the `TailLoopBeginL` label. - `test_emits_small_counter_compare`: updated to require the `s_cmp_lt_u32 s[sgprOrigLoopCounter], 1` form (no anonymous snapshot sgpr). Idea ported from sebvince #7683 (`build_tailloop_pgr0` design / `Subtile/Kernel.mainLoop` wrapper order). Nakajee's MT320x320 side-by-side comparison flagged this placement difference and preferred sebvince's; this commit lands the move. Unit suite: 879 passed / 5 skipped / 2 xfailed (baseline 877; +2 from the new gating positive pins). Yaml smokes (all PASS): - subtile_bf16 - subtile_bf16_anyk_odd - subtile_bf16_anyk_largemt (MT320x288 byte-refine) Co-authored-by: Cursor <cursoragent@cursor.com>

The byte-refine K-tail mask precompute previously emitted one bpe-parametric `byteRefine[<op> ir=N mmak=M]` chain per (operand, mmak, ir): seed `0xFFFFFFFF` mov + per-mod (mod chain length 2 for bf16, 4 for fp8) `v_add + v_cmp_ge_i32 + v_cndmask_b32 vSeed` + mod=0 close. For bf16 byte-refine that is 7 instructions per (mmak, ir) chain, multiplied by `numMmaks * vgprPerInUnroll` chains (e.g. 2*4 = 8 chains -> 56 instr on the DU=64 path, 4*4 = 16 chains -> 112 instr on the DU=128 path). Sebvince's #7683 design factors out the per-lane invariants once (`emit_mask_k_init`) and turns the per-subIterK chain into a pure 3-state diff/boundary cmp+cndmask (`emit_mask_k`). Nakajee's MT320x320 side-by-side review of #7661 vs #7683 flagged this as "much simpler" and asked to verify equivalence before adopting. Equivalence verification (sebvince's BF16 chain vs ours, with `numMIInUnroll = 8`, `kStride = 2`, `vgprPerInUnroll = 4`): laneK_0 := tidInK * numMIInUnroll (== our kPosBase) diff := LoopCounterL - laneK_0 (signed) effective_diff_n := diff - n*MIK (n = mmak / subIterK) d := LoopCounterL % numMIInUnroll (== effective_diff_n in the boundary range) For each (lane, mmak=n, vgpr i): sFull = diff > n*MIK + numMIInUnroll - 1 <=> effective_diff_n >= numMIInUnroll <=> ALL of this lane's K is in range -> mask = -1 sZero = diff <= n*MIK <=> effective_diff_n <= 0 <=> NONE of this lane's K is in range -> mask = 0 otherwise -> mask = boundary[i] where d <= i*2 -> 0 d == i*2+1 -> 0x0000FFFF (low bf16 in, high past) d >= i*2+2 -> -1 Truth-table check (tidInK=0, mmak=0, K_remain in {1..8}): K_rem=1: i=0 mask=0x0000FFFF (low K=0 in, high K=1 past) i=1..3 mask=0 K_rem=2: i=0 -1 ; i=1..3 mask=0 K_rem=3: i=0 -1 ; i=1 mask=0x0000FFFF ; i=2,3 mask=0 K_rem=4: i=0,1 -1 ; i=2,3 mask=0 K_rem=5: i=0,1 -1 ; i=2 mask=0x0000FFFF ; i=3 mask=0 K_rem=6: i=0,1,2 -1 ; i=3 mask=0 K_rem=7: i=0,1,2 -1 ; i=3 mask=0x0000FFFF K_rem=8: i=0..3 -1 (sFull fires for tidInK=0) Our existing chain (bpe-parametric mod chain) produces the same masks bit-for-bit for every entry above (and for tidInK > 0 the diff/sFull/sZero arithmetic and our `K_pos vs LoopCounterL` cmps algebraically coincide). Verified by hand on the full K_remain 1..8 x i 0..3 truth table. Caveats (all non-blocking for our gauntlet): * Sebvince's chain does NOT have our `(ASEM * bpe) % bpr == 0` static skip; the boundary[i] init runs unconditionally. For ASEM=2/4 (currently fast-path with static skip dropping the mod>0 chain) this emits a few more instructions, but the mask bit-pattern stays identical (d is even -> halfKeep branch never fires; only the full/zero outcomes are reachable). * Sebvince's chain assumes `bpeA == bpeB` (one shared boundary[i] per vgpr position, applied to both operands' tiles). All current gauntlet configs are symmetric bpe (`_subtileTailByteShiftApplies`'s integer-bpe gate plus `_emitTailSrdTightenSubtile`'s explicit `bpeA == bpeB` check). For asymmetric bpe (no current YAML) the dispatcher falls back to the legacy chain. * Storage: init holds `1 (diff) + vgprPerInUnroll (boundary)` = 5 persistent VGPRs for BF16 (numMIInUnroll=8 -> vgprPerInUnroll =4), on top of our existing `numMmaks * vgprPerInUnroll` precomputed masks. On MT320x288 byte-refine (`subtile_bf16_anyk_largemt`, baseline vgpr_count=249) the +5 fits well within the 256-VGPR budget; verified by re-running the yaml under the new emit (still PASS). This commit lands the adoption: * `_emitTailSubLaneMaskInitSebvince(kPosBaseVgpr, numMIInUnroll, bpe, vgprPerInUnroll)` -> emits `diff = sgprLoopCounterL - kPosBase` (signed v_sub_i32) + `d = sgprLoopCounterL & 7` + per-i `(d<hi) ? halfKeep : full ; (d<lo) ? 0 : prev` boundary cndmasks. Returns the persistent diff and boundary[i] vgprs. * `_emitTailSubLaneMaskChainIntoVgprSebvince(diffVgpr, boundaryMaskVgpr, ...)` -> per-(operand, mmak, ir): two cmps + two cndmasks. VOPC inline-range staging via `_subtileCmpSrc1FitsInline` for `mmak*MIK + ... > 64` (BF16 DU=128, mmak=2,3). * `_emitTailSubLaneMaskPrecomputeSubtile` dispatches: bpeA == bpeB == 2 && `SubtileTailMaskSebvinceForm=True` (default) -> sebvince form; otherwise -> legacy chain. Legacy chain retained for fp8 / int8 byte-refine (asymmetric bpe gates deferred) and as a reversibility escape hatch. Tests in `test_subtile_anyk_emit.py`: * `_emit_anyk_tail_asm` gains a `sebvinceForm` kwarg (default True) so individual tests can opt back into the legacy chain for regression coverage. * `TestAnyKEmit_K4::test_k4_sebvince_form_emits_init_and_per_mmak_chain` (replaces `test_k4_byte_refine_mod0_only`): pins sebvince init marker + per-(mmak, ir) sFull cndmask count + absence of the legacy mod>0 / mod=0 byteRefine chain seed. * `TestAnyKEmit_K4::test_k4_legacy_form_emits_mod0_only`: new regression pin for `SubtileTailMaskSebvinceForm=False` (legacy chain still emits the mod=0-only chain on ASEM=4). * `TestAnyKEmit_K2::test_k2_emits_sebvince_form_chain` (replaces `test_k2_emits_per_operand_byte_refine`): pins cmp count growth vs K%32 baseline + sebvince diff init + sFull/sZero pairing + `d = LoopCounterL % 8` init. * `TestAnyKEmit_K1::test_k1_no_narrow_load_and_emits_sebvince_chain` (replaces `test_k1_no_narrow_load_and_emits_partial_chain`): pins absence of narrow d16 load + sebvince init + boundary cndmask + per-(mmak, ir) sFull/sZero cndmasks; the chain sits in the precompute prefix (before per-mmak ds_read wait). * `TestAnyKEmit_Precompute::test_precompute_block_before_per_mmak_loop`: rewritten to pin sebvince diff init + per-mmak sFull cmps in the precompute section; apply section has no chain primitives. * `TestAnyKEmit_Precompute::test_precompute_hoisted_above_dtl_wait_and_barrier`: rewritten to pin sebvince `diff` + per-(mmak, ir) sFull/sZero markers above the DTL wait. Asm excerpt from `subtile_bf16_anyk_odd.yaml` MT128x128 DU=64 (`Cijk_Alik_Bljk_BSS_BH ... MT128x128x64`), the new precompute prefix (lines 1414-1453): ``` v_sub_i32 v12, s[sgprLoopCounterL], v11 // diff = LoopCounterL - kPosBase v_mov_b32 v13, 0x0000FFFF // halfKeep v_and_b32 v14, 7, s[sgprLoopCounterL] // d = LoopCounterL % 8 v_cmp_lt_i32 s[68:69], v14, 2 // boundary[0]: d < 2 ? v_cndmask_b32 v15, -1, v13, s[68:69] // boundary[0] = (d<2) ? halfKeep : full v_cmp_lt_i32 s[68:69], v14, 1 // boundary[0]: d < 1 ? v_cndmask_b32 v15, v15, 0, s[68:69] // boundary[0] = (d<1) ? 0 : prev ... boundary[1..3] (3 vgprs, same shape) ... v_cmp_gt_i32 s[68:69], v12, 7 // mmak=0 ir=0: sFull = diff > 7 v_cndmask_b32 v13, v15, -1, s[68:69] // mmak=0 ir=0 = sFull ? full : boundary[0] v_cmp_le_i32 s[68:69], v12, 0 // mmak=0 ir=0: sZero = diff <= 0 v_cndmask_b32 v13, v13, 0, s[68:69] // mmak=0 ir=0 = sZero ? 0 : prev ... mmak=0 ir=1..3 + mmak=1 ir=0..3 (8 per-(mmak, ir) chains) ... v_cmp_gt_i32 s[68:69], v12, 39 // mmak=1 ir=0: sFull = diff > 39 v_cmp_le_i32 s[68:69], v12, 32 // mmak=1 ir=0: sZero = diff <= 32 ... ``` The per-mmak apply step keeps the existing `apply precomputed mask to ValuA/B[idx]` v_and_b32 emit (one v_and per VGPR per operand, mask source-vgpr indexed into our per-(mmak, ir) precomputed pool). Unit suite: 880 passed / 5 skipped / 2 xfailed (baseline 877; +3 from Item 1 + Item 2 new pins). Yaml smokes (all PASS): - subtile_bf16 - subtile_bf16_tail - subtile_bf16_anyk_k2 - subtile_bf16_anyk_odd - subtile_bf16_anyk_k8 - subtile_bf16_anyk_largemt (MT320x288 byte-refine; +5 vgprs over baseline, well under 256-VGPR budget) The mxfp4 / mxfp4_tail / mxfp4_tail_smoke yamls go through the coarse cmp path (MX scales gate `_subtileTailByteShiftApplies` to False), so they are unchanged by this commit; verified by running them in-place and inspecting the post-tail emit. Idea ported from sebvince #7683 commits a0fee26 (`Precompute mask for all subIterK`), e9f5f55b (`Simplify emit_mask_k`), d999a288 (`Simplify emit_mask_k_init`), and 4a4d6a596 (`Remove hardcoded values`). Co-authored-by: Cursor <cursoragent@cursor.com>

This reverts commit 59230da.

github-actions Bot added project: hipblaslt project: hipsparselt ci:hipsparselt-fast labels May 21, 2026

assistant-librarian Bot added the organization: ROCm label May 21, 2026

sebvince added 21 commits May 27, 2026 08:34

Template for tailloop based on PGR0

c9265b9

Add unit test to display generated asm for tailloop

5951b05

Skip tailloop if not needed

a2898f8

Apply Mask / Kmod8Eq0

e5e83a4

SubVgpr Masking

92ee3b8

Deduplication along M/N

dd7bcbc

Reduce instruction count

a967787

Precompute mask for all subIterK

9f8dac3

as is

722a9c4

Skip to tailloop if K<DU

b1b16b7

Fix PGR1

ba571f2

Dont use literal for cmp when diff >64

78dfd27

Fix MX scale buffer stride mismatch in canonical CPU-ref init

49feb8a

Add tailloop tests to yaml files

b9002de

Remove mask_k_init and mask_k_done op

d07b5c4

Remove some tailloop some specific Ops

99e697e

Remove tail_srd_advance and tail_lr_inc

697008d

Add TAIL loop test for multi-partitions

be92cf2

Simplify emit_mask_k_init

85f8d33

Remove hardcoded valuesin emit_mask_k_init

6e25e8b

Simplify emit_mask_k

4e42c52

sebvince added 21 commits May 27, 2026 08:35

as is

7dfd63a

Revert "as is"

a931fa0

This reverts commit 59230da.

as is

cd75bdf

wave gate

0ff3e6c

Special case for x4 config

faaf28f

Remove loop and merge vAddr for 4x case

6c5947b

Enable logic on B as well

ba5c9af

Merge A B tail Op

75cc300

Use inline module instead (more generic)

c491116

Remove useless barrier and vmcnt

fa8cad3

Add missing sync + gate on bf16

0eb8195

Minor changes

c53ccb9

Cleanup

2ce35e5

Cleanup

d249b19

Remove debug label

785ca34

Enable SrdLimit for fp4

e0e4de3

Fix missing wait/sync on fp4

ebbc28a

Fix boundary load for larger DU

19dd922

Add DU=768 test (fp4)

516056a

Tighten MXSA,MXSB too

94a0ada

Fix unit tests

c7ccdb7

sebvince force-pushed the tailloop_bf16_fp4 branch from 92adc6a to c7ccdb7 Compare May 27, 2026 13:46

sebvince added 5 commits May 28, 2026 04:35

KMask scales too

59260f5

Simplify Mask

f5009b7

Simplify Tighten Srd{A,B,MXSA,MXSB}+2

165cfec

Add early exit in tailloop

b69bc77

Single partition Tailloop

909fb3e

sebvince marked this pull request as ready for review May 28, 2026 17:39

sebvince requested a review from a team as a code owner May 28, 2026 17:39

sebvince requested a review from nakajee May 28, 2026 17:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Hipblaslt] [Subtile] Tail-loop support for BF16 and MXFP4 #7683

[Hipblaslt] [Subtile] Tail-loop support for BF16 and MXFP4 #7683
sebvince wants to merge 56 commits into
ROCm:developfrom
sebvince:tailloop_bf16_fp4

sebvince commented May 21, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sebvince commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Code change

Tests

Uh oh!

codecov-commenter commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sebvince commented May 21, 2026 •

edited

Loading

codecov-commenter commented May 21, 2026 •

edited

Loading