Fix/hipblaslt _APIAlgoIndex regression on per-arch mapping installs#7009
Merged
Conversation
## Summary Follow-up to #6840 (per-arch lazy mapping). Fixes the `pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex` test suite, which started failing on every gfx110X-all (and gfx12X-all / gfx94X-all) Windows runner that isn't the first-alphabetical arch in its build, and quiets the cross-arch "No library found for index N" stderr that the same scenario floods. ## Problem After #6840, lazy solution indices stay globally numbered but each arch's `TensileLiteLibrary_lazy_<arch>_Mapping.dat` only contains entries whose file prefix ends with `_<arch>`. In a multi-arch build the alphabetically-first arch (e.g. gfx1100) gets the low indices; later arches start well above 0. Two consequences: 1. **`_APIAlgoIndex` test breaks.** The test (`algo_method == 2`) probes `[0, 1, ..., 99]` blindly via `getAlgosFromIndex`. On gfx1101 (and any non-first-alphabetical arch) the entire first batch falls below the smallest mapping key, every lookup misses, `tmpAlgo` returns empty, the loop early-breaks at line 3755, and `CHECK_SOLUTION_FOUND` fires "NO solution found!" — even though the device has plenty of valid algos at higher indices. 2. **Cross-arch lookups spam stderr.** `MasterSolutionLibrary::loadLibrary` prints `std::cerr << "No library found for index N"` for every index below the local arch's smallest mapping key. With per-arch mapping that's now a routine occurrence for any consumer that scans an index range. Surfaced by ROCm/TheRock CI run 25215940348 on the gfx110X-all Windows runner: 68 `_APIAlgoIndex` failures, ~6800 stderr lines of cross-arch noise. ## Solution ### Test (`testing_matmul.hpp`, `algo_method == 2`) Discover the device's saveable indices via `getAllAlgos` first, extract them with `getIndexFromAlgo`, then round-trip the list through `getAlgosFromIndex` in batches of 100. This preserves the test's intent — "an index obtained from the API can be used to re-resolve the same algo" — without depending on indices starting at 0. The explicit-`solution_index` path is unchanged in semantics (now marked single-shot via `lastBatch = true` so the loop exits deterministically after one iteration). ### Library (`MasterSolutionLibrary.hpp`) Demote the `it == begin()` branch from `std::cerr` to `std::cout` gated by `Debug::Instance().printDataInit()` — the same flag the success-path "Loading library for index" print already uses. The diagnostic is preserved for anyone debugging library load with the standard Tensile debug knob, but stops polluting stderr in the now-routine cross-arch case. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #7009 +/- ##
===========================================
- Coverage 64.73% 64.73% -0.00%
===========================================
Files 2108 2108
Lines 328578 328678 +100
Branches 43101 43143 +42
===========================================
+ Hits 212682 212746 +64
- Misses 97997 98006 +9
- Partials 17899 17926 +27
*This pull request uses carry forward flags. Click here to find out more. 🚀 New features to boost your workflow:
|
testing_matmul.hpp: - Assert HIPBLAS_STATUS_SUCCESS from getAlgosFromIndex() in the discovered-index path. Indices come from a successful getAllAlgos + getIndexFromAlgo round-trip, so non-success here is a real regression in the index API. Explicit-index callers may legitimately pass an out-of-range index; CHECK_SOLUTION_FOUND surfaces that case. - Drop dead `lastBatch` variable. The early break on batchStart exhaustion (-1 path) and the unconditional foundAlgo=true after CHECK_SOLUTION_FOUND (explicit-index path) make the bottom-of-loop lastBatch check unreachable; simplify to `if(foundAlgo) break`. - `int algoIndexInc` -> `constexpr size_t algoIndexInc`. The only consumer is std::min<size_t>(batchStart + ...), so drop the narrowing. - Expand the algo_method == 2 intro comment with the test's intent (round-trip persistence path) before the per-arch shard rationale. - Add a one-shot mixed-validity contract check after the main loop: request [validIndices..., INT_MAX] and assert getAlgosFromIndex returns INVALID_VALUE *and* still produces the valid results. Recovers the implicit coverage the [0..N) probe loop used to provide; independent of per-arch index layout, so it doesn't reintroduce the bug this PR fixes. MasterSolutionLibrary.hpp: - Soften the comment on the gated "No library found" branch. The code can't distinguish a routine cross-arch shard miss from a genuinely bad index; spell out both cases, name the trade-off (loss of always-on bad-index diagnosis on non-shard installs), and explain why the gate is justified now (post-#6840 the cross-arch case dominates the signal). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the multi-paragraph justification on the "no mapping for index"
branch with an early-return whose log message describes the condition
("Index N not in this arch's mapping range"). The diagnostic-not-error
intent reads from the message + the printDataInit gate matching the
success-path print four lines below; no prose required.
Also flattens the success path one level by removing the else block.
No behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
In MasterSolutionLibrary::loadLibrary and the algo_method == 2 path of testing_matmul, brace every single-statement if/for body so future edits can't silently land outside the intended branch (Apple goto-fail style). In the algo_method == 2 path, drop the wall-of-comment intros and the "explicit-index path is single-shot" / "discovered indices came from..." inline narration. Replace `arg.solution_index == -1` with a named `indicesAreDiscovered` const bool so the surrounding branches read as intent rather than as a magic-value check, and rename `algoIndexInc` to `batchSize` to match what it actually represents. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ladder
The algo_method == 2 block was a 270-line while(1) loop with three exit
conditions buried in the body and four interleaved setProblem/select
variants. Restructure into two top-level lambdas matching the two
concerns:
searchForSupportedAlgoViaIndexAPI() -- discovery + batched search
verifyMixedValidityContract() -- post-search API contract test
Inside the search lambda, the loop now reads top-down with all three
exits visible at the top:
while(auto batch = nextBatchOfIndices()) {
const auto candidates = fetchAlgosForBatch(*batch);
if(candidates.empty()) break;
if(trySelectFromBatch(candidates)) break;
}
trySelectFromBatch dispatches by mode (grouped / ext+batched / CAPI),
each path delegating to a configureXxx + collectXxx pair. The
configureXxx lambdas hold the four setProblem variants, isolated from
the algo-selection logic. The two selection strategies are named after
what they actually do:
collectTuningsForFirstViableAlgo -- ext/grouped: stop at first j
with any working tuning
collectAllSupportedAlgosViaCAPI -- CAPI: every working algo
(no tuning concept)
Replaces magic-value comparisons with named state:
arg.solution_index == -1 -> indicesAreDiscovered
Drops the wall-of-comment intros, the "explicit-index path is
single-shot" inline narration, and the multi-line discovery rationale.
The names carry that intent now.
selectionWasAttempted preserves the original explicit-index semantics:
when getAlgosFromIndex returns empty, we silently fall through rather
than firing CHECK_SOLUTION_FOUND.
Adds <optional> for std::optional in nextBatchOfIndices.
No behavior change.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Alex-Vasile
reviewed
May 5, 2026
The lambda took its batch parameter as `const std::vector<int>&` but hipblaslt_ext::getAlgosFromIndex declares its index argument as a non- const `std::vector<int>&`. Pre-refactor the call used a non-const local named `algoIndex`, so the const dropped in only with commit 329e23d ("Refactor algo_method == 2: split into named lambdas"). Drop the `const`. The call site (`fetchAlgosForBatch(*batch)` with `auto batch = nextBatchOfIndices()`) already passes a non-const lvalue, so no caller change is needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The lambda refactor in 329e23d introduced several issues that compile only when clients are disabled: - discoverValidIndices and fetchAlgosForBatch used EXPECT_HIPBLAS_STATUS inside non-void lambdas; the macro expands to 'return;' which is invalid from a value-returning lambda. Converted both to void with out-params (validIndices lifted to outer scope; candidates passed in by reference). - collectTuningsForFirstViableAlgo and collectAllSupportedAlgosViaCAPI used CHECK_RETURNED_WORKSPACE_SIZE inside bool-returning lambdas with the same problem. Converted them and trySelectFromBatch to void with a bool& foundAlgo out-param. - gemmObject.isAlgoSupported and matmulIsAlgoSupported take hipblasLtMatmulAlgo_t& (non-const); the inner lambdas received candidates as const&, which dropped const at the call sites. Dropped the const from the parameter types and the receiving 'auto' at the caller. Also wraps a pre-existing EXPECT_EQ in verifyMixedValidityContract with ifdef GOOGLE_TEST to match the file's existing pattern (see d_vector.hpp); EXPECT_EQ is undefined in non-gtest builds. Behavior unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…y contract Two algo_method==2 paths previously fell through silently to a generic downstream "NO solution found" error: (a) explicit solution_index that returned zero candidates from getAlgosFromIndex(), (b) discovered indices where none produced a viable algo+tuning under the workspace budget. Both now emit a descriptive message naming the problem shape and fail at the original site via CHECK_SOLUTION_FOUND(0). Extends verifyMixedValidityContract from a single trailing-invalid shape to six: trailing-invalid, middle-invalid, reversed, duplicated, empty, all-invalid. Each shape pins both the documented return status (INVALID_VALUE iff any input was invalid) and the expected output count (count of valid inputs). Addresses review comments on PR #7009. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tracks issue #7080 (above-largest and gap-between-keys cases of upper_bound lookup). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
|
Overriding failing tensilelite tests resulting from a cmake issue unrelated to this change set. CMake is picking up GNU for the c+ compiler which is flagging legit issues in the code as warnings and we have Werror on. I will file an issue against stinkytofu for the UB that GNU is flagging. |
aledudek
pushed a commit
that referenced
this pull request
May 20, 2026
…g installs (#7009) ## Summary Follow-up to #6840 (per-arch lazy mapping). Fixes the `pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex` test suite, which began failing on every gfx110X-all (and gfx12X-all / gfx94X-all) runner that isn't the first-alphabetical arch in its build, and quiets the cross-arch `"No library found for index N"` stderr that the same scenario floods. ## Problem After #6840, lazy solution indices are still globally numbered, but each arch's `TensileLiteLibrary_lazy_<arch>_Mapping.dat` only contains entries whose file prefix ends with `_<arch>`. In a multi-arch build the alphabetically-first arch (e.g. gfx1100) gets the low indices; later arches start well above 0. Two consequences: 1. **`_APIAlgoIndex` test breaks.** The test (`algo_method == 2`) probed `[0, 1, …, 99]` blindly via `getAlgosFromIndex`. On gfx1101 (and any non-first-alphabetical arch) the entire first batch fell below the smallest mapping key, every lookup missed, `tmpAlgo` came back empty, the loop early-broke, and `CHECK_SOLUTION_FOUND` fired `"NO solution found!"` — even though the device had plenty of valid algos at higher indices. 2. **Cross-arch lookups spam stderr.** `MasterSolutionLibrary::loadLibrary` printed `std::cerr << "No library found for index N"` for every index below the local arch's smallest mapping key. With per-arch mapping that is now a routine occurrence for any consumer that scans an index range. Surfaced by ROCm/TheRock CI run 25215940348 on the gfx110X-all Windows runner: 68 `_APIAlgoIndex` failures, ~6800 stderr lines of cross-arch noise. ## Solution ### Test (`testing_matmul.hpp`, `algo_method == 2`) Rather than probing `[0..N)`, the test now discovers the device's saveable indices via `getAllAlgos` first, extracts them with `getIndexFromAlgo`, then round-trips the list through `getAlgosFromIndex` in batches of 100. This preserves the test's intent — "an index obtained from the API can be used to re-resolve the same algo" — without depending on indices starting at 0. Adds a one-shot contract check after the main loop: request `[validIndices…, INT_MAX]` and assert `getAlgosFromIndex` returns `INVALID_VALUE` while still emitting the valid results. Recovers the implicit "what happens with a bad index" coverage the `[0..N)` probe used to provide; independent of per-arch index layout, so it does not reintroduce the bug this PR fixes. ### Library (`MasterSolutionLibrary.hpp`) In `loadLibrary`, demote the `it == begin()` branch from `std::cerr` to `std::cout` gated by `Debug::Instance().printDataInit()` — the same flag the success-path "Loading library for index" print already uses. The diagnostic is preserved for anyone debugging library load with the standard Tensile debug knob, but stops polluting stderr in the now-routine cross-arch case. ## Readability follow-ups The last three commits are mechanical/structural cleanup with no behavior change. Reviewable independently of the fix above. - **`Refactor loadLibrary: drop wall-of-comment, invert to early-return`** Replaces a deeply-nested `if-then` with the more conventional "guard then continue" shape; deletes the wall-of-comment that previously explained the bottom branch. - **`Style: brace single-statement bodies; trim bloated comments`** Adds braces to every single-statement `if`/`for` body in both files (Apple `goto fail` style). Replaces `arg.solution_index == -1` with a named `indicesAreDiscovered` const bool; renames `algoIndexInc` → `batchSize`. - **`Refactor algo_method == 2: split into named lambdas with single exit ladder`** Splits the 270-line `while(1)` into two top-level lambdas: `searchForSupportedAlgoViaIndexAPI()` (discovery + batched search) and `verifyMixedValidityContract()` (the post-search contract test). Inside the search lambda, the loop now reads top-down with all three exits visible in one place: ```cpp while(auto batch = nextBatchOfIndices()) { const auto candidates = fetchAlgosForBatch(*batch); if(candidates.empty()) break; if(trySelectFromBatch(candidates)) break; } ``` The four interleaved `setProblem` variants are isolated in `configureExtGemmForCurrentProblem` / `configureGroupedGemmForCurrentProblem`, and the two selection strategies are named after what they do: `collectTuningsForFirstViableAlgo` (ext/grouped: stop at first algo with any working tuning) and `collectAllSupportedAlgosViaCAPI` (CAPI: every working algo, no tuning concept). ## Test plan - [ ] CI: `pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex` passes on gfx110X-all, gfx12X-all, and gfx94X-all runners (the configurations that previously failed) - [ ] CI: same suite still passes on gfx942 single-arch and on the first-alphabetical arch in each multi-arch build (was passing before, must not regress) - [ ] Manual: with `TENSILE_DEBUG=printDataInit`, both the cross-arch miss and the load success continue to print to stdout; without it, stderr stays clean during a normal index scan 🤖 Generated with [Claude Code](https://claude.com/claude-code) ROCM-23699 --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Follow-up to #6840 (per-arch lazy mapping). Fixes the
pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndextest suite,which began failing on every gfx110X-all (and gfx12X-all / gfx94X-all)
runner that isn't the first-alphabetical arch in its build, and quiets
the cross-arch
"No library found for index N"stderr that the samescenario floods.
Problem
After #6840, lazy solution indices are still globally numbered, but each
arch's
TensileLiteLibrary_lazy_<arch>_Mapping.datonly contains entrieswhose file prefix ends with
_<arch>. In a multi-arch build thealphabetically-first arch (e.g. gfx1100) gets the low indices; later
arches start well above 0.
Two consequences:
_APIAlgoIndextest breaks. The test (algo_method == 2) probed[0, 1, …, 99]blindly viagetAlgosFromIndex. On gfx1101 (and anynon-first-alphabetical arch) the entire first batch fell below the
smallest mapping key, every lookup missed,
tmpAlgocame back empty,the loop early-broke, and
CHECK_SOLUTION_FOUNDfired"NO solution found!"— even though the device had plenty of validalgos at higher indices.
Cross-arch lookups spam stderr.
MasterSolutionLibrary::loadLibraryprinted
std::cerr << "No library found for index N"for every indexbelow the local arch's smallest mapping key. With per-arch mapping
that is now a routine occurrence for any consumer that scans an index
range.
Surfaced by ROCm/TheRock CI run 25215940348 on the gfx110X-all Windows
runner: 68
_APIAlgoIndexfailures, ~6800 stderr lines of cross-archnoise.
Solution
Test (
testing_matmul.hpp,algo_method == 2)Rather than probing
[0..N), the test now discovers the device'ssaveable indices via
getAllAlgosfirst, extracts them withgetIndexFromAlgo, then round-trips the list throughgetAlgosFromIndexin batches of 100. This preserves the test's intent— "an index obtained from the API can be used to re-resolve the same
algo" — without depending on indices starting at 0.
Adds a one-shot contract check after the main loop: request
[validIndices…, INT_MAX]and assertgetAlgosFromIndexreturnsINVALID_VALUEwhile still emitting the valid results. Recovers theimplicit "what happens with a bad index" coverage the
[0..N)probeused to provide; independent of per-arch index layout, so it does not
reintroduce the bug this PR fixes.
Library (
MasterSolutionLibrary.hpp)In
loadLibrary, demote theit == begin()branch fromstd::cerrtostd::coutgated byDebug::Instance().printDataInit()— the same flagthe success-path "Loading library for index" print already uses. The
diagnostic is preserved for anyone debugging library load with the
standard Tensile debug knob, but stops polluting stderr in the
now-routine cross-arch case.
Readability follow-ups
The last three commits are mechanical/structural cleanup with no
behavior change. Reviewable independently of the fix above.
Refactor loadLibrary: drop wall-of-comment, invert to early-returnReplaces a deeply-nested
if-thenwith the more conventional "guardthen continue" shape; deletes the wall-of-comment that previously
explained the bottom branch.
Style: brace single-statement bodies; trim bloated commentsAdds braces to every single-statement
if/forbody in both files(Apple
goto failstyle). Replacesarg.solution_index == -1with anamed
indicesAreDiscoveredconst bool; renamesalgoIndexInc→batchSize.Refactor algo_method == 2: split into named lambdas with single exit ladderSplits the 270-line
while(1)into two top-level lambdas:searchForSupportedAlgoViaIndexAPI()(discovery + batched search) andverifyMixedValidityContract()(the post-search contract test).Inside the search lambda, the loop now reads top-down with all three
exits visible in one place:
The four interleaved
setProblemvariants are isolated inconfigureExtGemmForCurrentProblem/configureGroupedGemmForCurrentProblem,and the two selection strategies are named after what they do:
collectTuningsForFirstViableAlgo(ext/grouped: stop at first algowith any working tuning) and
collectAllSupportedAlgosViaCAPI(CAPI:every working algo, no tuning concept).
Test plan
pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndexpasses on gfx110X-all, gfx12X-all, and gfx94X-all runners (the
configurations that previously failed)
first-alphabetical arch in each multi-arch build (was passing
before, must not regress)
TENSILE_DEBUG=printDataInit, both the cross-archmiss and the load success continue to print to stdout; without it,
stderr stays clean during a normal index scan
🤖 Generated with Claude Code
ROCM-23699