Skip to content

Fix/hipblaslt _APIAlgoIndex regression on per-arch mapping installs#7009

Merged
davidd-amd merged 11 commits into
developfrom
fix/hipblaslt-apialgoindex-per-arch-mapping
May 5, 2026
Merged

Fix/hipblaslt _APIAlgoIndex regression on per-arch mapping installs#7009
davidd-amd merged 11 commits into
developfrom
fix/hipblaslt-apialgoindex-per-arch-mapping

Conversation

@davidd-amd
Copy link
Copy Markdown
Contributor

@davidd-amd davidd-amd commented May 1, 2026

Summary

Follow-up to #6840 (per-arch lazy mapping). Fixes the
pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex test suite,
which began failing on every gfx110X-all (and gfx12X-all / gfx94X-all)
runner that isn't the first-alphabetical arch in its build, and quiets
the cross-arch "No library found for index N" stderr that the same
scenario floods.

Problem

After #6840, lazy solution indices are still globally numbered, but each
arch's TensileLiteLibrary_lazy_<arch>_Mapping.dat only contains entries
whose file prefix ends with _<arch>. In a multi-arch build the
alphabetically-first arch (e.g. gfx1100) gets the low indices; later
arches start well above 0.

Two consequences:

  1. _APIAlgoIndex test breaks. The test (algo_method == 2) probed
    [0, 1, …, 99] blindly via getAlgosFromIndex. On gfx1101 (and any
    non-first-alphabetical arch) the entire first batch fell below the
    smallest mapping key, every lookup missed, tmpAlgo came back empty,
    the loop early-broke, and CHECK_SOLUTION_FOUND fired
    "NO solution found!" — even though the device had plenty of valid
    algos at higher indices.

  2. Cross-arch lookups spam stderr. MasterSolutionLibrary::loadLibrary
    printed std::cerr << "No library found for index N" for every index
    below the local arch's smallest mapping key. With per-arch mapping
    that is now a routine occurrence for any consumer that scans an index
    range.

Surfaced by ROCm/TheRock CI run 25215940348 on the gfx110X-all Windows
runner: 68 _APIAlgoIndex failures, ~6800 stderr lines of cross-arch
noise.

Solution

Test (testing_matmul.hpp, algo_method == 2)

Rather than probing [0..N), the test now discovers the device's
saveable indices via getAllAlgos first, extracts them with
getIndexFromAlgo, then round-trips the list through
getAlgosFromIndex in batches of 100. This preserves the test's intent
— "an index obtained from the API can be used to re-resolve the same
algo" — without depending on indices starting at 0.

Adds a one-shot contract check after the main loop: request
[validIndices…, INT_MAX] and assert getAlgosFromIndex returns
INVALID_VALUE while still emitting the valid results. Recovers the
implicit "what happens with a bad index" coverage the [0..N) probe
used to provide; independent of per-arch index layout, so it does not
reintroduce the bug this PR fixes.

Library (MasterSolutionLibrary.hpp)

In loadLibrary, demote the it == begin() branch from std::cerr to
std::cout gated by Debug::Instance().printDataInit() — the same flag
the success-path "Loading library for index" print already uses. The
diagnostic is preserved for anyone debugging library load with the
standard Tensile debug knob, but stops polluting stderr in the
now-routine cross-arch case.

Readability follow-ups

The last three commits are mechanical/structural cleanup with no
behavior change. Reviewable independently of the fix above.

  • Refactor loadLibrary: drop wall-of-comment, invert to early-return
    Replaces a deeply-nested if-then with the more conventional "guard
    then continue" shape; deletes the wall-of-comment that previously
    explained the bottom branch.

  • Style: brace single-statement bodies; trim bloated comments
    Adds braces to every single-statement if/for body in both files
    (Apple goto fail style). Replaces arg.solution_index == -1 with a
    named indicesAreDiscovered const bool; renames algoIndexInc
    batchSize.

  • Refactor algo_method == 2: split into named lambdas with single exit ladder
    Splits the 270-line while(1) into two top-level lambdas:
    searchForSupportedAlgoViaIndexAPI() (discovery + batched search) and
    verifyMixedValidityContract() (the post-search contract test).
    Inside the search lambda, the loop now reads top-down with all three
    exits visible in one place:

    while(auto batch = nextBatchOfIndices()) {
        const auto candidates = fetchAlgosForBatch(*batch);
        if(candidates.empty()) break;
        if(trySelectFromBatch(candidates)) break;
    }

    The four interleaved setProblem variants are isolated in
    configureExtGemmForCurrentProblem / configureGroupedGemmForCurrentProblem,
    and the two selection strategies are named after what they do:
    collectTuningsForFirstViableAlgo (ext/grouped: stop at first algo
    with any working tuning) and collectAllSupportedAlgosViaCAPI (CAPI:
    every working algo, no tuning concept).

Test plan

  • CI: pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex
    passes on gfx110X-all, gfx12X-all, and gfx94X-all runners (the
    configurations that previously failed)
  • CI: same suite still passes on gfx942 single-arch and on the
    first-alphabetical arch in each multi-arch build (was passing
    before, must not regress)
  • Manual: with TENSILE_DEBUG=printDataInit, both the cross-arch
    miss and the load success continue to print to stdout; without it,
    stderr stays clean during a normal index scan

🤖 Generated with Claude Code
ROCM-23699

davidd-amd and others added 2 commits May 1, 2026 16:37
## Summary

Follow-up to #6840 (per-arch lazy mapping). Fixes the
`pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex` test
suite, which started failing on every gfx110X-all (and
gfx12X-all / gfx94X-all) Windows runner that isn't the
first-alphabetical arch in its build, and quiets the cross-arch
"No library found for index N" stderr that the same scenario
floods.

## Problem

After #6840, lazy solution indices stay globally numbered but
each arch's `TensileLiteLibrary_lazy_<arch>_Mapping.dat` only
contains entries whose file prefix ends with `_<arch>`. In a
multi-arch build the alphabetically-first arch (e.g. gfx1100)
gets the low indices; later arches start well above 0.

Two consequences:

1. **`_APIAlgoIndex` test breaks.** The test (`algo_method == 2`)
   probes `[0, 1, ..., 99]` blindly via `getAlgosFromIndex`. On
   gfx1101 (and any non-first-alphabetical arch) the entire first
   batch falls below the smallest mapping key, every lookup misses,
   `tmpAlgo` returns empty, the loop early-breaks at line 3755, and
   `CHECK_SOLUTION_FOUND` fires "NO solution found!" — even though
   the device has plenty of valid algos at higher indices.

2. **Cross-arch lookups spam stderr.** `MasterSolutionLibrary::loadLibrary`
   prints `std::cerr << "No library found for index N"` for every
   index below the local arch's smallest mapping key. With per-arch
   mapping that's now a routine occurrence for any consumer that
   scans an index range.

Surfaced by ROCm/TheRock CI run 25215940348 on the gfx110X-all
Windows runner: 68 `_APIAlgoIndex` failures, ~6800 stderr lines
of cross-arch noise.

## Solution

### Test (`testing_matmul.hpp`, `algo_method == 2`)

Discover the device's saveable indices via `getAllAlgos` first,
extract them with `getIndexFromAlgo`, then round-trip the list
through `getAlgosFromIndex` in batches of 100. This preserves the
test's intent — "an index obtained from the API can be used to
re-resolve the same algo" — without depending on indices starting
at 0. The explicit-`solution_index` path is unchanged in semantics
(now marked single-shot via `lastBatch = true` so the loop exits
deterministically after one iteration).

### Library (`MasterSolutionLibrary.hpp`)

Demote the `it == begin()` branch from `std::cerr` to
`std::cout` gated by `Debug::Instance().printDataInit()` — the
same flag the success-path "Loading library for index" print
already uses. The diagnostic is preserved for anyone debugging
library load with the standard Tensile debug knob, but stops
polluting stderr in the now-routine cross-arch case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (77.83%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7009      +/-   ##
===========================================
- Coverage    64.73%   64.73%   -0.00%     
===========================================
  Files         2108     2108              
  Lines       328578   328678     +100     
  Branches     43101    43143      +42     
===========================================
+ Hits        212682   212746      +64     
- Misses       97997    98006       +9     
- Partials     17899    17926      +27     
Flag Coverage Δ *Carryforward flag
hipBLAS 90.65% <ø> (ø) Carriedforward from ee73956
hipBLASLt 39.92% <ø> (+0.19%) ⬆️
hipCUB 82.21% <ø> (ø) Carriedforward from ee73956
hipDNN 79.72% <ø> (ø) Carriedforward from ee73956
hipFFT 54.95% <ø> (ø) Carriedforward from ee73956
hipRAND 76.12% <ø> (ø) Carriedforward from ee73956
hipSOLVER 69.24% <ø> (ø) Carriedforward from ee73956
hipSPARSE 84.70% <ø> (ø) Carriedforward from ee73956
rocBLAS 48.11% <ø> (ø) Carriedforward from ee73956
rocFFT 48.18% <ø> (ø) Carriedforward from ee73956
rocPRIM 38.99% <ø> (ø) Carriedforward from ee73956
rocRAND 57.03% <ø> (ø) Carriedforward from ee73956
rocSOLVER 77.83% <ø> (ø) Carriedforward from ee73956
rocSPARSE 72.82% <ø> (ø) Carriedforward from ee73956

*This pull request uses carry forward flags. Click here to find out more.
see 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

davidd-amd and others added 4 commits May 4, 2026 16:12
testing_matmul.hpp:
- Assert HIPBLAS_STATUS_SUCCESS from getAlgosFromIndex() in the
  discovered-index path. Indices come from a successful getAllAlgos +
  getIndexFromAlgo round-trip, so non-success here is a real regression
  in the index API. Explicit-index callers may legitimately pass an
  out-of-range index; CHECK_SOLUTION_FOUND surfaces that case.
- Drop dead `lastBatch` variable. The early break on batchStart
  exhaustion (-1 path) and the unconditional foundAlgo=true after
  CHECK_SOLUTION_FOUND (explicit-index path) make the bottom-of-loop
  lastBatch check unreachable; simplify to `if(foundAlgo) break`.
- `int algoIndexInc` -> `constexpr size_t algoIndexInc`. The only
  consumer is std::min<size_t>(batchStart + ...), so drop the narrowing.
- Expand the algo_method == 2 intro comment with the test's intent
  (round-trip persistence path) before the per-arch shard rationale.
- Add a one-shot mixed-validity contract check after the main loop:
  request [validIndices..., INT_MAX] and assert getAlgosFromIndex
  returns INVALID_VALUE *and* still produces the valid results.
  Recovers the implicit coverage the [0..N) probe loop used to
  provide; independent of per-arch index layout, so it doesn't
  reintroduce the bug this PR fixes.

MasterSolutionLibrary.hpp:
- Soften the comment on the gated "No library found" branch. The code
  can't distinguish a routine cross-arch shard miss from a genuinely
  bad index; spell out both cases, name the trade-off (loss of
  always-on bad-index diagnosis on non-shard installs), and explain
  why the gate is justified now (post-#6840 the cross-arch case
  dominates the signal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the multi-paragraph justification on the "no mapping for index"
branch with an early-return whose log message describes the condition
("Index N not in this arch's mapping range"). The diagnostic-not-error
intent reads from the message + the printDataInit gate matching the
success-path print four lines below; no prose required.

Also flattens the success path one level by removing the else block.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
In MasterSolutionLibrary::loadLibrary and the algo_method == 2 path of
testing_matmul, brace every single-statement if/for body so future edits
can't silently land outside the intended branch (Apple goto-fail style).

In the algo_method == 2 path, drop the wall-of-comment intros and the
"explicit-index path is single-shot" / "discovered indices came from..."
inline narration. Replace `arg.solution_index == -1` with a named
`indicesAreDiscovered` const bool so the surrounding branches read as
intent rather than as a magic-value check, and rename `algoIndexInc` to
`batchSize` to match what it actually represents.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ladder

The algo_method == 2 block was a 270-line while(1) loop with three exit
conditions buried in the body and four interleaved setProblem/select
variants. Restructure into two top-level lambdas matching the two
concerns:

  searchForSupportedAlgoViaIndexAPI()  -- discovery + batched search
  verifyMixedValidityContract()        -- post-search API contract test

Inside the search lambda, the loop now reads top-down with all three
exits visible at the top:

    while(auto batch = nextBatchOfIndices()) {
        const auto candidates = fetchAlgosForBatch(*batch);
        if(candidates.empty()) break;
        if(trySelectFromBatch(candidates)) break;
    }

trySelectFromBatch dispatches by mode (grouped / ext+batched / CAPI),
each path delegating to a configureXxx + collectXxx pair. The
configureXxx lambdas hold the four setProblem variants, isolated from
the algo-selection logic. The two selection strategies are named after
what they actually do:

  collectTuningsForFirstViableAlgo  -- ext/grouped: stop at first j
                                       with any working tuning
  collectAllSupportedAlgosViaCAPI   -- CAPI: every working algo
                                       (no tuning concept)

Replaces magic-value comparisons with named state:

  arg.solution_index == -1   ->  indicesAreDiscovered

Drops the wall-of-comment intros, the "explicit-index path is
single-shot" inline narration, and the multi-line discovery rationale.
The names carry that intent now.

selectionWasAttempted preserves the original explicit-index semantics:
when getAlgosFromIndex returns empty, we silently fall through rather
than firing CHECK_SOLUTION_FOUND.

Adds <optional> for std::optional in nextBatchOfIndices.

No behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@davidd-amd davidd-amd requested review from Alex-Vasile and talumbau May 5, 2026 13:26
Comment thread projects/hipblaslt/clients/common/include/testing_matmul.hpp Outdated
Comment thread projects/hipblaslt/clients/common/include/testing_matmul.hpp Outdated
davidd-amd and others added 2 commits May 5, 2026 12:10
The lambda took its batch parameter as `const std::vector<int>&` but
hipblaslt_ext::getAlgosFromIndex declares its index argument as a non-
const `std::vector<int>&`. Pre-refactor the call used a non-const local
named `algoIndex`, so the const dropped in only with commit 329e23d
("Refactor algo_method == 2: split into named lambdas").

Drop the `const`. The call site (`fetchAlgosForBatch(*batch)` with
`auto batch = nextBatchOfIndices()`) already passes a non-const lvalue,
so no caller change is needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The lambda refactor in 329e23d introduced several issues that compile
only when clients are disabled:

- discoverValidIndices and fetchAlgosForBatch used EXPECT_HIPBLAS_STATUS
  inside non-void lambdas; the macro expands to 'return;' which is
  invalid from a value-returning lambda. Converted both to void with
  out-params (validIndices lifted to outer scope; candidates passed in
  by reference).

- collectTuningsForFirstViableAlgo and collectAllSupportedAlgosViaCAPI
  used CHECK_RETURNED_WORKSPACE_SIZE inside bool-returning lambdas with
  the same problem. Converted them and trySelectFromBatch to void with
  a bool& foundAlgo out-param.

- gemmObject.isAlgoSupported and matmulIsAlgoSupported take
  hipblasLtMatmulAlgo_t& (non-const); the inner lambdas received
  candidates as const&, which dropped const at the call sites. Dropped
  the const from the parameter types and the receiving 'auto' at the
  caller.

Also wraps a pre-existing EXPECT_EQ in verifyMixedValidityContract with
ifdef GOOGLE_TEST to match the file's existing pattern (see d_vector.hpp);
EXPECT_EQ is undefined in non-gtest builds.

Behavior unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…y contract

Two algo_method==2 paths previously fell through silently to a generic
downstream "NO solution found" error: (a) explicit solution_index that
returned zero candidates from getAlgosFromIndex(), (b) discovered
indices where none produced a viable algo+tuning under the workspace
budget. Both now emit a descriptive message naming the problem shape
and fail at the original site via CHECK_SOLUTION_FOUND(0).

Extends verifyMixedValidityContract from a single trailing-invalid
shape to six: trailing-invalid, middle-invalid, reversed,
duplicated, empty, all-invalid. Each shape pins both the documented
return status (INVALID_VALUE iff any input was invalid) and the
expected output count (count of valid inputs).

Addresses review comments on PR #7009.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tracks issue #7080 (above-largest and gap-between-keys cases of
upper_bound lookup).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@davidd-amd davidd-amd marked this pull request as ready for review May 5, 2026 22:09
@davidd-amd davidd-amd requested a review from a team as a code owner May 5, 2026 22:09
@davidd-amd
Copy link
Copy Markdown
Contributor Author

Overriding failing tensilelite tests resulting from a cmake issue unrelated to this change set. CMake is picking up GNU for the c+ compiler which is flagging legit issues in the code as warnings and we have Werror on. I will file an issue against stinkytofu for the UB that GNU is flagging.

@davidd-amd davidd-amd merged commit 3edd7a3 into develop May 5, 2026
6 checks passed
@davidd-amd davidd-amd deleted the fix/hipblaslt-apialgoindex-per-arch-mapping branch May 5, 2026 22:12
aledudek pushed a commit that referenced this pull request May 20, 2026
…g installs (#7009)

## Summary

Follow-up to #6840 (per-arch lazy mapping). Fixes the
`pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex` test suite,
which began failing on every gfx110X-all (and gfx12X-all / gfx94X-all)
runner that isn't the first-alphabetical arch in its build, and quiets
the cross-arch `"No library found for index N"` stderr that the same
scenario floods.

## Problem

After #6840, lazy solution indices are still globally numbered, but each
arch's `TensileLiteLibrary_lazy_<arch>_Mapping.dat` only contains
entries
whose file prefix ends with `_<arch>`. In a multi-arch build the
alphabetically-first arch (e.g. gfx1100) gets the low indices; later
arches start well above 0.

Two consequences:

1. **`_APIAlgoIndex` test breaks.** The test (`algo_method == 2`) probed
   `[0, 1, …, 99]` blindly via `getAlgosFromIndex`. On gfx1101 (and any
   non-first-alphabetical arch) the entire first batch fell below the
   smallest mapping key, every lookup missed, `tmpAlgo` came back empty,
   the loop early-broke, and `CHECK_SOLUTION_FOUND` fired
   `"NO solution found!"` — even though the device had plenty of valid
   algos at higher indices.

2. **Cross-arch lookups spam stderr.**
`MasterSolutionLibrary::loadLibrary`
   printed `std::cerr << "No library found for index N"` for every index
   below the local arch's smallest mapping key. With per-arch mapping
   that is now a routine occurrence for any consumer that scans an index
   range.

Surfaced by ROCm/TheRock CI run 25215940348 on the gfx110X-all Windows
runner: 68 `_APIAlgoIndex` failures, ~6800 stderr lines of cross-arch
noise.

## Solution

### Test (`testing_matmul.hpp`, `algo_method == 2`)

Rather than probing `[0..N)`, the test now discovers the device's
saveable indices via `getAllAlgos` first, extracts them with
`getIndexFromAlgo`, then round-trips the list through
`getAlgosFromIndex` in batches of 100. This preserves the test's intent
— "an index obtained from the API can be used to re-resolve the same
algo" — without depending on indices starting at 0.

Adds a one-shot contract check after the main loop: request
`[validIndices…, INT_MAX]` and assert `getAlgosFromIndex` returns
`INVALID_VALUE` while still emitting the valid results. Recovers the
implicit "what happens with a bad index" coverage the `[0..N)` probe
used to provide; independent of per-arch index layout, so it does not
reintroduce the bug this PR fixes.

### Library (`MasterSolutionLibrary.hpp`)

In `loadLibrary`, demote the `it == begin()` branch from `std::cerr` to
`std::cout` gated by `Debug::Instance().printDataInit()` — the same flag
the success-path "Loading library for index" print already uses. The
diagnostic is preserved for anyone debugging library load with the
standard Tensile debug knob, but stops polluting stderr in the
now-routine cross-arch case.

## Readability follow-ups

The last three commits are mechanical/structural cleanup with no
behavior change. Reviewable independently of the fix above.

- **`Refactor loadLibrary: drop wall-of-comment, invert to
early-return`**
  Replaces a deeply-nested `if-then` with the more conventional "guard
  then continue" shape; deletes the wall-of-comment that previously
  explained the bottom branch.

- **`Style: brace single-statement bodies; trim bloated comments`**
  Adds braces to every single-statement `if`/`for` body in both files
  (Apple `goto fail` style). Replaces `arg.solution_index == -1` with a
  named `indicesAreDiscovered` const bool; renames `algoIndexInc` →
  `batchSize`.

- **`Refactor algo_method == 2: split into named lambdas with single
  exit ladder`**
  Splits the 270-line `while(1)` into two top-level lambdas:
  `searchForSupportedAlgoViaIndexAPI()` (discovery + batched search) and
  `verifyMixedValidityContract()` (the post-search contract test).
  Inside the search lambda, the loop now reads top-down with all three
  exits visible in one place:

  ```cpp
  while(auto batch = nextBatchOfIndices()) {
      const auto candidates = fetchAlgosForBatch(*batch);
      if(candidates.empty()) break;
      if(trySelectFromBatch(candidates)) break;
  }
  ```

  The four interleaved `setProblem` variants are isolated in
`configureExtGemmForCurrentProblem` /
`configureGroupedGemmForCurrentProblem`,
  and the two selection strategies are named after what they do:
  `collectTuningsForFirstViableAlgo` (ext/grouped: stop at first algo
  with any working tuning) and `collectAllSupportedAlgosViaCAPI` (CAPI:
  every working algo, no tuning concept).

## Test plan

- [ ] CI: `pre_checkin_matmul_extapi_algo_method_gemm_*_APIAlgoIndex`
      passes on gfx110X-all, gfx12X-all, and gfx94X-all runners (the
      configurations that previously failed)
- [ ] CI: same suite still passes on gfx942 single-arch and on the
      first-alphabetical arch in each multi-arch build (was passing
      before, must not regress)
- [ ] Manual: with `TENSILE_DEBUG=printDataInit`, both the cross-arch
      miss and the load success continue to print to stdout; without it,
      stderr stays clean during a normal index scan

🤖 Generated with [Claude Code](https://claude.com/claude-code)
ROCM-23699

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants