Skip to content

Refactor new_amdgpu_family_matrix to use data classes instead of dictionary#3653

Open
HereThereBeDragons wants to merge 19 commits into
mainfrom
users/lpromber/classify_amdgpu_matrix
Open

Refactor new_amdgpu_family_matrix to use data classes instead of dictionary#3653
HereThereBeDragons wants to merge 19 commits into
mainfrom
users/lpromber/classify_amdgpu_matrix

Conversation

@HereThereBeDragons
Copy link
Copy Markdown
Contributor

@HereThereBeDragons HereThereBeDragons commented Feb 26, 2026

Replaces the nested dictionary format in new_amdgpu_family_matrix.py with two new files using data classes as suggested in #3452:

  • new_amdgpu_family_matrix_types.py — dataclass schema definitions
  • new_amdgpu_family_matrix_data.py — matrix data (arch, predefined arch groups, build variants)

This will allow for easier maintenance of the entire matrix: Parameters must now be defined and have the description right next to it (instead of being spread throughout the file) and only non-default values need to be provided for each arch. In addition, helper functions provided by the matrix classes will simplify the new matrix generator logic (replacement of configure_ci.py).

Changes:

  • Typed fields with defaults — only non-default values need to be specified per entry
  • Arch entries got flatten to sepearate _GFX* MatrixEntry definitions
  • Auto-discovery of _GFX* entries — adding a new GPU requires only a new variable, no registration step
  • Added unittests

Test Config:

  • run_tests auto-determined based on if a runner is set or not; set explicitly to False to reserve a runner without enabling tests
  • GpuRunners.extra — catch-all dict[str, str] for non-standard or temporary runners (e.g. test-sandbox, oem) without schema changes
  • New param to define which tests are to run: TestConfig.test_scope — typed Literal["all", "smoke", "full"]

Introducing new API for accessing the amdgpu matrix:

  • Case-insensitive key lookup — get_entry("GFX950") and get_entry("gfx950") are equivalent
  • Family-level default resolution — get_entry("gfx950") resolves to gfx950-dcgpu via is_family_default
  • get_entries_for_groups — resolves to a predefined group list (e.g. amdgpu_presubmit) to the corresponding MatrixEntry objects in definition order (--> will allow to remove this step from the generator logic)
    • to_dict(platform=None) — serializes full matrix or a single platform config with amdgpu_family key included

Includes changes of:

…n, introduce test_scope, remove push_on_success, always create all configs when declaring platformconfig, make arch lookup case-insensitive, add unittest
@HereThereBeDragons HereThereBeDragons force-pushed the users/lpromber/classify_amdgpu_matrix branch from 6b4fb3a to f6a8535 Compare February 27, 2026 18:05
@HereThereBeDragons HereThereBeDragons marked this pull request as ready for review February 27, 2026 18:12
@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

what is missing (as not yet merged): #3564

Copy link
Copy Markdown
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some useful improvements to code structure in here but I think some of the details around targets/families aren't yet in alignment with what the CI/CD systems need at their core. I also think that having a separate set of new_* scripts is going to keep biting us. Need to complete the migration so we only have one set of files serving the same purpose at a time.

Comment on lines +15 to +16
scope Subgroup within a family. Generic scopes are "dcgpu", "dgpu", "all".
Specific GPU names (e.g. "gfx1151") are also valid scopes.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an exhaustive list. See https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake. We also have igpu for example.

acronym meaning
dgpu discrete gpu
dcgpu datacenter gpu
igpu integrated gpu

The concept of GPU families is also going to be less consequential when we've switched fully over to the multi-arch pipelines, so let's be careful about what abstractions we add here and if they will continue to make sense.

Looking through that lens and stepping back a bit, I don't understand the distinction this is making between a "family" and a "scope". That CMake file is the source of truth, and it defines a family as the full string (e.g. gfx94X-dcgpu), NOT just a particular prefix (e.g. gfx94X).

  • For (non-multi-arch) CI/CD, we choose a list of target families to build for, as each family can then share device-agnostic build stages and amortize binary size overhead.
  • For multi-arch CI/CD we're going to choose a list of targets to build and those MAY be grouped by family, but they don't need to be as the build will be efficiently parallelized even if they aren't. Similarly the packages will share the same device-agnostic files no matter how the GPU targets/families are split.
  • For testing, we should make it clear what the mappings are for "gpu target" to "runner label". We could try to make those mappings all be 1:1 ("gfx1151" -> "linux-gfx1151-gpu-rocm") instead of being 1:N ("gfx110X-all" -> ["linux-gfx1100-gpu-rocm", "linux-gfx1101-gpu-rocm"]).

I don't think we should focus too much on "dcgpu", "dgpu", "all", etc. at this level of the infra.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know this is not an exhaustive list, i guess adding ... will help to avoid this confusion and expanding the frozenset of _GENERIC_SCOPES (but right now we have no entry for igpu.. :) )

the idea for family and scope is similar to family and target - just that scope is agnostic if it defines a single gpu arch or a family (to avoid ambiguities i named it "scope"). this is mainly done due to the CI scripts currently accepting inputs like "gfx94X" and expanding it to "gfx94X-dcgpu". However, we have some families that have multiple igpu, dgpu etc. - so it is not clear to what it should be expanded to - or to be exact: right now this is not possible in amdgpu_family_matrix to have multiple entries for the same family. E.g. having entries for both gfx110X-all and gfx110X-igpu with different runners would not be possible.

As such i added the is_family_default, so if we say: gfx110X-all == is_family_default,you get the following behavior within the CI:

  • gfx110X --> will auto-expand to gfx110X-all
  • gfx110X-all will match gfx110X-all
  • gfx110X-igpu will match with entry _GFX110X_IGPU if available, otherwise fail

-igpu and -all can have different settings, runners, etc.

If we put it in the right order in the file, we could probably even deep-copy the settings and just adjust what we need to.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still concerned here. This looks like a lot of boilerplate that isn't going to be useful in the multi-arch CI/CD world (cutover is imminent - please no new feature work on the older CI).

Target families and scopes, at least in their current form, are a concept that was engineered for the non-multi-arch CI.

  • We need a direct mapping from "gpu architecture" to "runner label".
  • We need a direct mapping from "gpu architecture" to "rocm build status", "rocm test status", "pytorch build status", etc.

We could list targets directly and construct families on top of them. I think this is inverted:

    GFX1151 = MatrixEntry(
        family="gfx115X",
        scope="gfx1151",
        linux=PlatformConfig(

I think I'd rather have something more like:

GFX1151 = GPU_ARCHITECTURE(...)
...

FAMILY_GFX1151 = [GFX1151]
FAMILY_GFX110X_ALL = [GFX1100, GFX1101, GFX1102]

@marbre may need your eyes here too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed this offline wit Laura, she will provide an update on what we concluded on.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we decided on the following considering that multi arch will most likely in the future only be operating on specific targets/archs:

  • remove scope, and remove having "-dcgpu" or any other family specification as its own entry (as it is currently in amdgpu_family_matrix). instead have only entries for concrete archs (=target).
  • for each target, auto-populate its families based on cmake/therock_amdgpu_targets.cmake
  • have getters for: get "default" target for when requesting a family (for backward compatibility). is currently based on which gpu runners we have. e.g. gfx120X-all --> returns entry gfx1201.
  • have getters for: "give me all entries for a family" so in the future we could replace "fetch-gfx-targets": "gfx120X-all" --> ["gfx1200", "gfx1201"]
  • have a lookup() that tells you how an entry is resolved: gfx942 -> "target", gfx94X-dcgpu -> "family", gfx94X -> "family_prefix" (see question below)
  • when requesting a list of gpu entries, deduplicate by default: get_entries_for_groups(["gfx942", "gfx94X-dcgpu"]) --> just returns a single entry "gfx942"

NOTE: families are only considered for all the entries which have a MatrixEntry in the file. e.g. we do not consider any "igpu" as we dont have any entries for them. so when requesting a family "-all" those are omitted as no entry exists for them.

open questions:

  • right now we have partial matching of families. however it is difficult to decide which family it should be expanded to. e.g. "gfx120X" --> returns entry "gfx1201". but if we want to set amdgpu_family: in the workflow, we kinda need to expand it to "gfx120X-all". so how do we do it? or do we just remove this functionality?

see claude reply:

❯ i wonder if we should just remove the gfx94X matching. as we need to know exactly if gfx94X-all or gfx94X-dcgpu to know what we are building                                                 

● That's a valid point. The prefix match hides an important distinction — gfx94X-dcgpu builds only discrete GPU targets, while gfx94X-all would include everything in the family. Silently
  picking whichever family happens to match first is ambiguous and could cause wrong builds.

  The arguments for removing it:
  - Forces callers to be explicit about what they're building
  - Removes ambiguity when a family has multiple scopes (dcgpu vs all)
  - The only convenience lost is saving 6 characters (-dcgpu / -all)

  The argument for keeping it: gfx94X is a natural shorthand people might type, and it's better to resolve it than error. But that's a weak argument — failing with "did you mean gfx94X-dcgpu
  or gfx94X-all?" is more helpful than silently picking one.

future questions:

  • what happens if we request a family but have multiple runners? do we return a list of runners? do we keep the "just return the default_family_entry"? something else? or is this just not a scenario that could be happening?
  • i think we can auto-populate fetch-gfx-targets but it would be good to know what the exact requirements are, as for some it returns multiple targets and for others just a single one eventhough we have more entries for the same family

"gfx1152",
"gfx1153",
],
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand, postsubmit workflows run the union of the "presubmit" and "postsubmit" sets. Maybe we should either:

  1. Document that here in comments
  2. Actually build the lists that way? A few options:
    # Option A: construct sets separately
    amdgpu_presubmit_set = set(["gfx94X-dcgpu", "gfx110X-all", "gfx1151", "gfx120X-all"])
    amdgpu_postsubmit_set = amdgpu_presubmit_set | set(["gfx950-dcgpu"])
    amdgpu_nightly_set = amdgpu_postsubmit_set | set([
            "gfx906",
            "gfx908",
            "gfx90a",
            "gfx101X-dgpu",
            "gfx103X-dgpu",
            "gfx1150",
            "gfx1152",
            "gfx1153",
        ])
    amdgpu_family_predefined_sets = {
      "amdgpu_presubmit": amdgpu_presubmit_set,
      "amdgpu_postsubmit": amdgpu_postsubmit_set,
      "amdgpu_nightly": amdgpu_nightly_set,
    }
    # Option B: construct a single list and then parse out the individual sets
    from enum import Enum
    class TriggerFrequency(Enum):
        PRESUBMIT = 1
        POSTSUBMIT = 2
        NIGHTLY = 3
    
    amdgpu_family_groups = {
      # Presubmit
      "gfx94X-dcgpu": TriggerFrequency.PRESUBMIT,
      "gfx110X-all": TriggerFrequency.PRESUBMIT,
      ...
      # Postsubmit
      "gfx950-dcgpu": TriggerFrequency.POSTSUBMIT,
      ...
    }
    # Then construct sub-lists from that full list?

I'm basically looking for a design here that makes it

  1. easy to see which GPU targets/families are build/tested by default for each event
  2. easy to move build/test frequency up or down
  3. hard to make configuration mistakes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

option 1 seems for me way easier to understand and extend. for the future we might want to have additional predefined groups for weekly runs etc.

Comment on lines +361 to +367
amdgpu_family_info_matrix_all = AmdGpuFamilyMatrix(
entries=[
v
for k, v in globals().items()
if k.startswith("_GFX") and isinstance(v, MatrixEntry)
]
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is an interesting way to collect entries into a single list without needing to explicitly add them. Could put them in a class scope to avoid needing to use globals() though?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i wanted to make it less error prone when editing. adding a new arch is already annoying enough considering over how many lines a single entry spans - and then keeping manually track of which are available, changed naming or whatever is just horrible

@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

yes, i know the new_* is biting us but I am missing some extra hands to get this more quickly done :(

Copy link
Copy Markdown
Contributor

@geomin12 geomin12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

much cleaner i like it!!! just one comment but lgtm

"""If True, only a sanity-check test subset is run, not the full suite."""
test_scope: Literal["all", "smoke", "full"] = "all"
"""Which test subset to run: all (default), smoke (sanity only), full (skip smoke)."""
expect_pytorch_failure: bool = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are more of these args coming, it may be good to add that extra dict here as well, as we add/remove optional args

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3992 also added quick as a test scope/type

@geomin12
Copy link
Copy Markdown
Contributor

geomin12 commented Mar 3, 2026

yes, i know the new_* is biting us but I am missing some extra hands to get this more quickly done :(

I think it is fine for the time being!

Copy link
Copy Markdown
Contributor

@erman-gurses erman-gurses left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments

Comment on lines +313 to +320
def get_entries_for_groups(self, group_keys: list[str]) -> list[MatrixEntry]:
"""Return entries matching the given keys, preserving order and skipping unknowns."""
result = []
for key in group_keys:
entry = self.get_entry(key)
if entry is not None:
result.append(entry)
return result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_entries_for_groups() currently skips unknown keys silently. For CI configuration, that feels risky because a typo in a predefined group could quietly reduce build/test coverage instead of failing fast. Typo like gfx94X-dcgup would just disappear from the result

I think this should raise on unknown keys, or at least have a strict mode used by tests / workflow generation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will add that not only the matching entries are returned but also unmatched keys.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also note that while predefined groups are being used, in theory any list of strings can be used as group_keys

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. How about something like:
def get_entries_for_groups(group_keys, strict=False):

When strict=False, unknown keys can be skipped (for cases like experimental or user-provided keys).
When strict=True, the function could raise if any key is not recognized.

That way the helper stays flexible for general use, but CI/workflow generation can use strict=True to avoid silently dropping entries due to typos.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ScottTodd any opinion on it?
i would argue that giving back a tuple of match entries, and unmatched keys is enough and any script on top can have the logic to fail if needed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good If we get @ScottTodd's opinion. I see that your approach gives more flexibility on the caller side.

Comment on lines +94 to +100
def test_get_entries_for_groups(self):
entries = amdgpu_family_info_matrix_all.get_entries_for_groups(
amdgpu_family_predefined_groups["amdgpu_presubmit"]
)
self.assertGreater(len(entries), 0)
for entry in entries:
self.assertIsInstance(entry, MatrixEntry)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since get_entries_for_groups() is used on predefined CI groups, maybe we can add a negative test for an unknown key here. Otherwise a typo in a group can silently pass through without coverage loss being obvious.

For example, a test that verifies unknown keys either raise or are handled explicitly would make this safer.

Comment on lines +326 to +331
def to_nested_dict(self) -> dict:
"""Convert to the original nested-dict format: family → target → platform → ..."""
result: dict = {}
for entry in self.entries:
result.setdefault(entry.family, {})[entry.scope] = entry.to_dict()
return result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we use this method anywhere? I may have missed it, but I didn’t find a usage in this PR or in the added tests.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will be using it to create the dictionary that we use in the workflows later (created by the refactored configure_ci.py which will be the next PR step).
can see if i can add a test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add some testing. but not sure how useful it really is

Copy link
Copy Markdown
Contributor

@erman-gurses erman-gurses Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just a suggestion - maybe we can add after its usage is active.
As far as I understand, we can do something like this in the future for regression testing.

assert result["gfx94X"]["build"]["amdgpu_family"] == "gfx94X"

jayhawk-commits pushed a commit that referenced this pull request Mar 11, 2026
## Motivation

Bump rocm-systems from 93bc019 to 093b66caa3 (includes fix for hip-tests
issue and revert for mathlib hiprtc issues and revert for rccl-test,
added revert for miopen failures due to PR 653):

Commits:
093b66caa3 (HEAD, origin/develop, origin/HEAD) Revert "SWDEV-546177 -
hipModuleGetLoadingMode API impl (#653)" (#3858)
d8a0adbc9f [AMD-SMI] Hide libamd_smi.so internal symbols (#3777)
d4da458f94 [rocprofiler-sdk] [Documentation ] Updating changelog (#3827)
19fadeb082 (origin/users/abchoudh/fix_dispatch_count) [RCCL][Tuner
Plugin] Enable tuning of RCCL tuning constants (#3757)
b4f5f8a6a8 rocr: Fix IPC dmabuf hang with large allocations (#3211)
64efea0435 RCCL: allow users to override max and per job memory & fix
defaults. (#3797)
9b3dd101bb Removing ready_for_review (#3849)
7e43880a64 [rocprofiler-systems] Update ROCm version to 7.2.0 in CI
workflows for Debian, RedHat, and Ubuntu (#3431)
1fdb6b9827 [rocshmem] add gda/topology unit tests (#3715)
be1ea24a96 Move hipMipmappedArrayGetMemoryRequirements test to common
tests
e4513f04c8 Update amdgpu-windows-interop with latest changes, pal
58aa0bab2ced0cc9ebe8d2d0932db6774feb4e49 2026-03-04(#3773)
b1f964d796 [rocprofiler-compute] Ensure long kernel name fully shows in
compute analyze (#3665)
4dcf1e3ce0 SWDEV-567112 - Replace test names (#3787)
33f5f302e5 ROCM-2428 - fixes hipStreamBatchMemOp invalid operation
checks (#3099)
139f4bfff8 [SWDEV-556456] Align HIP_UUID with rocminfo (#3614)
8e8928544c Reduce buffers alignment to 4 bytes (#3821)
51be29a647 AIRUNTIME-125: Consolidate Windows optimization and debug
flags (#3825)
1407392240 [AMD-SMI] CI: Fix root workflow to use ASIC-specific test
filters (#3807)
63f78a98d7 (origin/users/mcao/fix_rocpdsummary) [ROCM-SMI] Fix DRM
include dirs leaking absolute build paths to consumers (#3808)
caf2f7e1eb [ROCM-186] amd-smi: Add support for a VRAM and GTT tuning
interface (#3636)
a0712d4c2a [TheRock CI] Update projects_to_test lists (#3749)
02090c42c9 rocrtst: install gfx .hsaco files to share/rocrtst (#3744)
4a0a1cbfce Merge other simd table (#3696)
0d07657d78 Add missing kwargs from
rocprofiler_add_integration_validate_test in .cmake-format.yaml (#2336)
3a3df301dc Optimize device counting service GPU interactions (#1583)
95d9da0098 Add SPM Enable flag in build infrastructure (#3677)
12bb9435b2 [rocprofiler-sdk] On-demand GPU profile queue
creation/destruction (#3586)
941057c2c0  Navi4 tuning table iter 1 (#3052)
dbf2b7369f [AMD-SMI] Display N/A for cu_occupancy when file is
unavailable (#3589)
b0efc7c639 [RCCL] [UT] Add ROCTX test (#3625)
ba7a20ea18 Reducing the p2pnChannels for half-subscription A2A on
multi-node MI350 (#3381)
75238c98a2 [clr] Fix memory leak in getOrCreateHostcallBuffer (#3699)
af2ee0e8ad [hip-tests] ASAN Check for image support before we create
context (#3834)
ad4496678e Update windows ci subtree in include amdgpu-windows-interop
(#3814)
c8ad252208 [rocprofiler-register] Fix compilation with system fmt/glog
(#1243)
781881544d Update README to include dbgapi and debug agent components
(#3731)
88e4a7837e ROCProfiler and ROCTracer: Modifying deprecation note (#3831)
b5918a5f35 [ROCM-3124-3125-3126] CUID file generation hangs on MI350
systems/CUID test failures/Segmentation fault in CUID example code
(#3548)
97a5dd993c Update copyright to use SPDX IDs (#3805)
511730ab45 [rocshmem]: add flood-amo tester (#3653)
2d650a0065 [clr] Fix heap use after free error in device allocations
(#3789)
b6b179ad81 Disable hipHostRegister_Negative test for ASAN (#3832)
39ec318c8d [RCCL] Add GDA alltoallv via rocshmem integration (#3613)
fb0f4d53b1 [RCCL] [CUMEM] Fix cuMem multi-process runs (#3811)
c3de7d4bf6 SWDEV-526201 - Fix and enable disabled HIP tests from warp
group (#3089)
8d9a8ca161 roofline: code cleanup and refactor vector types (#3813)
8957e49028 Don't wait on command completion if worker thread is
destroyed (#3790)
9e7586a5fa [rocshmem] Add barrier APIs and expose `ROCSHMEM_TEAM_WORLD`
on device (#3651)
91b09235b0 Revert "fix local gpu release static build failure (#3667)"
(#3799)
0fda754b1b libhsakmt: Add secondary KFD context creation support
ee43db95b0 Revert "Update TheRock reference to 20260303 commit (#3709)"
(#3826)
86e28b9fae Added fix to update GL2C counters instance count for GFX11.5
(#3100)
93f69f7de4 Adjust includes to match use (#3742)
e9fbc3f1a2 (develop) Update TheRock reference to 20260303 commit (#3709)
be0675a1a6 (HEAD) Revert "Support fp8 types in hiprtc (#2605)" (#3792)
3e3a94a4ef [rocprofiler-systems] Add trace_cache support for
std::optional<T> serialization (#3490)
0b42a7f472 clr: Eliminate unnecessary kernel name string copies (#3774)
b6b0d77b29 rocr: Add hsa_amd_memory_async_batch_copy API for batched
memory copies (#3259)
486e6d12d2 Resolve staircase RS regression with 48 max channels (#3684)
eb59c85ac4 [gfx942][gfx950] Leverage new cache bypass builtins for
simple protocol where available (#2847)
4d74d27f0e (origin/users/raramakr/rocm-smi-target) Revert "Auto Labeler:
Add ci:regression-detection label to rccl PRs (#3543)" (#3769)
8f0795517c [AMD-SMI] CI: Use ASIC-specific test blacklists in workflows
(#3775)
7cef5b64c1 Fix MFMA total FLOPS calculation (#3371)
aea37512ba Remove duplicated tests (#3235)
b6c656fdd4 Remove duplicated tests in memory module (#3087)
ca3137d8f9 [rocprofiler-sdk] Install integration tests without building
for therock & Misc. fixes (#3047)
0ab5c41f65 [rdc] Enable on-demand queue mode in rocprofiler-sdk to
prevent inference degradation (#3629)
a1eb2a1f7c rocr/wsl: a library should not output to std::out by default
(#3718)
b7da296cc8 Reenable flood_put/get testers on mlx5 since they should work
after pr2732 (#3748)
000e24de2f [rocprofiler-sdk] Add automatic late-start support to
rocprofiler_force_configure (#2168)
64ea87f592 [hip-tests] Fix memory leaks in hipMemPoolTrimTo tests
(#3643)
543a7d765f rocr: Include code object allocs in lightweight coredump
a58da378d4 [rocdecode] - update rocdecode ctest (#3768)
f88e4ee44d [rocprofiler-systems] Make CDash submit non-fatal and add
GitHub Actions logging (#3525)
cb14debc3a [rocprofiler-systems] Update nlohmann-json submodule (#3391)
449253009a SWDEV-567112 - Introduce new mechanism for tagging and
disabling tests - Part 2 (#3707)
8ca991393d disabling rccl from full build (linux), covered in RCCL CI
(#3770)
c4fdb20b74 [ATT] Re-enable tests. Add option to specify perf to target
CU only (#2819)
615aab95ed ROCM-3816 Out of Memory fix (#3588)
8ffad41b24 Fix rocm_smi64 exporting invalid absolute paths to consumers
(#3717)
042d76a626 rocr: Remove dependency on KFD in Runtime::VMemoryHandleMap
(#2515)
555db59b2a [AMD-SMI] CPU: Added support for family 1A Models 50h-57h
(#3206)
3affa2c7a3 [SWDEV-555935] Fix shared mutex and self-heal (#3729)
ba0bf0f3db Replace hipMemGetInfo with ihipMemGetInfo and use it for
internal calls. (#2845)
c5cef9b18e Fix HIP_RETURN on all HIP API calls. (#2838)
241ce7ba83 Revert "memory: fix "contiguous_bytes" calculation in generic
conversion (#3285)" (#3755)
8a690f482e [kpack/clr] Windows PE/COFF support for kpack artifact
splitting and runtime loading (#3728)
863bdf8aa8 MFMA pre-processor guards for ipc.hip (#3724)
90bb9b1921 Release queue outside of vgpusAccess lock (#3705)
de4523910c clr: Add build support of ROCR and PAL backends together
(#3722)
dfb7abc2d8 [rocprofiler-sdk] RCCL API changes for
RCCL_API_TRACE_VERSION_PATCH = 3 (#3477)
d69d4f23db [AICOMRCCL-633] - Fixed warnings in tests (#3402)
067d86dcaa rocr/wsl: Disable AQL Queue usage with flag ROCR_USE_PM4
(#3663)
594eb60d42 [TheRock CI] rocm-systems build full ROCm stack (#3182)
27d17e8ea0 [ROCProfiler-SDK] Fix SWDEV-556922: Handle comments before
checking for pmc: (#1723)
c80d90439d memory: fix "contiguous_bytes" calculation in generic
conversion (#3285)
669987c83f [hip-tests] ASAN - add missing release handles (#3735)
a24bbd75a4 fix local gpu release static build failure (#3667)
259b2ff913 Speed up DeviceId (#2803)
65d9264bf4 Simplify MPI trace merge logic and remove legacy guards
(#3562)
1076c083cb use system to look for zcat path instead (#3720)
22f1d19db3 [AICOMRCCL-355] Enable threshold-based p2p-batching (#3000)
a2e4c794d2 Partially flatten template tests cases (#2597)
e242abe219 Pass space separated gfx target list to RCCL build command
(#3701)
4f78aea66d SWDEV-570074 - Refactor Memset memory object handling.
(#2228)
b3ad12d834 Support Nvidia build on theRock for HIP-tests (#3335)
a1cf15ea9a Support fp8 types in hiprtc (#2605)
8ef84b0a50 [rocprofiler-systems] Add HPC examples to automated testing
(#3437)
db3a70dfa0 Free memory which was allocated in tests (#3710)
27e6809c7e [rocprofiler-systems]: Fix rhel CI failure on for MPI and UCX
tests (#3700)
0d9aaf59d8 rccl/topo_expl: fix build issue. (#3719)
be04d75765 Fix zcat path used for checking kernel configs (#3423)
cab60a7b27 rocr/thunk/win: Add CU mask support (#3518)
5b3d826c05 [CUMEM] Initial support for cuMem APIs (#2763)
0606ff491f [HIP] [PLAT-194496] Improve Stress_hipMalloc_HighSizeAlloc
reliability (#3550)
05750a77cc fix hip-test name in config (#3716)
33f777f3e9 hsakmt: Remove --high functionality from run_kfdtest.sh
(#2486)
e4c46e3480 Hide the retain under direct dispatch check (#3698)
bfe0ca0279 Add rocprof trace decoder to CI tests (#3690)
a769b6f54e [rocSHMEM] Edgar/abstract allocator ipc part1 (#3411)
659fb52243 [AMD-SMI] Fix bugs, improve error handling, and clean up
NIC/switch code (#3654)
0eb26ea571 hsakmt: Fix Import/Export of dmabuf_fd for WSL/Windows
(#3348)
a122936abb [SWDEV-567812] Add UBB power and power_limit fields to
npm_info (#3262)
c3bec090c5 [rocprofiler-sdk][rocprofv3][rocpd] Updates for KFD data
(#340)
7c44d47740 SWDEV-547659 - Remove HIP_VERSION_GITHASH in logs (#448)
74b6487a6a SWDEV-547008 - Documentation fix for function return values
(#463)
af21cd44f1 SWDEV-545553 - Improve clarity and robustness of CALLBACK
unit tests (#546)
180d639044 SWDEV-544900 - Change hip-test test case name (#547)
feeca99950 Doc improvements (#3688)
c1822b6336 ROCprofiler-SDK: deprecation of legacy tools (#3609)
5d7aff8462 Fix rocprof-compute-viewer link (#3459)
0b0b4846f0 AIRUNTIME-129 - Fix Ocl test failures of 2D image with
pitches. (#3584)
ac569b87e0 Fix memory tests config (#3687)
603fe7a5cf [hip-tests] Enable hipMipmappedArrayGetMemoryRequirements
test via cmake
4fad4452d9 [hip] Docs: Updates to some memory management pages
8cc59559fe AICOMRCCL-656 fix memory leak in ncclCommInitRankFunc (#3628)
94a4595a5d Fix missing amd_comgr linkage in pc-sampling integration test
(#3453)
2a68565dce rocrtst: CMAke file: strip xnack/feature suffixes from gfxNum
in build_kernel (#3652)
c3542bfb2b [rocprofv3] Deprecating input text files for counter
collection (#1562)
ff122e7ed7 SWDEV-573073 - Cleanup hipHostAlloc/Malloc/Register tests
(#3017)
5b1deaf29d SWDEV-567112 - Introduce new mechanism for tagging and
disabling tests - Part 1 - Core (#2351)
6e0cc309e1 rocrtst: MaxSingleAllocationTest: skip CPU NUMA nodes >0
(#3208)
d65f601195 [AICOMRCCL-667] rccl: Change GDR selection logic. (#3607)
f1c44ab200 Patch Back to Old Repo: fixes from manual runs (#3621)
fe53bcd715 [AMD-SMI] Allow amdsmi init to succeed when no NIC hardware
is present (#3403)
b25600efdb [ROCM SMI] Fix fw pldm version not displayed in default
amd-smi (#3594)
169d2ef763 root to module wiring, remove legacy source collection
(#3482)
7469781988 [LRT][clr] SWDEV-512963-Fix CTS test failures for 1D buffer
copy (#3520)
c8f55d9b86 Adding rocprof trace decoder (#3576)
425e983502 Trace decoder codeowners (#3600)
a176efd648 [hip-tests] Add return statements to HIP_SKIP_TEST (#3647)
32687cf183 rocrtst: CPUAccessToGPUMemoryTest: Cap host allocation to 512
MB under ASAN (#3407)
97c0206753 Update codeowners for thunk DXG (#3334)
be44b28bb6 [rocdecode][rocjpeg] - ctest CMakeLists cleanup (#3632)
80ff0b8942 Various memory leak fixes in hip-tests (#3605)
0988f67a85 fix typo in help text (#3314)
9f823c53f1 Fix CUID file lookup by loading files before searching
entries (#3436)
064c89261b SWDEV-546177 - hipModuleGetLoadingMode API impl (#653)
006213e112 ROCM-2696: Ignare size and base if null ptr (#3336)
6060b99d83 Improve atomic min max test perf (#2580)
3fbcc13602 Change printf capture impl (#1127)
93bc01937c (tag: hip-version_7.12.60610,
origin/users/mradosav-amd/rocprofsys-selective-region) [ROCM-CORE]
Update rdhc script to support rocm install prefix
(ROCm/rocm-systems#3596)

[AICOMRCCL-355]:
https://amd-hub.atlassian.net/browse/AICOMRCCL-355?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
…r_groups return unmatched keys and add tests for it.
HereThereBeDragons added a commit that referenced this pull request Mar 12, 2026
 commit  8f69a63 (mar 12, 26).  simplified generator loop, changed  flags to include their label,  optimized config(), more code style improvements.
…reate families based on cmake file. add functions to retrieve entries based on family, or default to a single entry for a family, make deduplciation of entries the default when retrieving multiple via get_entries_for_groups. add lookup() to tell you how the entry was resolved (target, family, family_prefix)
Copy link
Copy Markdown
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, wish I could have given more of this feedback sooner, but I'm still not really seeing the dots fully connecting here. The "GPU family" code is adding significant complexity. I think for multi-arch we need to trim that out and make it less foundational to the CI build/test configuration code paths.

I think my comment on a flatter list of GpuRunners is trending in a cleaner direction.

Comment on lines +111 to +112
def test_unknown_key_returns_none(self):
self.assertIsNone(amdgpu_family_info_matrix_all.get_entry("gfx9999"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also raise KeyError instead of returning None. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would keep it with None as this is kinda the expected behavior for python

get(key, default=None, /)
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.

https://docs.python.org/3/library/stdtypes.html#dict.get

Comment on lines +124 to +129
def test_get_entries_for_groups_reports_unmatched(self):
result = amdgpu_family_info_matrix_all.get_entries_for_groups(
["gfx94X-dcgpu", "gfx9999-unknown", "gfx1151"]
)
self.assertEqual(len(result.entries), 2)
self.assertEqual(result.unmatched_keys, ["gfx9999-unknown"])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "unmatched keys" useful to track like this, or could we just raise KeyError instead? Is there a use case for passing in groups that aren't in the code and getting this sort of detailed reporting on a successful (non-raising) execution flow?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

considering that get() above returns None i sadly think the current approach is the more uniform approach than raising here keyerror.

discussion with claude:
Python style:

  • GroupLookupResult as a @DataClass instead of a tuple directly follows the style guide ("no tuples for structured data")
  • The None-returning get_entry mirrors dict.get() — consistent with standard Python lookup conventions
  • Returning partial results with an explicit miss report is not "silently continuing on errors" — it surfaces the problem. The style guide's fail-fast rule targets dropping errors, not delegating them to the
    caller

Architecture:

  • get_entries_for_groups is a bulk filter operation, not a dict lookup. Bulk operations in Python conventionally report partial results rather than raising on the first miss (e.g., how `pathlib.Path.glob() just
    returns what matches, or how importlib handles missing submodules)
  • The caller-decides-strictness pattern is standard for library code — the function is a primitive, not a policy enforcer

The one thing that would violate the fail-fast principle is if unmatched_keys were silently ignored downstream. But that's a caller problem, not an API problem. The API does its job by surfacing them.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Returning partial results with an explicit miss report is not "silently continuing on errors" — it surfaces the problem. The style guide's fail-fast rule targets dropping errors, not delegating them to the
    caller

Can you give an example of where the caller would provide a list of groups that contains invalid entries? As much as possible, I'd like for the CI system (anything in build_tools/github_actions/) to be correct by construction and not need to patch around issues that it causes on its own. We can add unit tests to enforce only valid GPU targets/groups/families in any configurations (e.g. "run these tests nightly", "run these tests on every PR"). When there is user input (e.g. workflow dispatch text fields), we can validate those inputs during parsing.

I'm not sure if we even need this get_entries_for_groups() function that deduplicates and tracks unmatched. How do you see get_entries_for_groups() and even get_entry() being used? I suspect we could get away with just get_entry() and have that raise KeyError instead of returning None too.

Copy link
Copy Markdown
Contributor Author

@HereThereBeDragons HereThereBeDragons Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_entries_for_group() is the entry point for the workflow files.

the idea is that you take the comma-separate inputs.amdgpu_families, the configure_ci.py transforms it in a list and then you call get_entries_for_group() and get back all matching MatrixEntries. Later you can call MatrixEntry.to_dict(platform=linux) to get the dictionary of that particular arch for the linux platform (or None)

So I thought having an easy way of reporting back on non-matching archs might be nice :)

Comment on lines +131 to +134
def test_family_default_resolves_for_gfx110X(self):
entry = amdgpu_family_info_matrix_all.get_entry("gfx110X-all")
self.assertIsNotNone(entry)
self.assertEqual(entry.key, "gfx1101")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case is very similar to one above:

    def test_family_prefix_resolves_for_gfx110X(self):
        entry = amdgpu_family_info_matrix_all.get_entry("gfx110X")
        self.assertIsNotNone(entry)
        self.assertEqual(entry.key, "gfx1101")

Maybe merge the test cases into a single test? Or put them next to each other?

I'm not sure how useful it is to have both tests. Maybe a single "prefix" test and then a negative test showing that something like gfx110X-unknowngroup raises an exception?

Comment on lines +142 to +167
class TestDefaultConfig(unittest.TestCase):
"""Verify default field values for entries with minimal configuration."""

def test_default_build_config(self):
cfg = BuildConfig()
self.assertEqual(cfg.build_variants, ["release"])
self.assertFalse(cfg.expect_failure)

def test_default_test_config_no_runner(self):
cfg = TestConfig()
self.assertFalse(cfg.run_tests)
self.assertEqual(cfg.test_scope, "all")
self.assertFalse(cfg.sanity_check_only_for_family)
self.assertFalse(cfg.expect_pytorch_failure)
self.assertFalse(cfg.runs_on)
self.assertEqual(cfg.fetch_gfx_targets, [])

def test_default_release_config(self):
cfg = ReleaseConfig()
self.assertFalse(cfg.bypass_tests_for_releases)

def test_platform_config_initializes_all_three(self):
cfg = PlatformConfig()
self.assertIsInstance(cfg.build, BuildConfig)
self.assertIsInstance(cfg.test, TestConfig)
self.assertIsInstance(cfg.release, ReleaseConfig)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these tests useful? They look like change detector tests just checking the default field values which are trivially checkable by reading the builder functions.

Comment on lines +177 to +179
def __bool__(self) -> bool:
# True if any runner is set; used by TestConfig.__post_init__ to infer run_tests.
return bool(self.test or self.test_multi_gpu or self.benchmark or self.extra)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator overloading here doesn't feel obvious to me. Could pick a function name like has_any_runner

def has_any_runner(self) -> bool:
  return any([self.test, self.test_multi_gpu, self.benchmark, self.extra])

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean how run_tests is set?
that it is true by default if there are any runners and not overwritten to false?

for me this feels like that makes life quite simple once you get use to it?
i mean we can pack it in an extra function but then you in the ci you need to check 2x - or you just want to rename the funciton?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to the function name (overloading __bool__).

Here are some other style guide references:

  • https://google.github.io/styleguide/pyguide.html#213-properties

    Properties are allowed, but, like operator overloading, should only be used when necessary and match the expectations of typical attribute access; follow the getters and setters rules otherwise.

  • https://google.github.io/styleguide/cppguide.html#Operator_Overloading

    Pros:
    Operator overloading can make code more concise and intuitive by enabling user-defined types to behave the same as built-in types. Overloaded operators are the idiomatic names for certain operations (e.g., ==, <, =, and <<), and adhering to those conventions can make user-defined types more readable and enable them to interoperate with libraries that expect those names.

    Cons:

    • Providing a correct, consistent, and unsurprising set of operator overloads requires some care, and failure to do so can lead to confusion and bugs.

Where this is used, I think this reads more clearly:

    def __post_init__(self):
-       # set run_tests to True if any runners are specified,
-       # False otherwise, if not explicitly set
+       # Default to running tests if any runner is specified.
        if self.run_tests is None:
-           self.run_tests = bool(self.runs_on)
+           self.run_tests = self.runs_on.has_any_runner()

Comment thread build_tools/github_actions/new_amdgpu_family_matrix_types.py Outdated
"""If True, only a sanity-check test subset is run, not the full suite."""
test_scope: Literal["all", "smoke", "full"] = "all"
"""Which test subset to run: all (default), smoke (sanity only), full (skip smoke)."""
expect_pytorch_failure: bool = False
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3992 also added quick as a test scope/type

Comment thread build_tools/github_actions/new_amdgpu_family_matrix_data.py
Comment on lines +48 to +51
AmdGpuFamilyMatrix.get_default_for_family(family)
Return the is_family_default entry whose family list contains the given name,
or None if no default is set (e.g. gfx115X-all where GPUs are registered individually)
or the requested family does not exist.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "family default"? I'm struggling to see what problem that is solving.

I think this gets back to my original first-pass review comments. This is adding indirection that we shouldn't need, and it's going to make changing this code harder.

The dataflow through

  • Configure CI
  • List of GPU targets
  • List of GPU runner labels

should be pretty straight-line, and this further encodes a hierarchy of "familes" with "defaults". If we add gfx1100, gfx1101, and gfx1102 runners for example, this code would keep resolving the "default" for "gfx110X-all" as "gfx1101" when what we really should be doing is passing that full list of gfx1100, gfx1101, gfx1102 directly through the pipeline.

I don't think we should design around internal code checking for gfx110X-all (for example), at all. We can design around workflow inputs that use such convenience families at the API boundary but the CI code should directly use what it intends, without needing all that extra machinery for prefixes / grouping / defaults.

When we build generic targets (https://llvm.org/docs/AMDGPUUsage.html, e.g. gfx11-generic), then I think a "default" of some sort may make sense, but I'm not seeing the value right now. We'll actually want a different one-to-many relationship in that case, like

GFX11_GENERIC = MatrixEntry(
      target="gfx11-generic",
      linux=PlatformConfig(
          test=TestConfig(
              runs_on=[
                  GpuRunners(test="linux-gfx1100-gpu-rocm", gfx_target="gfx1100"),
                  GpuRunners(test="linux-gfx1151-gpu-rocm", gfx_target="gfx1151"),
              ],
          ),
      ),
  )

Here's a write-up from my review agent after some discussion: https://github.com/ScottTodd/claude-rocm-workspace/blob/main/reviews/pr_3653.md#%EF%B8%8F-important-family-resolution-layer-should-be-removed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can remove it.
i added it to have some mapping from "someone puts manually gfx110X as gpu in the workflow" to "requesting gfx110X means calling gfx110X-all".

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah convenience mappings like "gfx110X input" --> "build [gfx1100, gfx1101, gfx1102, gfx1103] and test on any of those that have runners" are fine. Looking further forward, if we add a web UI to trigger CI/CD jobs we can add UI options for that like:
image
image

then, the code at the CI layer could just trust its input to already be sanitized/expanded.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok then we leave it. i will see if i can clarify this a bit better in the comments.

Comment on lines +223 to +241
# gfx94X family — gfx942 is the only member of gfx94X-dcgpu
GFX942 = MatrixEntry(
target="gfx942",
is_family_default=True,
linux=PlatformConfig(
build=BuildConfig(build_variants=["release", "asan", "tsan"]),
test=TestConfig(
runs_on=GpuRunners(
test="linux-mi325-1gpu-ossci-rocm-frac",
test_multi_gpu="linux-mi325-8gpu-ossci-rocm",
# TODO(#2754): Add new benchmark-runs-on runner for benchmarks
benchmark="linux-mi325-8gpu-ossci-rocm",
# TODO(#3433): Remove sandbox label once ASAN tests are passing
extra={"test-sandbox": "linux-mi325-8gpu-ossci-rocm-sandbox"},
),
fetch_gfx_targets=["gfx942"],
),
),
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an idea: what if we pulled the GpuRunners out into a single list, as an inventory of what test machines we have?

  runners = [
      GpuRunner(platform="linux",   target="gfx942",  label="linux-mi325-1gpu-ossci-rocm-frac"),
      GpuRunner(platform="linux",   target="gfx942",  label="linux-mi325-8gpu-ossci-rocm", role="test-multi-gpu"),
      GpuRunner(platform="linux",   target="gfx942",  label="linux-mi325-8gpu-ossci-rocm", role="benchmark"),
      GpuRunner(platform="linux",   target="gfx90a",  label="linux-gfx90a-gpu-rocm"),
      GpuRunner(platform="linux",   target="gfx1030", label="linux-gfx1030-gpu-rocm"),
      GpuRunner(platform="linux",   target="gfx1151", label="linux-gfx1151-gpu-rocm"),
      GpuRunner(platform="linux",   target="gfx1151", label="linux-strix-halo-gpu-rocm-oem", role="oem"),
      GpuRunner(platform="windows", target="gfx1101", label="windows-gfx110X-gpu-rocm"),
      GpuRunner(platform="windows", target="gfx1151", label="windows-gfx1151-gpu-rocm"),
      GpuRunner(platform="windows", target="gfx1151", label="windows-gfx1151-gpu-rocm", role="benchmark"),
      # ...
  ]

(we could also have separate lists for Windows and Linux, to make the code more dense)

Code would query that list of runners when deciding how to test. If we want to take a runner offline we can comment out that row. If we want to pull that list from a dynamic location we don't need to update the entire build configuration (asan/release/etc.) too.

Here's how some of the other code using this could look:

# Matrix entries — per-target-per-platform policy
 entries = [
      MatrixEntry(target="gfx942",  platform="linux"),
      MatrixEntry(target="gfx1151", platform="linux",   sanity_check_only=True),
      MatrixEntry(target="gfx1151", platform="windows", test_scope="full"),
      MatrixEntry(target="gfx1152", platform="linux",   expect_build_failure=True),
      MatrixEntry(target="gfx1152", platform="windows", expect_build_failure=True),
      # ...
  ]

  # Group configs — what CI does per trigger
  presubmit = GroupConfig(
      build_families=["gfx94X-dcgpu", "gfx110X-all", "gfx115X-all", "gfx120X-all"],
      test_targets=["gfx942", "gfx1101", "gfx1151", "gfx1201"],
      build_variants=["release"],
  )

Here's a more concrete proposal: https://github.com/ScottTodd/claude-rocm-workspace/blob/main/reviews/pr_3653.md#proposed-alternative-three-flat-tables

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am all in in putting the gpurunners in a separate list. (i would also hope in the future we have a database to populate it. then we also dont have problems with outdated runners names)

however i think the rest of the proposal of claude is a downgrade and i would rather not.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm proposing a similarly structured list in some new code here: #4386

@dataclass(frozen=True)
class S3BucketConfig:
    """Metadata for a single bucket in S3"""

    name: str
    """S3 bucket name (e.g. 'therock-ci-artifacts')"""

    ...

s3_bucket_configs = [
    # CI (self-hosted runners include credentials for therock-ci-artifacts-external)
    S3BucketConfig("therock-ci-artifacts", iam_role="therock-ci"),
    S3BucketConfig("therock-ci-artifacts-external", iam_role=None),
    # Release type "dev"
    S3BucketConfig("therock-dev-artifacts", iam_role="therock-dev"),
    S3BucketConfig("therock-dev-packages", iam_role="therock-dev"),
    S3BucketConfig("therock-dev-python", iam_role="therock-dev"),
    S3BucketConfig("therock-dev-tarball", iam_role="therock-dev"),
    ...
]

_BUCKET_CONFIGS_BY_NAME = {c.name: c for c in s3_bucket_configs}

...

def get_artifacts_bucket_config(...) -> S3BucketConfig:
    ...
    bucket_name = "therock-ci-artifacts"
    return _BUCKET_CONFIGS_BY_NAME[bucket_name]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. i will have a look how to best adapt this.

…ust python style type declaration (remove optional, ...), rename __bool__ with has_any_runner; extend description of is_family_default
…m from 28th april, introduce bypass_tests_for_unscheduled (upstream = nightly_check_only_for_family)
…a deep copy and only contains the build variant requested, remove unsused lookup() which was very similar to getEntry(), support new gpurunner and buildrunner layout
…hted selection, uptodate with april 28 amdgpu_family_matrix.py
@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

HereThereBeDragons commented Apr 29, 2026

New AMD GPU Family Matrix — what's new and how to use it

TL;DR: amdgpu_family_matrix.py (the big dict-of-dicts) is being replaced by three dataclass-based modules. Same data, but typed, validated, and with a real lookup API. Old file stays in place during transition — keep both in sync.

The matrix now also takes a lot of heavy lifting off configure_ci.py — build-variant scoping (dropping platforms that don't support a variant, trimming build_variants to the requested one), build-runner re-binding for non-release variants, GPU runner inventory selection, and run_tests inference all happen inside get_entry() now, so callers just consume the result.
Manual intervention is only needed for special runners like "oem" that will be delivered in the ["extra"] field and needs to be set for test/test_multi_gpu/etc.

Files

File Purpose
new_amdgpu_family_matrix_types.py Dataclass schema (no data)
new_amdgpu_family_matrix_data.py The actual matrix entries — this is where you add/edit GPUs
new_amdgpu_family_matrix_runners.py Build/GPU runner inventories with weighted selection — this is where you add/edit Runners

How to look up an entry

from new_amdgpu_family_matrix_data import amdgpu_family_info_matrix_all as matrix

# Exact target name
entry = matrix.get_entry("gfx942")

# Or family name — resolves to the is_family_default entry
entry = matrix.get_entry("gfx94X-dcgpu")

# Build variant scoping (default is "release")
asan = matrix.get_entry("gfx942", build_variant="asan")  # None if unsupported

# Multiple keys at once
result = matrix.get_entries_for_groups(["gfx942", "gfx1151"])
result.entries          # matched MatrixEntry list
result.unmatched_keys   # keys with no match (unknown or unsupported variant)

get_entry() always returns a deep copy scoped to the requested variant — mutate freely, no leakage into the shared matrix.

Important behavior changes vs. the old dict

  1. Family membership is auto-populated from cmake/therock_amdgpu_targets.cmake — no longer hand-listed per entry. CMake is the source of truth.
  2. Runners are no longer hardcoded per entry. BuildConfig.runs_on and TestConfig.runs_on are filled from the inventory in _runners.py (with weighted selection across pools). Override per-entry only when you need to pin a specific runner. (already includes [ci] Enabling aws-linux-scale-rocm to 10%  #4899)
  3. run_tests is inferred from whether any GPU runner is set. Set it explicitly to True/False only to override.
  4. Build variants (release / asan / tsan) are first-class. Looking up a non-release variant drops platforms that don't list it and re-binds the build runner to the matching pool.
  5. is_family_default marks the entry returned for family-name lookups (e.g. gfx94X-dcgpu → which specific gfx target). Validated at construction — at most one default per gfx-prefixed family.
  6. Test scope is now a Literal["quick", "comprehensive", "full"] — typo-safe.
  7. bypass_tests_for_unscheduled lets you skip tests on PR/push but still run on nightly schedule. (renamed from nightly_check_only_for_family )
  8. Add a MatrixEntry to _MatrixEntries in _data.py — that's it. Family lists are derived from CMake; runners come from the inventory. Then mirror the change in the legacy amdgpu_family_matrix.py until that file is removed.

Adding a new field

  • Add to the right dataclass in _types.py with a default
  • If it's derived, compute it in __post_init__
  • Add to to_dict() if consumers need it in the serialized output

Notes

@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

btw @geomin12 volunteered to help port the multi arch configure ci to use this new version :)

and i think this is a good point to kill off/stop supporting the old configure_ci.py

@HereThereBeDragons
Copy link
Copy Markdown
Contributor Author

#4613 for tracking

Copy link
Copy Markdown
Contributor

@jayhawk-commits jayhawk-commits left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need some time to digest the design in this latest state,, but Codex did find a difference in behaviour from the current matrix.

In amdgpu_family_matrix.py, Linux tsan is defined without an expect_failure override, so current consumers treat TSAN as blocking. With this replacement data, BuildVariantInfo.to_dict() will serialize expect_failure=True, and a future migration would turn TSAN into a continue-on-error job unless that policy change is deliberate.

@ScottTodd
Copy link
Copy Markdown
Member

In amdgpu_family_matrix.py, Linux tsan is defined without an expect_failure override, so current consumers treat TSAN as blocking. With this replacement data, BuildVariantInfo.to_dict() will serialize expect_failure=True, and a future migration would turn TSAN into a continue-on-error job unless that policy change is deliberate.

expect_failure isn't even in use or working right now. I think we should remove it, see #4500

Copy link
Copy Markdown
Contributor

@geomin12 geomin12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! i did a cross reference with amdgpu_family_matrix.py, and all is covered (also checked with claude). have some housekeeping comments on easy of viewing and some additional entries

a family, the is_family_default entry carries the full CI config; siblings
get minimal config and inherit on family-name lookups."""

GFX900 = MatrixEntry(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gfx90c was added as well!

therock_add_amdgpu_target(gfx90c "AMD Renoir/Lucienne/Cezanne iGPU" FAMILY igpu-all gfx90c-igpu

##########################################################################################


def _build_groups() -> dict[str, list[str]]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we perhaps set these fields in the _MatrixEntries? that way we don't need to hard-code this and can collect the matrices from _MatrixEntries?

'test-sandbox'). Extra keys are forwarded as-is into GpuRunners.extra."""
label: str
"""The runner label string."""
weight: float = 1.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some comment that is a temporary stopgap? i really don't like separate labels and having to adjust weights. i plan to work with OSSCI to update

_TEST_RUNNER_INVENTORY: list[_GpuRunnerEntry] = [
# gfx90a
_GpuRunnerEntry(
platform="linux", target="gfx90a", role="test", label="linux-gfx90a-gpu-rocm"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we make this new line like the bigger entries? for consistency... although i don't know if pre-commit will complain or not

),
# gfx1030
_GpuRunnerEntry(
platform="linux", target="gfx1030", role="test", label="linux-gfx1030-gpu-rocm"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for all these entries

Comment on lines +245 to +262
_BuildRunnerEntry(
platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90
),
_BuildRunnerEntry(
platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10
),
_BuildRunnerEntry(
platform="linux",
variant="sanitizer",
label="azure-linux-scale-rocm-heavy-ramdisk",
weight=1,
),
_BuildRunnerEntry(
platform="windows",
variant="default",
label="azure-windows-scale-rocm",
weight=1,
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add over-arching comments to highlight the differences?

Suggested change
_BuildRunnerEntry(
platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90
),
_BuildRunnerEntry(
platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10
),
_BuildRunnerEntry(
platform="linux",
variant="sanitizer",
label="azure-linux-scale-rocm-heavy-ramdisk",
weight=1,
),
_BuildRunnerEntry(
platform="windows",
variant="default",
label="azure-windows-scale-rocm",
weight=1,
),
# default
_BuildRunnerEntry(
platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90
),
_BuildRunnerEntry(
platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10
),
# sanitizer
_BuildRunnerEntry(
platform="linux",
variant="sanitizer",
label="azure-linux-scale-rocm-heavy-ramdisk",
weight=1,
),
_BuildRunnerEntry(
platform="windows",
variant="default",
label="azure-windows-scale-rocm",
weight=1,
),

something like so? it was initially tough to spot

platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90
),
_BuildRunnerEntry(
platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, can we make this have new lines with each entry? to make it easier to adjust / consistency with others

as-is regardless of weight; weights are relative and need not sum to 1.0."""
if len(entries) == 1:
return entries[0].label
return random.choices(entries, weights=[e.weight for e in entries], k=1)[0].label
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

way cleaner, thanks :)



# Roles that map to the named fields on GpuRunners. Anything else goes into `extra`.
_NAMED_ROLES = {"test", "test_multi_gpu", "benchmark"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is sandbox supposed to be in here?

suffix="asan",
cmake_preset="linux-release-asan",
),
"tsan": BuildVariantInfo(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know host-asan will be added soon with #4603

Copy link
Copy Markdown
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as "reviewed" to drop from my review queue.

I still think we should remove the concept of gpu "families" from this part of the CI system entirely. This PR is probably too big and complex at this point - it would be faster to start fresh on a new branch/PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: TODO

Development

Successfully merging this pull request may close these issues.

6 participants