Refactor new_amdgpu_family_matrix to use data classes instead of dictionary by HereThereBeDragons · Pull Request #3653 · ROCm/TheRock

HereThereBeDragons · 2026-02-26T19:58:53Z

Replaces the nested dictionary format in new_amdgpu_family_matrix.py with two new files using data classes as suggested in #3452:

new_amdgpu_family_matrix_types.py — dataclass schema definitions
new_amdgpu_family_matrix_data.py — matrix data (arch, predefined arch groups, build variants)

This will allow for easier maintenance of the entire matrix: Parameters must now be defined and have the description right next to it (instead of being spread throughout the file) and only non-default values need to be provided for each arch. In addition, helper functions provided by the matrix classes will simplify the new matrix generator logic (replacement of configure_ci.py).

Changes:

Typed fields with defaults — only non-default values need to be specified per entry
Arch entries got flatten to sepearate _GFX* MatrixEntry definitions
Auto-discovery of _GFX* entries — adding a new GPU requires only a new variable, no registration step
Added unittests

Test Config:

run_tests auto-determined based on if a runner is set or not; set explicitly to False to reserve a runner without enabling tests
GpuRunners.extra — catch-all dict[str, str] for non-standard or temporary runners (e.g. test-sandbox, oem) without schema changes
New param to define which tests are to run: TestConfig.test_scope — typed Literal["all", "smoke", "full"]

Introducing new API for accessing the amdgpu matrix:

Case-insensitive key lookup — get_entry("GFX950") and get_entry("gfx950") are equivalent
Family-level default resolution — get_entry("gfx950") resolves to gfx950-dcgpu via is_family_default
get_entries_for_groups — resolves to a predefined group list (e.g. amdgpu_presubmit) to the corresponding MatrixEntry objects in definition order (--> will allow to remove this step from the generator logic)
- to_dict(platform=None) — serializes full matrix or a single platform config with amdgpu_family key included

Includes changes of:

gfx90X split — gfx906, gfx908, gfx90a registered as individual entries ( Split gfx906/gfx908/gfx90a into separate families #2869 )
removed push_on_success as not used anymore

…n, introduce test_scope, remove push_on_success, always create all configs when declaring platformconfig, make arch lookup case-insensitive, add unittest

HereThereBeDragons · 2026-02-27T18:13:41Z

what is missing (as not yet merged): #3564

ScottTodd

There are some useful improvements to code structure in here but I think some of the details around targets/families aren't yet in alignment with what the CI/CD systems need at their core. I also think that having a separate set of new_* scripts is going to keep biting us. Need to complete the migration so we only have one set of files serving the same purpose at a time.

ScottTodd · 2026-03-02T19:57:14Z

+scope         Subgroup within a family. Generic scopes are "dcgpu", "dgpu", "all".
+              Specific GPU names (e.g. "gfx1151") are also valid scopes.


This is not an exhaustive list. See https://github.com/ROCm/TheRock/blob/main/cmake/therock_amdgpu_targets.cmake. We also have igpu for example.

acronym meaning

dgpu discrete gpu

dcgpu datacenter gpu

igpu integrated gpu

The concept of GPU families is also going to be less consequential when we've switched fully over to the multi-arch pipelines, so let's be careful about what abstractions we add here and if they will continue to make sense.

Looking through that lens and stepping back a bit, I don't understand the distinction this is making between a "family" and a "scope". That CMake file is the source of truth, and it defines a family as the full string (e.g. gfx94X-dcgpu), NOT just a particular prefix (e.g. gfx94X).

For (non-multi-arch) CI/CD, we choose a list of target families to build for, as each family can then share device-agnostic build stages and amortize binary size overhead.

For multi-arch CI/CD we're going to choose a list of targets to build and those MAY be grouped by family, but they don't need to be as the build will be efficiently parallelized even if they aren't. Similarly the packages will share the same device-agnostic files no matter how the GPU targets/families are split.

For testing, we should make it clear what the mappings are for "gpu target" to "runner label". We could try to make those mappings all be 1:1 ("gfx1151" -> "linux-gfx1151-gpu-rocm") instead of being 1:N ("gfx110X-all" -> ["linux-gfx1100-gpu-rocm", "linux-gfx1101-gpu-rocm"]).

I don't think we should focus too much on "dcgpu", "dgpu", "all", etc. at this level of the infra.

i know this is not an exhaustive list, i guess adding ... will help to avoid this confusion and expanding the frozenset of _GENERIC_SCOPES (but right now we have no entry for igpu.. :) )

the idea for family and scope is similar to family and target - just that scope is agnostic if it defines a single gpu arch or a family (to avoid ambiguities i named it "scope"). this is mainly done due to the CI scripts currently accepting inputs like "gfx94X" and expanding it to "gfx94X-dcgpu". However, we have some families that have multiple igpu, dgpu etc. - so it is not clear to what it should be expanded to - or to be exact: right now this is not possible in amdgpu_family_matrix to have multiple entries for the same family. E.g. having entries for both gfx110X-all and gfx110X-igpu with different runners would not be possible.

As such i added the is_family_default, so if we say: gfx110X-all == is_family_default,you get the following behavior within the CI:

gfx110X --> will auto-expand to gfx110X-all

gfx110X-all will match gfx110X-all

gfx110X-igpu will match with entry _GFX110X_IGPU if available, otherwise fail

-igpu and -all can have different settings, runners, etc.

If we put it in the right order in the file, we could probably even deep-copy the settings and just adjust what we need to.

I'm still concerned here. This looks like a lot of boilerplate that isn't going to be useful in the multi-arch CI/CD world (cutover is imminent - please no new feature work on the older CI).

Target families and scopes, at least in their current form, are a concept that was engineered for the non-multi-arch CI.

We need a direct mapping from "gpu architecture" to "runner label".

We need a direct mapping from "gpu architecture" to "rocm build status", "rocm test status", "pytorch build status", etc.

We could list targets directly and construct families on top of them. I think this is inverted:

GFX1151 = MatrixEntry( family="gfx115X", scope="gfx1151", linux=PlatformConfig(

I think I'd rather have something more like:

GFX1151 = GPU_ARCHITECTURE(...) ... FAMILY_GFX1151 = [GFX1151] FAMILY_GFX110X_ALL = [GFX1100, GFX1101, GFX1102]

@marbre may need your eyes here too.

Discussed this offline wit Laura, she will provide an update on what we concluded on.

we decided on the following considering that multi arch will most likely in the future only be operating on specific targets/archs:

remove scope, and remove having "-dcgpu" or any other family specification as its own entry (as it is currently in amdgpu_family_matrix). instead have only entries for concrete archs (=target).

for each target, auto-populate its families based on cmake/therock_amdgpu_targets.cmake

have getters for: get "default" target for when requesting a family (for backward compatibility). is currently based on which gpu runners we have. e.g. gfx120X-all --> returns entry gfx1201.

have getters for: "give me all entries for a family" so in the future we could replace "fetch-gfx-targets": "gfx120X-all" --> ["gfx1200", "gfx1201"]

have a lookup() that tells you how an entry is resolved: gfx942 -> "target", gfx94X-dcgpu -> "family", gfx94X -> "family_prefix" (see question below)

when requesting a list of gpu entries, deduplicate by default: get_entries_for_groups(["gfx942", "gfx94X-dcgpu"]) --> just returns a single entry "gfx942"

NOTE: families are only considered for all the entries which have a MatrixEntry in the file. e.g. we do not consider any "igpu" as we dont have any entries for them. so when requesting a family "-all" those are omitted as no entry exists for them.

open questions:

right now we have partial matching of families. however it is difficult to decide which family it should be expanded to. e.g. "gfx120X" --> returns entry "gfx1201". but if we want to set amdgpu_family: in the workflow, we kinda need to expand it to "gfx120X-all". so how do we do it? or do we just remove this functionality?

see claude reply:

❯ i wonder if we should just remove the gfx94X matching. as we need to know exactly if gfx94X-all or gfx94X-dcgpu to know what we are building ● That's a valid point. The prefix match hides an important distinction — gfx94X-dcgpu builds only discrete GPU targets, while gfx94X-all would include everything in the family. Silently picking whichever family happens to match first is ambiguous and could cause wrong builds. The arguments for removing it: - Forces callers to be explicit about what they're building - Removes ambiguity when a family has multiple scopes (dcgpu vs all) - The only convenience lost is saving 6 characters (-dcgpu / -all) The argument for keeping it: gfx94X is a natural shorthand people might type, and it's better to resolve it than error. But that's a weak argument — failing with "did you mean gfx94X-dcgpu or gfx94X-all?" is more helpful than silently picking one.

future questions:

what happens if we request a family but have multiple runners? do we return a list of runners? do we keep the "just return the default_family_entry"? something else? or is this just not a scenario that could be happening?

i think we can auto-populate fetch-gfx-targets but it would be good to know what the exact requirements are, as for some it returns multiple targets and for others just a single one eventhough we have more entries for the same family

ScottTodd · 2026-03-02T20:15:32Z

+        "gfx1152",
+        "gfx1153",
+    ],
+}


As I understand, postsubmit workflows run the union of the "presubmit" and "postsubmit" sets. Maybe we should either:

Document that here in comments

Actually build the lists that way? A few options:
# Option A: construct sets separately amdgpu_presubmit_set = set(["gfx94X-dcgpu", "gfx110X-all", "gfx1151", "gfx120X-all"]) amdgpu_postsubmit_set = amdgpu_presubmit_set | set(["gfx950-dcgpu"]) amdgpu_nightly_set = amdgpu_postsubmit_set | set([ "gfx906", "gfx908", "gfx90a", "gfx101X-dgpu", "gfx103X-dgpu", "gfx1150", "gfx1152", "gfx1153", ]) amdgpu_family_predefined_sets = { "amdgpu_presubmit": amdgpu_presubmit_set, "amdgpu_postsubmit": amdgpu_postsubmit_set, "amdgpu_nightly": amdgpu_nightly_set, }

# Option B: construct a single list and then parse out the individual sets from enum import Enum class TriggerFrequency(Enum): PRESUBMIT = 1 POSTSUBMIT = 2 NIGHTLY = 3 amdgpu_family_groups = { # Presubmit "gfx94X-dcgpu": TriggerFrequency.PRESUBMIT, "gfx110X-all": TriggerFrequency.PRESUBMIT, ... # Postsubmit "gfx950-dcgpu": TriggerFrequency.POSTSUBMIT, ... } # Then construct sub-lists from that full list?

I'm basically looking for a design here that makes it

easy to see which GPU targets/families are build/tested by default for each event

easy to move build/test frequency up or down

hard to make configuration mistakes

option 1 seems for me way easier to understand and extend. for the future we might want to have additional predefined groups for weekly runs etc.

ScottTodd · 2026-03-02T21:08:08Z

+amdgpu_family_info_matrix_all = AmdGpuFamilyMatrix(
+    entries=[
+        v
+        for k, v in globals().items()
+        if k.startswith("_GFX") and isinstance(v, MatrixEntry)
+    ]
+)


Oh, this is an interesting way to collect entries into a single list without needing to explicitly add them. Could put them in a class scope to avoid needing to use globals() though?

yeah i wanted to make it less error prone when editing. adding a new arch is already annoying enough considering over how many lines a single entry spans - and then keeping manually track of which are available, changed naming or whatever is just horrible

HereThereBeDragons · 2026-03-03T16:10:32Z

yes, i know the new_* is biting us but I am missing some extra hands to get this more quickly done :(

geomin12

much cleaner i like it!!! just one comment but lgtm

geomin12 · 2026-03-03T22:39:05Z

+    """If True, only a sanity-check test subset is run, not the full suite."""
+    test_scope: Literal["all", "smoke", "full"] = "all"
+    """Which test subset to run: all (default), smoke (sanity only), full (skip smoke)."""
+    expect_pytorch_failure: bool = False


there are more of these args coming, it may be good to add that extra dict here as well, as we add/remove optional args

#3992 also added quick as a test scope/type

geomin12 · 2026-03-03T22:39:48Z

yes, i know the new_* is biting us but I am missing some extra hands to get this more quickly done :(

I think it is fine for the time being!

…lass scope to not pollute module space

erman-gurses

Added some comments

erman-gurses · 2026-03-10T05:20:48Z

+    def get_entries_for_groups(self, group_keys: list[str]) -> list[MatrixEntry]:
+        """Return entries matching the given keys, preserving order and skipping unknowns."""
+        result = []
+        for key in group_keys:
+            entry = self.get_entry(key)
+            if entry is not None:
+                result.append(entry)
+        return result


get_entries_for_groups() currently skips unknown keys silently. For CI configuration, that feels risky because a typo in a predefined group could quietly reduce build/test coverage instead of failing fast. Typo like gfx94X-dcgup would just disappear from the result

I think this should raise on unknown keys, or at least have a strict mode used by tests / workflow generation.

i will add that not only the matching entries are returned but also unmatched keys.

please also note that while predefined groups are being used, in theory any list of strings can be used as group_keys

I see your point. How about something like:
def get_entries_for_groups(group_keys, strict=False):

When strict=False, unknown keys can be skipped (for cases like experimental or user-provided keys).
When strict=True, the function could raise if any key is not recognized.

That way the helper stays flexible for general use, but CI/workflow generation can use strict=True to avoid silently dropping entries due to typos.

@ScottTodd any opinion on it?
i would argue that giving back a tuple of match entries, and unmatched keys is enough and any script on top can have the logic to fail if needed.

Would be good If we get @ScottTodd's opinion. I see that your approach gives more flexibility on the caller side.

erman-gurses · 2026-03-10T05:22:46Z

+    def test_get_entries_for_groups(self):
+        entries = amdgpu_family_info_matrix_all.get_entries_for_groups(
+            amdgpu_family_predefined_groups["amdgpu_presubmit"]
+        )
+        self.assertGreater(len(entries), 0)
+        for entry in entries:
+            self.assertIsInstance(entry, MatrixEntry)


Since get_entries_for_groups() is used on predefined CI groups, maybe we can add a negative test for an unknown key here. Otherwise a typo in a group can silently pass through without coverage loss being obvious.

For example, a test that verifies unknown keys either raise or are handled explicitly would make this safer.

erman-gurses · 2026-03-10T05:40:41Z

+    def to_nested_dict(self) -> dict:
+        """Convert to the original nested-dict format: family → target → platform → ..."""
+        result: dict = {}
+        for entry in self.entries:
+            result.setdefault(entry.family, {})[entry.scope] = entry.to_dict()
+        return result


Do we use this method anywhere? I may have missed it, but I didn’t find a usage in this PR or in the added tests.

we will be using it to create the dictionary that we use in the workflows later (created by the refactored configure_ci.py which will be the next PR step).
can see if i can add a test

will add some testing. but not sure how useful it really is

It was just a suggestion - maybe we can add after its usage is active.
As far as I understand, we can do something like this in the future for regression testing.

assert result["gfx94X"]["build"]["amdgpu_family"] == "gfx94X"

## Motivation Bump rocm-systems from 93bc019 to 093b66caa3 (includes fix for hip-tests issue and revert for mathlib hiprtc issues and revert for rccl-test, added revert for miopen failures due to PR 653): Commits: 093b66caa3 (HEAD, origin/develop, origin/HEAD) Revert "SWDEV-546177 - hipModuleGetLoadingMode API impl (#653)" (#3858) d8a0adbc9f [AMD-SMI] Hide libamd_smi.so internal symbols (#3777) d4da458f94 [rocprofiler-sdk] [Documentation ] Updating changelog (#3827) 19fadeb082 (origin/users/abchoudh/fix_dispatch_count) [RCCL][Tuner Plugin] Enable tuning of RCCL tuning constants (#3757) b4f5f8a6a8 rocr: Fix IPC dmabuf hang with large allocations (#3211) 64efea0435 RCCL: allow users to override max and per job memory & fix defaults. (#3797) 9b3dd101bb Removing ready_for_review (#3849) 7e43880a64 [rocprofiler-systems] Update ROCm version to 7.2.0 in CI workflows for Debian, RedHat, and Ubuntu (#3431) 1fdb6b9827 [rocshmem] add gda/topology unit tests (#3715) be1ea24a96 Move hipMipmappedArrayGetMemoryRequirements test to common tests e4513f04c8 Update amdgpu-windows-interop with latest changes, pal 58aa0bab2ced0cc9ebe8d2d0932db6774feb4e49 2026-03-04(#3773) b1f964d796 [rocprofiler-compute] Ensure long kernel name fully shows in compute analyze (#3665) 4dcf1e3ce0 SWDEV-567112 - Replace test names (#3787) 33f5f302e5 ROCM-2428 - fixes hipStreamBatchMemOp invalid operation checks (#3099) 139f4bfff8 [SWDEV-556456] Align HIP_UUID with rocminfo (#3614) 8e8928544c Reduce buffers alignment to 4 bytes (#3821) 51be29a647 AIRUNTIME-125: Consolidate Windows optimization and debug flags (#3825) 1407392240 [AMD-SMI] CI: Fix root workflow to use ASIC-specific test filters (#3807) 63f78a98d7 (origin/users/mcao/fix_rocpdsummary) [ROCM-SMI] Fix DRM include dirs leaking absolute build paths to consumers (#3808) caf2f7e1eb [ROCM-186] amd-smi: Add support for a VRAM and GTT tuning interface (#3636) a0712d4c2a [TheRock CI] Update projects_to_test lists (#3749) 02090c42c9 rocrtst: install gfx .hsaco files to share/rocrtst (#3744) 4a0a1cbfce Merge other simd table (#3696) 0d07657d78 Add missing kwargs from rocprofiler_add_integration_validate_test in .cmake-format.yaml (#2336) 3a3df301dc Optimize device counting service GPU interactions (#1583) 95d9da0098 Add SPM Enable flag in build infrastructure (#3677) 12bb9435b2 [rocprofiler-sdk] On-demand GPU profile queue creation/destruction (#3586) 941057c2c0 Navi4 tuning table iter 1 (#3052) dbf2b7369f [AMD-SMI] Display N/A for cu_occupancy when file is unavailable (#3589) b0efc7c639 [RCCL] [UT] Add ROCTX test (#3625) ba7a20ea18 Reducing the p2pnChannels for half-subscription A2A on multi-node MI350 (#3381) 75238c98a2 [clr] Fix memory leak in getOrCreateHostcallBuffer (#3699) af2ee0e8ad [hip-tests] ASAN Check for image support before we create context (#3834) ad4496678e Update windows ci subtree in include amdgpu-windows-interop (#3814) c8ad252208 [rocprofiler-register] Fix compilation with system fmt/glog (#1243) 781881544d Update README to include dbgapi and debug agent components (#3731) 88e4a7837e ROCProfiler and ROCTracer: Modifying deprecation note (#3831) b5918a5f35 [ROCM-3124-3125-3126] CUID file generation hangs on MI350 systems/CUID test failures/Segmentation fault in CUID example code (#3548) 97a5dd993c Update copyright to use SPDX IDs (#3805) 511730ab45 [rocshmem]: add flood-amo tester (#3653) 2d650a0065 [clr] Fix heap use after free error in device allocations (#3789) b6b179ad81 Disable hipHostRegister_Negative test for ASAN (#3832) 39ec318c8d [RCCL] Add GDA alltoallv via rocshmem integration (#3613) fb0f4d53b1 [RCCL] [CUMEM] Fix cuMem multi-process runs (#3811) c3de7d4bf6 SWDEV-526201 - Fix and enable disabled HIP tests from warp group (#3089) 8d9a8ca161 roofline: code cleanup and refactor vector types (#3813) 8957e49028 Don't wait on command completion if worker thread is destroyed (#3790) 9e7586a5fa [rocshmem] Add barrier APIs and expose `ROCSHMEM_TEAM_WORLD` on device (#3651) 91b09235b0 Revert "fix local gpu release static build failure (#3667)" (#3799) 0fda754b1b libhsakmt: Add secondary KFD context creation support ee43db95b0 Revert "Update TheRock reference to 20260303 commit (#3709)" (#3826) 86e28b9fae Added fix to update GL2C counters instance count for GFX11.5 (#3100) 93f69f7de4 Adjust includes to match use (#3742) e9fbc3f1a2 (develop) Update TheRock reference to 20260303 commit (#3709) be0675a1a6 (HEAD) Revert "Support fp8 types in hiprtc (#2605)" (#3792) 3e3a94a4ef [rocprofiler-systems] Add trace_cache support for std::optional<T> serialization (#3490) 0b42a7f472 clr: Eliminate unnecessary kernel name string copies (#3774) b6b0d77b29 rocr: Add hsa_amd_memory_async_batch_copy API for batched memory copies (#3259) 486e6d12d2 Resolve staircase RS regression with 48 max channels (#3684) eb59c85ac4 [gfx942][gfx950] Leverage new cache bypass builtins for simple protocol where available (#2847) 4d74d27f0e (origin/users/raramakr/rocm-smi-target) Revert "Auto Labeler: Add ci:regression-detection label to rccl PRs (#3543)" (#3769) 8f0795517c [AMD-SMI] CI: Use ASIC-specific test blacklists in workflows (#3775) 7cef5b64c1 Fix MFMA total FLOPS calculation (#3371) aea37512ba Remove duplicated tests (#3235) b6c656fdd4 Remove duplicated tests in memory module (#3087) ca3137d8f9 [rocprofiler-sdk] Install integration tests without building for therock & Misc. fixes (#3047) 0ab5c41f65 [rdc] Enable on-demand queue mode in rocprofiler-sdk to prevent inference degradation (#3629) a1eb2a1f7c rocr/wsl: a library should not output to std::out by default (#3718) b7da296cc8 Reenable flood_put/get testers on mlx5 since they should work after pr2732 (#3748) 000e24de2f [rocprofiler-sdk] Add automatic late-start support to rocprofiler_force_configure (#2168) 64ea87f592 [hip-tests] Fix memory leaks in hipMemPoolTrimTo tests (#3643) 543a7d765f rocr: Include code object allocs in lightweight coredump a58da378d4 [rocdecode] - update rocdecode ctest (#3768) f88e4ee44d [rocprofiler-systems] Make CDash submit non-fatal and add GitHub Actions logging (#3525) cb14debc3a [rocprofiler-systems] Update nlohmann-json submodule (#3391) 449253009a SWDEV-567112 - Introduce new mechanism for tagging and disabling tests - Part 2 (#3707) 8ca991393d disabling rccl from full build (linux), covered in RCCL CI (#3770) c4fdb20b74 [ATT] Re-enable tests. Add option to specify perf to target CU only (#2819) 615aab95ed ROCM-3816 Out of Memory fix (#3588) 8ffad41b24 Fix rocm_smi64 exporting invalid absolute paths to consumers (#3717) 042d76a626 rocr: Remove dependency on KFD in Runtime::VMemoryHandleMap (#2515) 555db59b2a [AMD-SMI] CPU: Added support for family 1A Models 50h-57h (#3206) 3affa2c7a3 [SWDEV-555935] Fix shared mutex and self-heal (#3729) ba0bf0f3db Replace hipMemGetInfo with ihipMemGetInfo and use it for internal calls. (#2845) c5cef9b18e Fix HIP_RETURN on all HIP API calls. (#2838) 241ce7ba83 Revert "memory: fix "contiguous_bytes" calculation in generic conversion (#3285)" (#3755) 8a690f482e [kpack/clr] Windows PE/COFF support for kpack artifact splitting and runtime loading (#3728) 863bdf8aa8 MFMA pre-processor guards for ipc.hip (#3724) 90bb9b1921 Release queue outside of vgpusAccess lock (#3705) de4523910c clr: Add build support of ROCR and PAL backends together (#3722) dfb7abc2d8 [rocprofiler-sdk] RCCL API changes for RCCL_API_TRACE_VERSION_PATCH = 3 (#3477) d69d4f23db [AICOMRCCL-633] - Fixed warnings in tests (#3402) 067d86dcaa rocr/wsl: Disable AQL Queue usage with flag ROCR_USE_PM4 (#3663) 594eb60d42 [TheRock CI] rocm-systems build full ROCm stack (#3182) 27d17e8ea0 [ROCProfiler-SDK] Fix SWDEV-556922: Handle comments before checking for pmc: (#1723) c80d90439d memory: fix "contiguous_bytes" calculation in generic conversion (#3285) 669987c83f [hip-tests] ASAN - add missing release handles (#3735) a24bbd75a4 fix local gpu release static build failure (#3667) 259b2ff913 Speed up DeviceId (#2803) 65d9264bf4 Simplify MPI trace merge logic and remove legacy guards (#3562) 1076c083cb use system to look for zcat path instead (#3720) 22f1d19db3 [AICOMRCCL-355] Enable threshold-based p2p-batching (#3000) a2e4c794d2 Partially flatten template tests cases (#2597) e242abe219 Pass space separated gfx target list to RCCL build command (#3701) 4f78aea66d SWDEV-570074 - Refactor Memset memory object handling. (#2228) b3ad12d834 Support Nvidia build on theRock for HIP-tests (#3335) a1cf15ea9a Support fp8 types in hiprtc (#2605) 8ef84b0a50 [rocprofiler-systems] Add HPC examples to automated testing (#3437) db3a70dfa0 Free memory which was allocated in tests (#3710) 27e6809c7e [rocprofiler-systems]: Fix rhel CI failure on for MPI and UCX tests (#3700) 0d9aaf59d8 rccl/topo_expl: fix build issue. (#3719) be04d75765 Fix zcat path used for checking kernel configs (#3423) cab60a7b27 rocr/thunk/win: Add CU mask support (#3518) 5b3d826c05 [CUMEM] Initial support for cuMem APIs (#2763) 0606ff491f [HIP] [PLAT-194496] Improve Stress_hipMalloc_HighSizeAlloc reliability (#3550) 05750a77cc fix hip-test name in config (#3716) 33f777f3e9 hsakmt: Remove --high functionality from run_kfdtest.sh (#2486) e4c46e3480 Hide the retain under direct dispatch check (#3698) bfe0ca0279 Add rocprof trace decoder to CI tests (#3690) a769b6f54e [rocSHMEM] Edgar/abstract allocator ipc part1 (#3411) 659fb52243 [AMD-SMI] Fix bugs, improve error handling, and clean up NIC/switch code (#3654) 0eb26ea571 hsakmt: Fix Import/Export of dmabuf_fd for WSL/Windows (#3348) a122936abb [SWDEV-567812] Add UBB power and power_limit fields to npm_info (#3262) c3bec090c5 [rocprofiler-sdk][rocprofv3][rocpd] Updates for KFD data (#340) 7c44d47740 SWDEV-547659 - Remove HIP_VERSION_GITHASH in logs (#448) 74b6487a6a SWDEV-547008 - Documentation fix for function return values (#463) af21cd44f1 SWDEV-545553 - Improve clarity and robustness of CALLBACK unit tests (#546) 180d639044 SWDEV-544900 - Change hip-test test case name (#547) feeca99950 Doc improvements (#3688) c1822b6336 ROCprofiler-SDK: deprecation of legacy tools (#3609) 5d7aff8462 Fix rocprof-compute-viewer link (#3459) 0b0b4846f0 AIRUNTIME-129 - Fix Ocl test failures of 2D image with pitches. (#3584) ac569b87e0 Fix memory tests config (#3687) 603fe7a5cf [hip-tests] Enable hipMipmappedArrayGetMemoryRequirements test via cmake 4fad4452d9 [hip] Docs: Updates to some memory management pages 8cc59559fe AICOMRCCL-656 fix memory leak in ncclCommInitRankFunc (#3628) 94a4595a5d Fix missing amd_comgr linkage in pc-sampling integration test (#3453) 2a68565dce rocrtst: CMAke file: strip xnack/feature suffixes from gfxNum in build_kernel (#3652) c3542bfb2b [rocprofv3] Deprecating input text files for counter collection (#1562) ff122e7ed7 SWDEV-573073 - Cleanup hipHostAlloc/Malloc/Register tests (#3017) 5b1deaf29d SWDEV-567112 - Introduce new mechanism for tagging and disabling tests - Part 1 - Core (#2351) 6e0cc309e1 rocrtst: MaxSingleAllocationTest: skip CPU NUMA nodes >0 (#3208) d65f601195 [AICOMRCCL-667] rccl: Change GDR selection logic. (#3607) f1c44ab200 Patch Back to Old Repo: fixes from manual runs (#3621) fe53bcd715 [AMD-SMI] Allow amdsmi init to succeed when no NIC hardware is present (#3403) b25600efdb [ROCM SMI] Fix fw pldm version not displayed in default amd-smi (#3594) 169d2ef763 root to module wiring, remove legacy source collection (#3482) 7469781988 [LRT][clr] SWDEV-512963-Fix CTS test failures for 1D buffer copy (#3520) c8f55d9b86 Adding rocprof trace decoder (#3576) 425e983502 Trace decoder codeowners (#3600) a176efd648 [hip-tests] Add return statements to HIP_SKIP_TEST (#3647) 32687cf183 rocrtst: CPUAccessToGPUMemoryTest: Cap host allocation to 512 MB under ASAN (#3407) 97c0206753 Update codeowners for thunk DXG (#3334) be44b28bb6 [rocdecode][rocjpeg] - ctest CMakeLists cleanup (#3632) 80ff0b8942 Various memory leak fixes in hip-tests (#3605) 0988f67a85 fix typo in help text (#3314) 9f823c53f1 Fix CUID file lookup by loading files before searching entries (#3436) 064c89261b SWDEV-546177 - hipModuleGetLoadingMode API impl (#653) 006213e112 ROCM-2696: Ignare size and base if null ptr (#3336) 6060b99d83 Improve atomic min max test perf (#2580) 3fbcc13602 Change printf capture impl (#1127) 93bc01937c (tag: hip-version_7.12.60610, origin/users/mradosav-amd/rocprofsys-selective-region) [ROCM-CORE] Update rdhc script to support rocm install prefix (ROCm/rocm-systems#3596) [AICOMRCCL-355]: https://amd-hub.atlassian.net/browse/AICOMRCCL-355?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

…r_groups return unmatched keys and add tests for it.

commit 8f69a63 (mar 12, 26). simplified generator loop, changed flags to include their label, optimized config(), more code style improvements.

…reate families based on cmake file. add functions to retrieve entries based on family, or default to a single entry for a family, make deduplciation of entries the default when retrieving multiple via get_entries_for_groups. add lookup() to tell you how the entry was resolved (target, family, family_prefix)

ScottTodd

Sorry, wish I could have given more of this feedback sooner, but I'm still not really seeing the dots fully connecting here. The "GPU family" code is adding significant complexity. I think for multi-arch we need to trim that out and make it less foundational to the CI build/test configuration code paths.

I think my comment on a flatter list of GpuRunners is trending in a cleaner direction.

ScottTodd · 2026-03-23T17:21:29Z

+    def test_unknown_key_returns_none(self):
+        self.assertIsNone(amdgpu_family_info_matrix_all.get_entry("gfx9999"))


This could also raise KeyError instead of returning None. WDYT?

i would keep it with None as this is kinda the expected behavior for python

get(key, default=None, /)
Return the value for key if key is in the dictionary, else default. If default is not given, it defaults to None, so that this method never raises a KeyError.

https://docs.python.org/3/library/stdtypes.html#dict.get

ScottTodd · 2026-03-23T17:23:35Z

+    def test_get_entries_for_groups_reports_unmatched(self):
+        result = amdgpu_family_info_matrix_all.get_entries_for_groups(
+            ["gfx94X-dcgpu", "gfx9999-unknown", "gfx1151"]
+        )
+        self.assertEqual(len(result.entries), 2)
+        self.assertEqual(result.unmatched_keys, ["gfx9999-unknown"])


Is "unmatched keys" useful to track like this, or could we just raise KeyError instead? Is there a use case for passing in groups that aren't in the code and getting this sort of detailed reporting on a successful (non-raising) execution flow?

considering that get() above returns None i sadly think the current approach is the more uniform approach than raising here keyerror.

discussion with claude:
Python style:

GroupLookupResult as a @DataClass instead of a tuple directly follows the style guide ("no tuples for structured data")

The None-returning get_entry mirrors dict.get() — consistent with standard Python lookup conventions

Returning partial results with an explicit miss report is not "silently continuing on errors" — it surfaces the problem. The style guide's fail-fast rule targets dropping errors, not delegating them to the
caller

Architecture:

get_entries_for_groups is a bulk filter operation, not a dict lookup. Bulk operations in Python conventionally report partial results rather than raising on the first miss (e.g., how `pathlib.Path.glob() just
returns what matches, or how importlib handles missing submodules)

The caller-decides-strictness pattern is standard for library code — the function is a primitive, not a policy enforcer

The one thing that would violate the fail-fast principle is if unmatched_keys were silently ignored downstream. But that's a caller problem, not an API problem. The API does its job by surfacing them.

Returning partial results with an explicit miss report is not "silently continuing on errors" — it surfaces the problem. The style guide's fail-fast rule targets dropping errors, not delegating them to the
caller

Can you give an example of where the caller would provide a list of groups that contains invalid entries? As much as possible, I'd like for the CI system (anything in build_tools/github_actions/) to be correct by construction and not need to patch around issues that it causes on its own. We can add unit tests to enforce only valid GPU targets/groups/families in any configurations (e.g. "run these tests nightly", "run these tests on every PR"). When there is user input (e.g. workflow dispatch text fields), we can validate those inputs during parsing.

I'm not sure if we even need this get_entries_for_groups() function that deduplicates and tracks unmatched. How do you see get_entries_for_groups() and even get_entry() being used? I suspect we could get away with just get_entry() and have that raise KeyError instead of returning None too.

get_entries_for_group() is the entry point for the workflow files.

the idea is that you take the comma-separate inputs.amdgpu_families, the configure_ci.py transforms it in a list and then you call get_entries_for_group() and get back all matching MatrixEntries. Later you can call MatrixEntry.to_dict(platform=linux) to get the dictionary of that particular arch for the linux platform (or None)

So I thought having an easy way of reporting back on non-matching archs might be nice :)

ScottTodd · 2026-03-23T17:25:17Z

+    def test_family_default_resolves_for_gfx110X(self):
+        entry = amdgpu_family_info_matrix_all.get_entry("gfx110X-all")
+        self.assertIsNotNone(entry)
+        self.assertEqual(entry.key, "gfx1101")


This test case is very similar to one above:

def test_family_prefix_resolves_for_gfx110X(self): entry = amdgpu_family_info_matrix_all.get_entry("gfx110X") self.assertIsNotNone(entry) self.assertEqual(entry.key, "gfx1101")

Maybe merge the test cases into a single test? Or put them next to each other?

I'm not sure how useful it is to have both tests. Maybe a single "prefix" test and then a negative test showing that something like gfx110X-unknowngroup raises an exception?

ScottTodd · 2026-03-23T17:29:02Z

+class TestDefaultConfig(unittest.TestCase):
+    """Verify default field values for entries with minimal configuration."""
+
+    def test_default_build_config(self):
+        cfg = BuildConfig()
+        self.assertEqual(cfg.build_variants, ["release"])
+        self.assertFalse(cfg.expect_failure)
+
+    def test_default_test_config_no_runner(self):
+        cfg = TestConfig()
+        self.assertFalse(cfg.run_tests)
+        self.assertEqual(cfg.test_scope, "all")
+        self.assertFalse(cfg.sanity_check_only_for_family)
+        self.assertFalse(cfg.expect_pytorch_failure)
+        self.assertFalse(cfg.runs_on)
+        self.assertEqual(cfg.fetch_gfx_targets, [])
+
+    def test_default_release_config(self):
+        cfg = ReleaseConfig()
+        self.assertFalse(cfg.bypass_tests_for_releases)
+
+    def test_platform_config_initializes_all_three(self):
+        cfg = PlatformConfig()
+        self.assertIsInstance(cfg.build, BuildConfig)
+        self.assertIsInstance(cfg.test, TestConfig)
+        self.assertIsInstance(cfg.release, ReleaseConfig)


Are these tests useful? They look like change detector tests just checking the default field values which are trivially checkable by reading the builder functions.

ScottTodd · 2026-03-23T17:34:07Z

+    def __bool__(self) -> bool:
+        # True if any runner is set; used by TestConfig.__post_init__ to infer run_tests.
+        return bool(self.test or self.test_multi_gpu or self.benchmark or self.extra)


The operator overloading here doesn't feel obvious to me. Could pick a function name like has_any_runner

def has_any_runner(self) -> bool: return any([self.test, self.test_multi_gpu, self.benchmark, self.extra])

you mean how run_tests is set?
that it is true by default if there are any runners and not overwritten to false?

for me this feels like that makes life quite simple once you get use to it?
i mean we can pack it in an extra function but then you in the ci you need to check 2x - or you just want to rename the funciton?

I'm referring to the function name (overloading __bool__).

Here are some other style guide references:

https://google.github.io/styleguide/pyguide.html#213-properties

Properties are allowed, but, like operator overloading, should only be used when necessary and match the expectations of typical attribute access; follow the getters and setters rules otherwise.

https://google.github.io/styleguide/cppguide.html#Operator_Overloading

Pros:
Operator overloading can make code more concise and intuitive by enabling user-defined types to behave the same as built-in types. Overloaded operators are the idiomatic names for certain operations (e.g., ==, <, =, and <<), and adhering to those conventions can make user-defined types more readable and enable them to interoperate with libraries that expect those names.

Cons:

Providing a correct, consistent, and unsurprising set of operator overloads requires some care, and failure to do so can lead to confusion and bugs.

Where this is used, I think this reads more clearly:

def __post_init__(self): - # set run_tests to True if any runners are specified, - # False otherwise, if not explicitly set + # Default to running tests if any runner is specified. if self.run_tests is None: - self.run_tests = bool(self.runs_on) + self.run_tests = self.runs_on.has_any_runner()

ScottTodd · 2026-03-23T17:58:52Z

+    """If True, only a sanity-check test subset is run, not the full suite."""
+    test_scope: Literal["all", "smoke", "full"] = "all"
+    """Which test subset to run: all (default), smoke (sanity only), full (skip smoke)."""
+    expect_pytorch_failure: bool = False


#3992 also added quick as a test scope/type

ScottTodd · 2026-03-23T18:20:06Z

+AmdGpuFamilyMatrix.get_default_for_family(family)
+    Return the is_family_default entry whose family list contains the given name,
+    or None if no default is set (e.g. gfx115X-all where GPUs are registered individually)
+    or the requested family does not exist.


What is a "family default"? I'm struggling to see what problem that is solving.

I think this gets back to my original first-pass review comments. This is adding indirection that we shouldn't need, and it's going to make changing this code harder.

The dataflow through

Configure CI

List of GPU targets

List of GPU runner labels

should be pretty straight-line, and this further encodes a hierarchy of "familes" with "defaults". If we add gfx1100, gfx1101, and gfx1102 runners for example, this code would keep resolving the "default" for "gfx110X-all" as "gfx1101" when what we really should be doing is passing that full list of gfx1100, gfx1101, gfx1102 directly through the pipeline.

I don't think we should design around internal code checking for gfx110X-all (for example), at all. We can design around workflow inputs that use such convenience families at the API boundary but the CI code should directly use what it intends, without needing all that extra machinery for prefixes / grouping / defaults.

When we build generic targets (https://llvm.org/docs/AMDGPUUsage.html, e.g. gfx11-generic), then I think a "default" of some sort may make sense, but I'm not seeing the value right now. We'll actually want a different one-to-many relationship in that case, like

GFX11_GENERIC = MatrixEntry( target="gfx11-generic", linux=PlatformConfig( test=TestConfig( runs_on=[ GpuRunners(test="linux-gfx1100-gpu-rocm", gfx_target="gfx1100"), GpuRunners(test="linux-gfx1151-gpu-rocm", gfx_target="gfx1151"), ], ), ), )

Here's a write-up from my review agent after some discussion: https://github.com/ScottTodd/claude-rocm-workspace/blob/main/reviews/pr_3653.md#%EF%B8%8F-important-family-resolution-layer-should-be-removed

we can remove it.
i added it to have some mapping from "someone puts manually gfx110X as gpu in the workflow" to "requesting gfx110X means calling gfx110X-all".

Yeah convenience mappings like "gfx110X input" --> "build [gfx1100, gfx1101, gfx1102, gfx1103] and test on any of those that have runners" are fine. Looking further forward, if we add a web UI to trigger CI/CD jobs we can add UI options for that like:

then, the code at the CI layer could just trust its input to already be sanitized/expanded.

ok then we leave it. i will see if i can clarify this a bit better in the comments.

ScottTodd · 2026-03-23T19:25:44Z

+    # gfx94X family — gfx942 is the only member of gfx94X-dcgpu
+    GFX942 = MatrixEntry(
+        target="gfx942",
+        is_family_default=True,
+        linux=PlatformConfig(
+            build=BuildConfig(build_variants=["release", "asan", "tsan"]),
+            test=TestConfig(
+                runs_on=GpuRunners(
+                    test="linux-mi325-1gpu-ossci-rocm-frac",
+                    test_multi_gpu="linux-mi325-8gpu-ossci-rocm",
+                    # TODO(#2754): Add new benchmark-runs-on runner for benchmarks
+                    benchmark="linux-mi325-8gpu-ossci-rocm",
+                    # TODO(#3433): Remove sandbox label once ASAN tests are passing
+                    extra={"test-sandbox": "linux-mi325-8gpu-ossci-rocm-sandbox"},
+                ),
+                fetch_gfx_targets=["gfx942"],
+            ),
+        ),
+    )


Here's an idea: what if we pulled the GpuRunners out into a single list, as an inventory of what test machines we have?

runners = [ GpuRunner(platform="linux", target="gfx942", label="linux-mi325-1gpu-ossci-rocm-frac"), GpuRunner(platform="linux", target="gfx942", label="linux-mi325-8gpu-ossci-rocm", role="test-multi-gpu"), GpuRunner(platform="linux", target="gfx942", label="linux-mi325-8gpu-ossci-rocm", role="benchmark"), GpuRunner(platform="linux", target="gfx90a", label="linux-gfx90a-gpu-rocm"), GpuRunner(platform="linux", target="gfx1030", label="linux-gfx1030-gpu-rocm"), GpuRunner(platform="linux", target="gfx1151", label="linux-gfx1151-gpu-rocm"), GpuRunner(platform="linux", target="gfx1151", label="linux-strix-halo-gpu-rocm-oem", role="oem"), GpuRunner(platform="windows", target="gfx1101", label="windows-gfx110X-gpu-rocm"), GpuRunner(platform="windows", target="gfx1151", label="windows-gfx1151-gpu-rocm"), GpuRunner(platform="windows", target="gfx1151", label="windows-gfx1151-gpu-rocm", role="benchmark"), # ... ]

(we could also have separate lists for Windows and Linux, to make the code more dense)

Code would query that list of runners when deciding how to test. If we want to take a runner offline we can comment out that row. If we want to pull that list from a dynamic location we don't need to update the entire build configuration (asan/release/etc.) too.

Here's how some of the other code using this could look:

# Matrix entries — per-target-per-platform policy entries = [ MatrixEntry(target="gfx942", platform="linux"), MatrixEntry(target="gfx1151", platform="linux", sanity_check_only=True), MatrixEntry(target="gfx1151", platform="windows", test_scope="full"), MatrixEntry(target="gfx1152", platform="linux", expect_build_failure=True), MatrixEntry(target="gfx1152", platform="windows", expect_build_failure=True), # ... ] # Group configs — what CI does per trigger presubmit = GroupConfig( build_families=["gfx94X-dcgpu", "gfx110X-all", "gfx115X-all", "gfx120X-all"], test_targets=["gfx942", "gfx1101", "gfx1151", "gfx1201"], build_variants=["release"], )

Here's a more concrete proposal: https://github.com/ScottTodd/claude-rocm-workspace/blob/main/reviews/pr_3653.md#proposed-alternative-three-flat-tables

i am all in in putting the gpurunners in a separate list. (i would also hope in the future we have a database to populate it. then we also dont have problems with outdated runners names)

however i think the rest of the proposal of claude is a downgrade and i would rather not.

I'm proposing a similarly structured list in some new code here: #4386

@dataclass(frozen=True) class S3BucketConfig: """Metadata for a single bucket in S3""" name: str """S3 bucket name (e.g. 'therock-ci-artifacts')""" ... s3_bucket_configs = [ # CI (self-hosted runners include credentials for therock-ci-artifacts-external) S3BucketConfig("therock-ci-artifacts", iam_role="therock-ci"), S3BucketConfig("therock-ci-artifacts-external", iam_role=None), # Release type "dev" S3BucketConfig("therock-dev-artifacts", iam_role="therock-dev"), S3BucketConfig("therock-dev-packages", iam_role="therock-dev"), S3BucketConfig("therock-dev-python", iam_role="therock-dev"), S3BucketConfig("therock-dev-tarball", iam_role="therock-dev"), ... ] _BUCKET_CONFIGS_BY_NAME = {c.name: c for c in s3_bucket_configs} ... def get_artifacts_bucket_config(...) -> S3BucketConfig: ... bucket_name = "therock-ci-artifacts" return _BUCKET_CONFIGS_BY_NAME[bucket_name]

sounds good. i will have a look how to best adapt this.

…ust python style type declaration (remove optional, ...), rename __bool__ with has_any_runner; extend description of is_family_default

…s.py

…m from 28th april, introduce bypass_tests_for_unscheduled (upstream = nightly_check_only_for_family)

…ed, add auto-fill for runner selection

…a deep copy and only contains the build variant requested, remove unsused lookup() which was very similar to getEntry(), support new gpurunner and buildrunner layout

…hted selection, uptodate with april 28 amdgpu_family_matrix.py

…fy_amdgpu_matrix

…k_amdgpu_targets.cmake.

HereThereBeDragons · 2026-04-29T14:03:11Z

New AMD GPU Family Matrix — what's new and how to use it

TL;DR: amdgpu_family_matrix.py (the big dict-of-dicts) is being replaced by three dataclass-based modules. Same data, but typed, validated, and with a real lookup API. Old file stays in place during transition — keep both in sync.

The matrix now also takes a lot of heavy lifting off configure_ci.py — build-variant scoping (dropping platforms that don't support a variant, trimming build_variants to the requested one), build-runner re-binding for non-release variants, GPU runner inventory selection, and run_tests inference all happen inside get_entry() now, so callers just consume the result.
Manual intervention is only needed for special runners like "oem" that will be delivered in the ["extra"] field and needs to be set for test/test_multi_gpu/etc.

Files

File	Purpose
`new_amdgpu_family_matrix_types.py`	Dataclass schema (no data)
`new_amdgpu_family_matrix_data.py`	The actual matrix entries — this is where you add/edit GPUs
`new_amdgpu_family_matrix_runners.py`	Build/GPU runner inventories with weighted selection — this is where you add/edit Runners

How to look up an entry

from new_amdgpu_family_matrix_data import amdgpu_family_info_matrix_all as matrix

# Exact target name
entry = matrix.get_entry("gfx942")

# Or family name — resolves to the is_family_default entry
entry = matrix.get_entry("gfx94X-dcgpu")

# Build variant scoping (default is "release")
asan = matrix.get_entry("gfx942", build_variant="asan")  # None if unsupported

# Multiple keys at once
result = matrix.get_entries_for_groups(["gfx942", "gfx1151"])
result.entries          # matched MatrixEntry list
result.unmatched_keys   # keys with no match (unknown or unsupported variant)

get_entry() always returns a deep copy scoped to the requested variant — mutate freely, no leakage into the shared matrix.

Important behavior changes vs. the old dict

Family membership is auto-populated from cmake/therock_amdgpu_targets.cmake — no longer hand-listed per entry. CMake is the source of truth.
Runners are no longer hardcoded per entry. BuildConfig.runs_on and TestConfig.runs_on are filled from the inventory in _runners.py (with weighted selection across pools). Override per-entry only when you need to pin a specific runner. (already includes [ci] Enabling aws-linux-scale-rocm to 10% #4899)
run_tests is inferred from whether any GPU runner is set. Set it explicitly to True/False only to override.
Build variants (release / asan / tsan) are first-class. Looking up a non-release variant drops platforms that don't list it and re-binds the build runner to the matching pool.
is_family_default marks the entry returned for family-name lookups (e.g. gfx94X-dcgpu → which specific gfx target). Validated at construction — at most one default per gfx-prefixed family.
Test scope is now a Literal["quick", "comprehensive", "full"] — typo-safe.
bypass_tests_for_unscheduled lets you skip tests on PR/push but still run on nightly schedule. (renamed from nightly_check_only_for_family )
Add a MatrixEntry to _MatrixEntries in _data.py — that's it. Family lists are derived from CMake; runners come from the inventory. Then mirror the change in the legacy amdgpu_family_matrix.py until that file is removed.

Adding a new field

Add to the right dataclass in _types.py with a default
If it's derived, compute it in __post_init__
Add to to_dict() if consumers need it in the serialized output

Notes

not yet included: [ci] Adding host-asan for postsubmit #4603

HereThereBeDragons · 2026-04-29T14:04:20Z

btw @geomin12 volunteered to help port the multi arch configure ci to use this new version :)

and i think this is a good point to kill off/stop supporting the old configure_ci.py

HereThereBeDragons · 2026-04-29T19:07:26Z

#4613 for tracking

jayhawk-commits

Still need some time to digest the design in this latest state,, but Codex did find a difference in behaviour from the current matrix.

In amdgpu_family_matrix.py, Linux tsan is defined without an expect_failure override, so current consumers treat TSAN as blocking. With this replacement data, BuildVariantInfo.to_dict() will serialize expect_failure=True, and a future migration would turn TSAN into a continue-on-error job unless that policy change is deliberate.

ScottTodd · 2026-05-01T20:18:47Z

In amdgpu_family_matrix.py, Linux tsan is defined without an expect_failure override, so current consumers treat TSAN as blocking. With this replacement data, BuildVariantInfo.to_dict() will serialize expect_failure=True, and a future migration would turn TSAN into a continue-on-error job unless that policy change is deliberate.

expect_failure isn't even in use or working right now. I think we should remove it, see #4500

geomin12

looks good! i did a cross reference with amdgpu_family_matrix.py, and all is covered (also checked with claude). have some housekeeping comments on easy of viewing and some additional entries

geomin12 · 2026-05-07T21:12:34Z

+    a family, the is_family_default entry carries the full CI config; siblings
+    get minimal config and inherit on family-name lookups."""
+
+    GFX900 = MatrixEntry(


gfx90c was added as well!

TheRock/cmake/therock_amdgpu_targets.cmake

Line 58 in 8d042a2

therock_add_amdgpu_target(gfx90c "AMD Renoir/Lucienne/Cezanne iGPU" FAMILY igpu-all gfx90c-igpu

geomin12 · 2026-05-07T21:15:51Z

+##########################################################################################
+
+
+def _build_groups() -> dict[str, list[str]]:


should we perhaps set these fields in the _MatrixEntries? that way we don't need to hard-code this and can collect the matrices from _MatrixEntries?

geomin12 · 2026-05-07T21:17:50Z

+    'test-sandbox'). Extra keys are forwarded as-is into GpuRunners.extra."""
+    label: str
+    """The runner label string."""
+    weight: float = 1.0


can we add some comment that is a temporary stopgap? i really don't like separate labels and having to adjust weights. i plan to work with OSSCI to update

geomin12 · 2026-05-07T21:18:36Z

+_TEST_RUNNER_INVENTORY: list[_GpuRunnerEntry] = [
+    # gfx90a
+    _GpuRunnerEntry(
+        platform="linux", target="gfx90a", role="test", label="linux-gfx90a-gpu-rocm"


nit: can we make this new line like the bigger entries? for consistency... although i don't know if pre-commit will complain or not

geomin12 · 2026-05-07T21:18:44Z

+    ),
+    # gfx1030
+    _GpuRunnerEntry(
+        platform="linux", target="gfx1030", role="test", label="linux-gfx1030-gpu-rocm"


same for all these entries

geomin12 · 2026-05-07T21:19:34Z

+    _BuildRunnerEntry(
+        platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90
+    ),
+    _BuildRunnerEntry(
+        platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10
+    ),
+    _BuildRunnerEntry(
+        platform="linux",
+        variant="sanitizer",
+        label="azure-linux-scale-rocm-heavy-ramdisk",
+        weight=1,
+    ),
+    _BuildRunnerEntry(
+        platform="windows",
+        variant="default",
+        label="azure-windows-scale-rocm",
+        weight=1,
+    ),


can we add over-arching comments to highlight the differences?

Suggested change

_BuildRunnerEntry(

platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90

),

_BuildRunnerEntry(

platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10

),

_BuildRunnerEntry(

platform="linux",

variant="sanitizer",

label="azure-linux-scale-rocm-heavy-ramdisk",

weight=1,

),

_BuildRunnerEntry(

platform="windows",

variant="default",

label="azure-windows-scale-rocm",

weight=1,

),

# default

_BuildRunnerEntry(

platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90

),

_BuildRunnerEntry(

platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10

),

# sanitizer

_BuildRunnerEntry(

platform="linux",

variant="sanitizer",

label="azure-linux-scale-rocm-heavy-ramdisk",

weight=1,

),

_BuildRunnerEntry(

platform="windows",

variant="default",

label="azure-windows-scale-rocm",

weight=1,

),

something like so? it was initially tough to spot

geomin12 · 2026-05-07T21:19:51Z

+        platform="linux", variant="default", label="azure-linux-scale-rocm", weight=90
+    ),
+    _BuildRunnerEntry(
+        platform="linux", variant="default", label="aws-linux-scale-rocm", weight=10


oh, can we make this have new lines with each entry? to make it easier to adjust / consistency with others

geomin12 · 2026-05-07T21:20:31Z

+    as-is regardless of weight; weights are relative and need not sum to 1.0."""
+    if len(entries) == 1:
+        return entries[0].label
+    return random.choices(entries, weights=[e.weight for e in entries], k=1)[0].label


way cleaner, thanks :)

geomin12 · 2026-05-07T21:21:21Z

+
+
+# Roles that map to the named fields on GpuRunners. Anything else goes into `extra`.
+_NAMED_ROLES = {"test", "test_multi_gpu", "benchmark"}


is sandbox supposed to be in here?

geomin12 · 2026-05-07T21:25:03Z

+            suffix="asan",
+            cmake_preset="linux-release-asan",
+        ),
+        "tsan": BuildVariantInfo(


i know host-asan will be added soon with #4603

ScottTodd

Marking as "reviewed" to drop from my review queue.

I still think we should remove the concept of gpu "families" from this part of the CI system entirely. This PR is probably too big and complex at this point - it would be faster to start fresh on a new branch/PR.

github-project-automation Bot added this to TheRock Triage Feb 26, 2026

github-project-automation Bot moved this to TODO in TheRock Triage Feb 26, 2026

HereThereBeDragons added 2 commits February 27, 2026 09:40

use classes for amdgpu matrix

f4afb86

update matrix data to match latest version, adjust dictionary creatio…

f6a8535

…n, introduce test_scope, remove push_on_success, always create all configs when declaring platformconfig, make arch lookup case-insensitive, add unittest

HereThereBeDragons force-pushed the users/lpromber/classify_amdgpu_matrix branch from 6b4fb3a to f6a8535 Compare February 27, 2026 18:05

HereThereBeDragons marked this pull request as ready for review February 27, 2026 18:12

HereThereBeDragons requested review from ScottTodd and geomin12 February 27, 2026 18:12

ScottTodd reviewed Mar 2, 2026

View reviewed changes

geomin12 approved these changes Mar 3, 2026

View reviewed changes

HereThereBeDragons added 2 commits March 6, 2026 18:34

adding igpu and clarifying "generic_scopes"

a4a3eee

move amdgpu_family_predefined_groups and _GFX entries into function/c…

2271c2d

…lass scope to not pollute module space

HereThereBeDragons requested a review from ScottTodd March 6, 2026 18:41

rahulc-gh mentioned this pull request Mar 9, 2026

Bump rocm-systems submodule from 93bc019 to 093b66caa3 #3754

Merged

amd-shiraz self-requested a review March 10, 2026 00:44

erman-gurses self-requested a review March 10, 2026 05:17

erman-gurses reviewed Mar 10, 2026

View reviewed changes

change request erman: add to_nested_dict tests and let get_entries_fo…

8f69a63

…r_groups return unmatched keys and add tests for it.

HereThereBeDragons requested a review from erman-gurses March 12, 2026 10:36

jayhawk-commits mentioned this pull request Mar 16, 2026

[issue] update fragile empty string for mi325 runner skip #3991

Open

HereThereBeDragons mentioned this pull request Mar 23, 2026

[ci] amdgpu_family_matrix.py data from S3 bucket #4020

Closed

ScottTodd reviewed Mar 23, 2026

View reviewed changes

This was referenced Mar 31, 2026

Fork configuration scripts and workflows for multi-arch CI #4123

Merged

ci: Fetch gfx1101 kpack artifacts for Windows gfx110X tests #4281

Merged

HereThereBeDragons requested a review from ScottTodd April 7, 2026 19:43

This was referenced Apr 9, 2026

[Multi-arch] Packaging workflows need complete target lists, not families for KPACK_SPLIT_ARTIFACTS #4433

Closed

Add --expand-family-to-targets to artifact_manager fetch #4449

Merged

HereThereBeDragons added 11 commits April 27, 2026 18:52

matrix_types: replace test scope with full, comprehensive, smoke; adj…

a27532d

…ust python style type declaration (remove optional, ...), rename __bool__ with has_any_runner; extend description of is_family_default

add buildconfig runs_on build runner to new_amdgpu_family_matrix_type…

59a0532

…s.py

_data.py: remove runners, add gfx900, update config to match upstrea…

6c45b6c

…m from 28th april, introduce bypass_tests_for_unscheduled (upstream = nightly_check_only_for_family)

_types: remove expect_pytorch_failure, add bypass_tests_for_unschedul…

0466aa8

…ed, add auto-fill for runner selection

_types.py: update and condense comments, getEntry now always returns …

cf49d7a

…a deep copy and only contains the build variant requested, remove unsused lookup() which was very similar to getEntry(), support new gpurunner and buildrunner layout

_runner.py: new file for buildrunner and gpurunners. support for weig…

e077b57

…hted selection, uptodate with april 28 amdgpu_family_matrix.py

fix precommit

0c52a0d

_types: fix how run_test bool is evaluated.

ea5edca

_data: update comments/description

a4ee3f6

_test.py: update and expand tests

8c9bb15

Merge remote-tracking branch 'origin/main' into users/lpromber/classi…

ed1fbf0

…fy_amdgpu_matrix

HereThereBeDragons requested a review from geomin12 April 29, 2026 13:36

start using #4449 marius script to extract families from cmake/theroc…

8b24f69

…k_amdgpu_targets.cmake.

add license header

ec6398c

HereThereBeDragons requested a review from jayhawk-commits April 30, 2026 12:17

jayhawk-commits reviewed May 1, 2026

View reviewed changes

geomin12 reviewed May 7, 2026

View reviewed changes

ScottTodd mentioned this pull request May 14, 2026

[feature] Improving runner data retrieval #5260

Open

ScottTodd reviewed May 18, 2026

View reviewed changes

		scope Subgroup within a family. Generic scopes are "dcgpu", "dgpu", "all".
		Specific GPU names (e.g. "gfx1151") are also valid scopes.

acronym	meaning
dgpu	discrete gpu
dcgpu	datacenter gpu
igpu	integrated gpu

		def test_unknown_key_returns_none(self):
		self.assertIsNone(amdgpu_family_info_matrix_all.get_entry("gfx9999"))

		##########################################################################################


		def _build_groups() -> dict[str, list[str]]:



		# Roles that map to the named fields on GpuRunners. Anything else goes into `extra`.
		_NAMED_ROLES = {"test", "test_multi_gpu", "benchmark"}

Conversation

HereThereBeDragons commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HereThereBeDragons commented Feb 27, 2026

Uh oh!

ScottTodd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HereThereBeDragons commented Mar 3, 2026

Uh oh!

geomin12 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

geomin12 commented Mar 3, 2026

Uh oh!

erman-gurses left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

erman-gurses Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ScottTodd left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HereThereBeDragons Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

HereThereBeDragons commented Feb 26, 2026 •

edited

Loading

erman-gurses Mar 12, 2026 •

edited

Loading

HereThereBeDragons Apr 8, 2026 •

edited

Loading