
[CI] Weekly CI and introducing a new amdgpu matrix generator #1732

Draft
HereThereBeDragons wants to merge 27 commits into main from users/lpromber/ci_weekly

@HereThereBeDragons
Contributor

This introduces a new weekly CI.

To make this practical, larger changes to the amdgpu matrix generator were necessary. The generator can now create build, test, and/or release target configs in a single run for both Windows and Linux. For pull requests, it keeps the existing behavior of deciding dynamically, based on the modified files, whether a target needs to be built.

In addition, new_setup.yml is used to provide the container image. It can either use the default TheRock CI image or accept a custom one.

@HereThereBeDragons
Contributor Author

HereThereBeDragons commented Jan 27, 2026

@marbre @ScottTodd @geomin12 @jayhawk-commits

just for info:
This is now at a state where I will fan it out into smaller PRs to get it merged.

  • New amdgpu_matrix layout and corresponding configure_amdgpu_matrix.py (replacing configure_ci.py). The multi_arch generator() is still missing and will be done in a second step. This already integrates some version of geo's "[ci] Adding ability to run kernel-specified test runners" #2902.
  • ci_weekly using the new layout and allowing the container image to be set; the image now has a default at a central location, (new_)setup.yml.

This would run in parallel until all workflows are transferred. Only the high-level workflows need to be adjusted.

Successful cmake4 run (rebase on main from last Friday or this Monday?): https://github.com/ROCm/TheRock/actions/runs/21402078653
new run: https://github.com/ROCm/TheRock/actions/runs/21407180474

new layout of the matrix for the workflows:

amdgpu_family_matrix:
{'linux': [{'amdgpu_family': 'gfx94X-dcgpu',
            'build': {'artifact_group': 'gfx94X-dcgpu',
                      'build_variant_cmake_preset': '',
                      'build_variant_label': 'release',
                      'build_variant_suffix': '',
                      'expect_failure': False},
            'test': {'expect_pytorch_failure': False,
                     'run_tests': True,
                     'runs_on': {'benchmark': 'linux-mi325-1gpu-ossci-rocm-frac',
                                 'test': 'linux-mi325-1gpu-ossci-rocm-frac'},
                     'sanity_check_only_for_family': False}}]}

The build, test, and release entries are only added when requested and only if there is content (e.g. no test runners = no test config returned). The same applies to the windows and linux keys.
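As a rough illustration of the "only when requested and non-empty" rule, here is a minimal sketch. It is not the code from `configure_amdgpu_matrix.py`; `FAMILY_INFO` and `make_matrix` are hypothetical names, and only the dictionary keys shown in the matrix dump above come from the PR.

```python
from typing import Any

# Hypothetical, trimmed-down family info. Only the key names are taken
# from the matrix dump in this PR; the nesting is an assumption.
FAMILY_INFO: dict[str, dict[str, Any]] = {
    "gfx94X-dcgpu": {
        "linux": {
            "build": {"artifact_group": "gfx94X-dcgpu", "expect_failure": False},
            "test": {
                "run_tests": True,
                "runs_on": {"test": "linux-mi325-1gpu-ossci-rocm-frac"},
            },
        },
        # No "windows" entry: this family yields nothing for Windows.
    },
}


def make_matrix(families, platforms, steps):
    """Return {platform: [entry, ...]} with only requested, non-empty steps."""
    matrix = {}
    for platform in platforms:
        entries = []
        for family in families:
            info = FAMILY_INFO.get(family, {}).get(platform)
            if not info:
                continue
            entry = {"amdgpu_family": family}
            for step in steps:
                config = info.get(step)
                # A test config without runners counts as "no content".
                if step == "test" and config and not config.get("runs_on"):
                    config = None
                if config:
                    entry[step] = config
            # Keep the entry only if at least one step produced content.
            if len(entry) > 1:
                entries.append(entry)
        if entries:
            matrix[platform] = entries
    return matrix
```

Requesting `["build", "test", "release"]` for both platforms with this data would return a single `linux` entry with `build` and `test` keys, since no release config or Windows data is defined.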

HereThereBeDragons added a commit that referenced this pull request Feb 5, 2026
Move all functions related to CI path filtering, and to deciding based
on it whether the CI should run, into a separate file
`configure_ci_path_filters.py`.

Aside from adjusting descriptions, this also gives better names to the
following functions:
- get_modified_paths -> get_git_modified_paths
- get_therock_submodule_paths -> get_git_submodule_paths
- should_ci_run_given_modified_paths -> is_ci_run_required


Part of CI weekly progress (big picture #1732) so that the new and old
CI configurators can share those functions.
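For readers unfamiliar with path filtering, a minimal sketch of what a function like `is_ci_run_required` might do follows. The signature, the `SKIPPABLE_PATH_PATTERNS` list, and the `is_path_skippable` helper are assumptions for illustration, not the contents of `configure_ci_path_filters.py`.

```python
import fnmatch
from typing import Iterable, Optional

# Hypothetical patterns: changes that only touch these paths do not
# require a CI run.
SKIPPABLE_PATH_PATTERNS = ["docs/*", "*.md", "*.gitignore"]


def is_path_skippable(path: str) -> bool:
    """Return True if the path matches any skippable pattern."""
    return any(fnmatch.fnmatch(path, p) for p in SKIPPABLE_PATH_PATTERNS)


def is_ci_run_required(paths: Optional[Iterable[str]]) -> bool:
    """Return True if at least one modified path is not skippable."""
    if paths is None:
        # The modified paths could not be determined: run CI to be safe.
        return True
    return any(not is_path_skippable(p) for p in paths)
```

Keeping this logic in one module lets both the old and the new configurator import the same decision function instead of duplicating it.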
HereThereBeDragons added a commit that referenced this pull request Feb 12, 2026
This PR is part of enabling CI weekly (big picture PR #1732). For
this, amdgpu_family needs to be refactored so that a specific GPU can be
selected easily, without having to know which group
(presubmit/postsubmit/nightlies) it belongs to.

The new layout puts all GPUs in a single dictionary
`amdgpu_family_info_matrix_all`. Choices for pre/postsubmit and
nightlies are now made via a list.

`amdgpu_family_info_matrix_all` has more depth in its hierarchy to better
define which parameters belong to which step: build, test, release.
- Tests can now be turned off with a single bool flag; there is no need
to comment out the runner.
- Default runner labels are "test", "benchmark" and
"test-runs-on-multi-gpu". Further labels can be introduced, e.g. "oem",
which in the future can override the default runners.

This new layout is introduced in parallel to the old one. The old one
stays unchanged to allow the CI to move gradually to the new layout.
Commit 8f69a63 (Mar 12, 2026): simplified generator loop, changed flags to include their label, optimized config(), more code style improvements.
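A small sketch of how such a layout could be queried is shown below. The nesting order (family, then platform, then step), the list names, the second family, the "oem" runner, and the `pick_runner` helper are all assumptions for illustration; only `amdgpu_family_info_matrix_all`, the `run_tests` flag, and the runner label names come from the commit message.

```python
# Assumed nesting: family -> platform -> step (build/test/release).
amdgpu_family_info_matrix_all = {
    "gfx94X-dcgpu": {
        "linux": {
            "build": {"expect_failure": False},
            "test": {
                "run_tests": True,
                "runs_on": {
                    "test": "linux-mi325-1gpu-ossci-rocm-frac",
                    "benchmark": "linux-mi325-1gpu-ossci-rocm-frac",
                    "oem": "hypothetical-oem-runner",  # made-up label value
                },
            },
        },
    },
    "gfx-hypothetical": {  # made-up family, tests switched off by the flag
        "linux": {
            "build": {"expect_failure": False},
            "test": {"run_tests": False, "runs_on": {"test": "some-runner"}},
        },
    },
}

# Trigger-specific choices are plain lists of family names (names assumed).
amdgpu_family_presubmit = ["gfx94X-dcgpu"]
amdgpu_family_nightly = ["gfx94X-dcgpu", "gfx-hypothetical"]


def pick_runner(family, platform, label="test", override_label=None):
    """Return the runner for a family, or None if tests are turned off."""
    info = amdgpu_family_info_matrix_all.get(family, {}).get(platform, {})
    test = info.get("test")
    if not test or not test.get("run_tests", False):
        return None
    runs_on = test.get("runs_on", {})
    # An extra label such as "oem" takes precedence when it is defined.
    if override_label and override_label in runs_on:
        return runs_on[override_label]
    return runs_on.get(label)
```

With this shape, disabling tests for a family is a one-line change to `run_tests`, and the runner definition stays in place for when tests are re-enabled.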
ScottTodd added a commit that referenced this pull request Mar 31, 2026
This adds new `setup_multi_arch.yml` and
`configure_multi_arch_ci_summary.py` configuration code for multi-arch
CI.

## Motivation

The existing
[`build_tools/github_actions/configure_ci.py`](https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/configure_ci.py)
script and
[`.github/workflows/setup.yml`](https://github.com/ROCm/TheRock/blob/main/.github/workflows/setup.yml)
workflow were both tightly coupled to the non-multi-arch CI pipelines
and multi-arch CI has some unique needs:

* #3399
* Different workflow I/O (single-arch CI has a matrix across each
family, multi-arch CI has a single pipeline per platform)
* (Related) Setting alternate schedules, see
#1732

I also judged that starting fresh would be easier, given architectural
issues with the existing code:
* Mixed responsibilities in a 300 LOC `matrix_generator()` function
* Sequencing of decisions (whether to skip, what to build, what to test,
etc.) was scattered and sometimes duplicated

## Technical Details

> [!NOTE]
> See my worklog for this feature branch here:
[`tasks/active/multi-arch-configure.md`](https://github.com/ScottTodd/claude-rocm-workspace/blob/main/tasks/active/multi-arch-configure.md)

The new code is structured as a sequence of stages:

1. Parse inputs from github (triggers, labels) and git (files changed)
2. Check "skip CI" gate to early return
3. Decide which jobs to run and with what options (skip/use
prebuilt/build)
4. Decide which GPU targets/families to build (trigger type + labels -->
per-platform GPU families)
5. Expand per-platform build configuration data structures
6. Write outputs to github outputs and step summary
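The staged structure above can be sketched roughly as follows. This is a simplified illustration, not the code in `configure_multi_arch_ci.py`: apart from the names `should_skip_ci`, `select_targets`, and `configure`, which appear in this PR, every class, field, label, and rule here is a hypothetical stand-in.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CIInputs:
    """Stage 1 result: values read from GitHub (trigger, labels) and git."""
    event_name: str
    labels: tuple = ()
    changed_files: tuple = ()


def should_skip_ci(inputs: CIInputs) -> Optional[str]:
    """Stage 2: return a skip reason, or None to continue."""
    if "skip-ci" in inputs.labels:  # hypothetical label name
        return "skip-ci label set"
    if inputs.event_name == "pull_request" and not inputs.changed_files:
        return "no files changed"
    return None


def decide_jobs(inputs: CIInputs) -> dict:
    """Stage 3: pick an action per job (made-up rule for illustration)."""
    files = inputs.changed_files
    only_tests = bool(files) and all(f.startswith("tests/") for f in files)
    return {"build_rocm": "use_prebuilt" if only_tests else "build"}


def select_targets(inputs: CIInputs) -> dict:
    """Stage 4: trigger type + labels -> per-platform GPU families."""
    families = {"linux": ["gfx94X-dcgpu"], "windows": []}
    for label in inputs.labels:
        if label.startswith("gfx"):  # hypothetical opt-in label convention
            families["linux"].append(label)
    return families


def expand_build_configs(jobs: dict, targets: dict) -> dict:
    """Stage 5: one build config per platform that has any family."""
    return {p: {"families": f, "jobs": jobs} for p, f in targets.items() if f}


def format_outputs(configs: dict) -> list:
    """Stage 6: render name=value lines, one JSON blob per platform."""
    return [
        f"{platform}_build_config={json.dumps(cfg, sort_keys=True)}"
        for platform, cfg in sorted(configs.items())
    ]


def configure(inputs: CIInputs) -> list:
    """Run stages 2 to 6 in order on already-parsed inputs."""
    reason = should_skip_ci(inputs)
    if reason is not None:
        return [f"skip_reason={reason}"]
    jobs = decide_jobs(inputs)
    targets = select_targets(inputs)
    return format_outputs(expand_build_configs(jobs, targets))
```

Because each stage only takes plain data and returns plain data, the stages can be tested one at a time, with file and environment access confined to the first and last step.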

It includes several ✨NEW✨ features:
* Rendered markdown summarizing the configure output
    | Before | After |
    | -- | -- |
    | https://github.com/ROCm/TheRock/actions/runs/23509071878?pr=4142 <img width="586" height="440" alt="image" src="https://github.com/user-attachments/assets/d03be1d9-c9cc-4075-b06f-2e48eb80f47a" /> | https://github.com/ROCm/TheRock/actions/runs/23465566745?pr=4123 <img width="716" height="978" alt="image" src="https://github.com/user-attachments/assets/83dac546-4e0d-4351-98a5-2170c562317d" /> |
* A collapsed build graph instead of a matrix with a single build
variant
    | Before | After |
    | -- | -- |
    | https://github.com/ROCm/TheRock/actions/runs/23509071878?pr=4142 <img width="967" height="430" alt="image" src="https://github.com/user-attachments/assets/ace99a26-2e21-4763-aff0-cf8ffea85ab1" /> | https://github.com/ROCm/TheRock/actions/runs/23465566745?pr=4123 <img width="1962" height="619" alt="image" src="https://github.com/user-attachments/assets/b5101768-5390-4adf-87de-c5a99c45799f" /> |
* Reading inputs via `GITHUB_EVENT_PATH` instead of explicit/verbose
`github.event.inputs` plumbing
* Passing outputs via `build_config` JSON instead of explicit/verbose
`artifact_group`, `matrix_per_family_json`, `dist_amdgpu_families`, etc.
(to stay under input count limits and make cross-file maintenance
hopefully easier)
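To illustrate the `GITHUB_EVENT_PATH` approach mentioned in the list above: GitHub Actions sets that environment variable to the path of a JSON file holding the webhook event payload, so a script can read workflow inputs and PR labels itself. The sketch below shows the general idea only; the helper names are hypothetical and this is not the code from the PR.

```python
import json
import os
from typing import Mapping


def load_event(environ: Mapping[str, str] = os.environ) -> dict:
    """Load the event payload, or return {} when not running in Actions."""
    path = environ.get("GITHUB_EVENT_PATH")
    if not path:
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def get_dispatch_inputs(event: dict) -> dict:
    """Inputs of a workflow_dispatch event (empty for other triggers)."""
    return event.get("inputs") or {}


def get_pr_labels(event: dict) -> list:
    """Label names of a pull_request event (empty for other triggers)."""
    pr = event.get("pull_request") or {}
    return [label["name"] for label in pr.get("labels", [])]
```

Passing `environ` as a parameter keeps the functions easy to test: a unit test can point `GITHUB_EVENT_PATH` at a temporary JSON file without touching the real environment.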

As well as a few 🪦REMOVED 🪦 features:
* No longer using `determine_long_lived_branch()` to change `push`
behavior based on the branch name - if we enable the workflow on a
branch we should run the same set of jobs
* No `run_functional_tests` plumbing - this has not been added to
multi-arch CI [yet?]

### Comparison Metrics

<details><summary>📊 Feature Comparison</summary>
<p>

| Feature | Old | New | Change |
|---------|-----|-----|--------|
| Pipeline architecture | Monolithic `matrix_generator` (295 lines) | 6-step pipeline of pure functions | Redesigned |
| Typing | Untyped `base_args` dict | 11 frozen dataclasses, `JobAction` enum | New |
| Skip CI gate | Buried in `matrix_generator` + `main()` | `should_skip_ci()` | Simplified |
| Target selection | Interleaved with matrix expansion | `select_targets()` | Separated |
| Test type | Post-hoc mutation in `main()` | `_determine_test_type()` | Separated |
| Matrix output | Array of per-variant rows via `strategy.matrix` | Single `build_config` JSON per platform | Simplified |
| Prebuilt stages | Boolean `use_prebuilt_artifacts` | Per-stage `dict[str, JobAction]` on `BuildRocmDecision` | More granular |
| Workflow contract | Implicit (YAML ↔ Python drift possible) | Contract tests extract `fromJSON` refs from YAML, assert against dataclass fields | Validated |
| Step summary | Limited information | Markdown with skip reasons, per-family test table, non-default callouts | Redesigned |
| `setup.yml` coupling | 8+ env vars piped through YAML → Python | Script reads `GITHUB_EVENT_PATH` directly | Decoupled |
</p>
</details>
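The "Typing" and "Prebuilt stages" rows above can be pictured with a small sketch. `JobAction` and `BuildRocmDecision` are named in this PR, but the enum members, the `stage_actions` field name, the helper methods, and the stage names used here are assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobAction(Enum):
    # Member names are assumed from the "skip/use prebuilt/build" options.
    SKIP = "skip"
    USE_PREBUILT = "use_prebuilt"
    BUILD = "build"


@dataclass(frozen=True)
class BuildRocmDecision:
    # Hypothetical field: one action per build stage instead of one boolean.
    stage_actions: dict = field(default_factory=dict)

    def stages_to_build(self) -> list:
        """Names of the stages that must be built from source."""
        return sorted(
            stage
            for stage, action in self.stage_actions.items()
            if action is JobAction.BUILD
        )

    def to_json_dict(self) -> dict:
        """Plain-dict form suitable for json.dumps()."""
        return {
            "stage_actions": {s: a.value for s, a in self.stage_actions.items()}
        }
```

A frozen dataclass rejects attribute assignment after construction, which makes it harder for a later stage to mutate a decision made earlier in the pipeline.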

<details><summary>🔎 Code Complexity</summary>
<p>

Around the same number of statements but with more comments and
structure:

| Metric | Old `configure_ci.py` (main) | New `configure_multi_arch_ci.py` + summary | Remaining `configure_ci.py` |
|--------|-----|-----|-----|
| Lines | 828 | 947 + 204 = 1,151 | 676 |
| Statements | 348 | 335 + 114 = 449 | 302 |
| Functions | 8 | 30 + 6 = 36 | 7 |
| Classes/dataclasses | 0 | 11 | 0 |

</p>
</details>

<details><summary>🧪 Test Coverage</summary>
<p>

Logic and dataclasses are tested step by step, inputs/outputs are pushed
to the boundaries for easier testing.

| Metric | Old (main) | New | Delta |
|--------|-----------|-----|-------|
| Test functions | 33 (configure_ci_test) | 47 (multi_arch) | +14 |
| Test lines | 907 | 826 | -81 |
| Remaining configure_ci tests | — | 27 | -6 (multi-arch tests removed) |
| Statement coverage (main script) | ~63% (configure_ci.py) | 84% (configure_multi_arch_ci.py) | +21pp |
| Uncovered code | `main()`, multi-arch paths mixed with single-arch | I/O boundary only (`from_environ`, `write_outputs`, `main`) | cleaner boundary |

</p>
</details> 

## Test Plan

* New unit tests
  * Tests for each stage, using the dataclasses/enums/etc. that are passed
    between them
  * A few integration tests for the whole `configure()` pipeline
  * Tests for the workflow YAML files and their `fromJSON` usage
* Manual testing
  * on `pull_request`, multi-arch CI should run by default
  * on `push` it should run with the same behavior as before
* Watch multi-arch CI behavior for `push` events after merge
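The "workflow YAML and `fromJSON` usage" tests in the plan above could look roughly like the sketch below. It is an assumption-laden illustration: the `BuildConfig` fields, the inline workflow snippet, and the helper names are hypothetical, and the real tests may parse the YAML differently.

```python
import re
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BuildConfig:
    # Hypothetical fields standing in for the real build_config schema.
    artifact_group: str
    amdgpu_families: str


# Hypothetical workflow excerpt; a real test would read the YAML file.
WORKFLOW_YAML = """
jobs:
  build:
    with:
      artifact_group: ${{ fromJSON(inputs.build_config).artifact_group }}
      amdgpu_families: ${{ fromJSON(inputs.build_config).amdgpu_families }}
"""

# Matches `fromJSON(inputs.build_config).<field>` and captures <field>.
FROM_JSON_REF = re.compile(r"fromJSON\(\s*inputs\.build_config\s*\)\.(\w+)")


def extract_from_json_refs(yaml_text: str) -> set:
    """Field names the workflow reads out of the build_config JSON."""
    return set(FROM_JSON_REF.findall(yaml_text))


def find_unknown_refs(yaml_text: str, dataclass_type) -> set:
    """Fields referenced in YAML that the dataclass does not define."""
    known = {f.name for f in fields(dataclass_type)}
    return extract_from_json_refs(yaml_text) - known
```

A test would then assert that `find_unknown_refs(...)` is empty, so that renaming a dataclass field without updating the workflow fails in unit tests rather than at workflow run time.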

## Test Result

Expected jobs ran on
https://github.com/ROCm/TheRock/actions/runs/23773060560?pr=4123,
configure output markdown looks as expected.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude <noreply@anthropic.com>
