Fork configuration scripts and workflows for multi-arch CI#4123
Conversation
New script for multi-arch CI configuration, designed as a pipeline of data transformations with typed dataclass interfaces between steps: 1. Parse Inputs (CIInputs from GITHUB_EVENT_PATH) 2. Check Skip CI (gate: skip-ci label, docs-only changes) 3. Select Targets (trigger type + labels → GPU families) 4. Decide Stages (changed files → rebuild/prebuilt per stage) 5. Expand Matrix (families × variant → matrix entries) 6. Write Outputs (JSON → GITHUB_OUTPUT) This commit is the scaffold — dataclasses and pipeline shape with stub implementations. Each step function has a TODO for Phase 2 logic. Tests (19 passing) demonstrate the pattern for exercising each step. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
The CI pipeline is a DAG of job groups (build-rocm, test-rocm, build-rocm-python, build-pytorch, test-pytorch), not a flat list of stages. This restructures the data model to match: - JobGroupDecision base class (run/prebuilt/skip) with subclasses for groups that need extra details (BuildRocmDecision for per-stage granularity, TestRocmDecision for test type) - JobDecisions with explicit named fields for each job group node - decide_jobs() replaces decide_stages() - CIOutputs carries JobDecisions instead of flat test_type/stage lists 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Rename internal field to is_ci_enabled (output key stays enable_build_jobs for workflow compat) - Remove from __future__ import annotations per style guide - Move inline imports to top of file - Use named arguments for multi-param function calls - Drop unused targets param from decide_jobs() - Note that steps 3 and 4 are independent 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
- Collapse 4 event type property tests into 1 mutual exclusivity check - Extract _run_from_environ helper for GITHUB_EVENT_PATH test fixtures with doc link explaining the testing strategy - Remove _make_inputs helper — inline CIInputs construction so each test's inputs are visible at the call site - Use named args on expand_matrix test call - Focus from_environ tests on what's unique per event type (PR labels, workflow_dispatch inputs, push before SHA) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Reorder pipeline steps: decide_jobs (3) before select_targets (4) so target selection and matrix expansion are adjacent. select_targets policy per trigger type: - workflow_dispatch: explicit per-platform family lists from inputs - pull_request: presubmit defaults + gfx*/run-all-archs-ci label opt-ins - push: presubmit+postsubmit families - schedule: all families - Unknown trigger types raise ValueError Input parsing moved to the boundary: CIInputs.linux_amdgpu_families is now list[str] (parsed from comma-separated input in from_environ). Unknown family names fail fast with ValueError. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Expand comments in select_targets to explain the design rationale: trigger types form a progression from smallest (pull_request) to broadest (schedule), with workflow_dispatch giving full manual control. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…matrix Port the multi-arch matrix expansion logic from configure_ci.py's generate_multi_arch_matrix into the new pipeline's expand_matrix function. Groups families by build variant into a single MatrixEntry with per-family JSON for downstream per-architecture job expansion. A parity test confirms identical output to the old function for the same inputs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changes: - Replace variant-grouping loop with a flat membership check — there's only ever one build_variant per workflow run - Add expand_matrices() returning a MatrixExpansion dataclass for both platforms, with _expand_matrix_for_platform as a private helper - Rename lookup_matrix → all_families for consistency with select_targets - Log when a platform has no config for the requested build variant Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add amdgpu_family_matrix_test.py with schema validation tests: no duplicate family names per platform, required fields present, build_variants non-empty. This lets _expand_matrix_for_platform drop its defensive checks (seen_families dedup, .get() defaults for required fields) and index directly, relying on upstream validation from select_targets and data invariant tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename MatrixEntry → BuildConfig, expand_matrices → expand_build_configs. The helper returns BuildConfig | None instead of a 0-or-1-element list. BuildConfigs holds linux/windows as Optional fields. write_outputs now emits linux_build_config / windows_build_config as JSON objects (or empty string when skipped) plus linux_build_enabled / windows_build_enabled booleans, replacing the JSON array outputs that existed only to support a matrix strategy we don't need yet. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test through the public expand_build_configs API, not the private per-platform helper. Verify structural properties (schema, consistency between dist_amdgpu_families and matrix_per_family_json, filtering behavior) rather than specific data values from amdgpu_family_matrix.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Check skip-ci PR label first (takes priority), then delegate to is_ci_run_required() from configure_ci_path_filters.py for docs-only / no-files-changed detection. schedule and workflow_dispatch always proceed (changed_files is None). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add test_type determination to decide_jobs: quick (default), comprehensive (schedule), full (submodule changes or test labels), with test_filter: PR label override. Adopts the test type names from PR ROCm#3992. Extract GitContext dataclass to separate git-derived data from CIInputs. configure() is now fully pure — takes CIInputs + GitContext, no git calls. main() is the only place that builds GitContext.from_repo(). Tests construct GitContext directly. Rename parameters to ci_inputs/git_context throughout for clarity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace mutable test_type/reason with early returns in priority order: test_filter label > test:* labels > schedule > submodule change > default. Eliminates risk of fall-through overwrites between conditions. Raise ValueError on unrecognized test_filter values instead of silently ignoring. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each pipeline dataclass (CIInputs, GitContext, TargetSelection, JobDecisions, BuildConfigs) now has a log() method for CI diagnostics. configure() calls them after each step, replacing ad-hoc prints in main() and step functions. Remove getattr loops in favor of explicit field access. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add setup_multi_arch.yml: checkout, call configure_multi_arch_ci.py (reads GITHUB_EVENT_PATH directly, no env var pass-through chain), compute package version. Update multi_arch_ci.yml to call setup_multi_arch.yml instead of setup.yml. Replace matrix strategy with direct fromJSON property access on per-platform build config objects. Route prebuilt_stages and baseline_run_id through the job graph model (BuildRocmDecision.stage_decisions and .baseline_run_id) rather than as raw pass-through strings. write_outputs derives the output values from the structured decisions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix stale test_type values in module docstring (smoke/full → quick/ standard/comprehensive/full) - Replace sys.exit(1) with raise RuntimeError in from_environ() - Rename lookup_matrix → all_families in _filter_families_by_platform for consistency with the rest of the file - Remove unused sys import Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lowercase family names in _parse_comma_list and PR labels in from_environ(). Matrix keys are lowercase (gfx94x) but workflow_dispatch descriptions show uppercase (gfx94X) — normalizing at the input boundary so internal functions don't need to care. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the minimal bullet list with structured markdown: test type with reason, platform build config table (families, variant, artifact group), job group decision table, and prebuilt stages when set. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rich step summary with DAG visualization, per-family test runner table, build-pytorch details, and non-default configuration callouts. Moved to a separate file to keep the main script focused on pipeline logic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…torch - Use GitHub alert syntax (> [!NOTE]) for non-default callouts - Mechanical trigger line: "Trigger: `event` on `branch`, `variant` variant." - "Runner Label" column header in test-rocm table - Add "# Build graph" comment to DAG - Remove build-pytorch section (not enough detail yet to be useful) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add === phase headers === to configure() so logs read as a narrative.
Print all CIInputs fields unconditionally (empty values are informative).
Add context to check_skip_ci ("Checking N files against path filters").
Handle empty-families case in summary ("No GPU families selected").
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace 7+ individual BuildConfig inputs on multi_arch_ci_linux.yml and multi_arch_ci_windows.yml with a single build_config JSON string. Downstream workflows unpack with fromJSON(inputs.build_config).field. Rename matrix_per_family_json (double-encoded JSON string) to per_family_info (native list). Eliminates the fragile fromJSON(fromJSON(...)) pattern — matrix strategies now use fromJSON(inputs.build_config).per_family_info directly. Fix github_actions_utils → github_actions_api import after rebase. Remove stale parity test (matrix data diverged on main). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verify that workflow YAML references to fromJSON(inputs.build_config) only use fields that exist on the Python BuildConfig dataclass. Linux checks exact match (all fields used), Windows checks no unknown fields (build_pytorch not yet used there). Also move imports to top of test file and update ci:skip/ci:run-all-archs label names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pass prebuilt stage info through expand_build_configs to _expand_build_config_for_platform, setting fields at construction time rather than mutating frozen dataclasses after the fact. Removes 2 separate outputs from setup_multi_arch.yml (9 → 7). Workflow YAML reads them from build_config via fromJSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ript Multi-arch CI now uses configure_multi_arch_ci.py exclusively. Remove generate_multi_arch_matrix, the multi_arch parameter from matrix_generator, and the ci:run-multi-arch label check from configure_ci.py. Add ci:run-multi-arch label gate to check_skip_ci in the new script: pull_request triggers require the label to opt in, avoiding doubled CI load during the transition. push/schedule/workflow_dispatch are unaffected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Motivation I'm adding new code that depends on [`amdgpu_family_matrix.py`](https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py) in #4123 and I want these structural invariants tested locally so I don't need to assert on them in the new downstream code. ## Technical Details We'll migrate over to https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/new_amdgpu_family_matrix.py soon, so I'm just adding a minimal set of tests for now. ## Test Plan * Run the unit tests ## Test Results Tests can fail as expected: * `test_required_fields_present` ```diff λ git diff diff --git a/build_tools/github_actions/amdgpu_family_matrix.py b/build_tools/github_actions/amdgpu_family_matrix.py --- a/build_tools/github_actions/amdgpu_family_matrix.py +++ b/build_tools/github_actions/amdgpu_family_matrix.py @@ -174,7 +174,6 @@ amdgpu_family_info_matrix_nightly = { "windows": { "test-runs-on": "", "family": "gfx900", - "fetch-gfx-targets": [], "build_variants": ["release"], "expect_pytorch_failure": True, }, ``` ``` (.venv) λ pytest build_tools/github_actions/tests/amdgpu_family_matrix_test.py -vv ... build_tools\github_actions\tests\amdgpu_family_matrix_test.py::TestFamilyMatrixInvariants::test_required_fields_present - AssertionError: gfx900/windows missing required fields: {'fetch-gfx-targets'} ``` * `test_no_duplicate_family_names_per_platform` ```diff diff --git a/build_tools/github_actions/amdgpu_family_matrix.py b/build_tools/github_actions/amdgpu_family_matrix.py index 63ad8c8..6eae4125 100644 --- a/build_tools/github_actions/amdgpu_family_matrix.py +++ b/build_tools/github_actions/amdgpu_family_matrix.py @@ -146,6 +146,18 @@ amdgpu_family_info_matrix_presubmit = { "build_variants": ["release"], }, }, + "gfx120x-all": { + "linux": { + # TODO(#2683): Re-enable label once stable + # Label is linux-gfx120X-gpu-rocm + "test-runs-on": "", + "family": "gfx120X-all", + "fetch-gfx-targets": ["gfx1200", "gfx1201"], + "bypass_tests_for_releases": True, + "build_variants": ["release"], + "sanity_check_only_for_family": True, + }, + }, } ``` ``` (.venv) λ pytest build_tools/github_actions/tests/amdgpu_family_matrix_test.py -vv ... build_tools\github_actions\tests\amdgpu_family_matrix_test.py::TestFamilyMatrixInvariants::test_no_duplicate_family_names_per_platform - AssertionError: Duplicate family 'gfx120X-all' on linux: target 'gfx120x-all' and 'gfx120x' ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
HereThereBeDragons
left a comment
There was a problem hiding this comment.
looks good so far what i saw.
reviewed everything but:
build_tools/github_actions/configure_multi_arch_ci.py
build_tools/github_actions/tests/configure_multi_arch_ci_test.py
will do those tomorrow
| return "\n".join(lines) | ||
|
|
||
|
|
||
| def _non_default_callouts(ci_inputs: CIInputs, outputs: CIOutputs) -> list[str]: |
There was a problem hiding this comment.
i never heard callout in that context. maybe a few comments could help to set the expectation
There was a problem hiding this comment.
Would "highlights" make more sense than "callouts"? These are any parts of the CI configuration that are different from what runs by "default", such as when a label was added to opt-in to some extra builds or tests. I figured that if we only showed tables with the computed values, it wouldn't be obvious what about those values is worth paying attention to.
There was a problem hiding this comment.
ping pong with claude suggests:
_config_overrides
_override_notes
_active_overrides
## Motivation Progress on #3399. I'm extending the configuration scripts/workflows for multi-arch CI in #4123 and the docs could use a refresh. ## Technical Details * Highlight the differences between what jobs run on different triggers (I also have changes on #4123 that will surface this in the step summary output) * Elaborate on "CI" vs "Multi-Arch CI" setup and roadmap ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Claude <noreply@anthropic.com>
|
Thanks for the feedback. We've been landing substantial changes to the old/current configure_ci.py and workflows as part of switching our default workflow from ci.yml to multi_arch_ci.yml. I'll sync this, resolve merge conflicts, then start addressing the comments next week. |
HereThereBeDragons
left a comment
There was a problem hiding this comment.
reviewing: build_tools/github_actions/configure_multi_arch_ci.py
i would recommend splitting this in multiple files. you already have 5 distinct steps + the data classes.
still missing: test file
see also comment about running those workflows for release branches.
Resolved conflicts and moved code for * log links * skip labels
Pass CIInputs to expand_build_configs instead of just build_variant, giving it access to PR labels. When a test_runner:<kernel> label is present (e.g. test_runner:oem), override the test runner for families that have a matching test-runs-on-kernel entry in the family matrix. Families without kernel support get their test runner cleared. Also adds test_runner: to the step summary highlights section. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
I think splitting files will make more sense once we start filling in new logic for the build and test jobs. As is, this is just porting the existing configure_ci.py logic to a new framework. |
|
Addressed all comments. PTAL? |
## Motivation I'm adding new code that depends on [`amdgpu_family_matrix.py`](https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/amdgpu_family_matrix.py) in #4123 and I want these structural invariants tested locally so I don't need to assert on them in the new downstream code. ## Technical Details We'll migrate over to https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/new_amdgpu_family_matrix.py soon, so I'm just adding a minimal set of tests for now. ## Test Plan * Run the unit tests ## Test Results Tests can fail as expected: * `test_required_fields_present` ```diff λ git diff diff --git a/build_tools/github_actions/amdgpu_family_matrix.py b/build_tools/github_actions/amdgpu_family_matrix.py --- a/build_tools/github_actions/amdgpu_family_matrix.py +++ b/build_tools/github_actions/amdgpu_family_matrix.py @@ -174,7 +174,6 @@ amdgpu_family_info_matrix_nightly = { "windows": { "test-runs-on": "", "family": "gfx900", - "fetch-gfx-targets": [], "build_variants": ["release"], "expect_pytorch_failure": True, }, ``` ``` (.venv) λ pytest build_tools/github_actions/tests/amdgpu_family_matrix_test.py -vv ... build_tools\github_actions\tests\amdgpu_family_matrix_test.py::TestFamilyMatrixInvariants::test_required_fields_present - AssertionError: gfx900/windows missing required fields: {'fetch-gfx-targets'} ``` * `test_no_duplicate_family_names_per_platform` ```diff diff --git a/build_tools/github_actions/amdgpu_family_matrix.py b/build_tools/github_actions/amdgpu_family_matrix.py index 63ad8c8..6eae4125 100644 --- a/build_tools/github_actions/amdgpu_family_matrix.py +++ b/build_tools/github_actions/amdgpu_family_matrix.py @@ -146,6 +146,18 @@ amdgpu_family_info_matrix_presubmit = { "build_variants": ["release"], }, }, + "gfx120x-all": { + "linux": { + # TODO(#2683): Re-enable label once stable + # Label is linux-gfx120X-gpu-rocm + "test-runs-on": "", + "family": "gfx120X-all", + "fetch-gfx-targets": ["gfx1200", "gfx1201"], + "bypass_tests_for_releases": True, + "build_variants": ["release"], + "sanity_check_only_for_family": True, + }, + }, } ``` ``` (.venv) λ pytest build_tools/github_actions/tests/amdgpu_family_matrix_test.py -vv ... build_tools\github_actions\tests\amdgpu_family_matrix_test.py::TestFamilyMatrixInvariants::test_no_duplicate_family_names_per_platform - AssertionError: Duplicate family 'gfx120X-all' on linux: target 'gfx120x-all' and 'gfx120x' ``` ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
HereThereBeDragons
left a comment
There was a problem hiding this comment.
thanks you. looks good.
only minor ideas for improvement, directly answered in the comments/your reply.
Looks good: https://github.com/ROCm/TheRock/actions/runs/23810551251.
|
This adds new
setup_multi_arch.ymlandconfigure_multi_arch_ci_summary.pyconfiguration code for multi-arch CI.Motivation
The existing
build_tools/github_actions/configure_ci.pyscript and.github/workflows/setup.ymlworkflow were both tightly coupled to the non-multi-arch CI pipelines and multi-arch CI has some unique needs:I also judged that starting fresh would be easier, given architectural issues with the existing code:
matrix_generator()functionTechnical Details
Note
See my worklog for this feature branch here:
tasks/active/multi-arch-configure.mdThe new code is structured as a sequence of stages:
It includes several ✨NEW✨ features:
GITHUB_EVENT_PATHinstead of explicit/verbosegithub.event.inputsplumbingbuild_configJSON instead of explicit/verboseartifact_group,matrix_per_family_json,dist_amdgpu_families, etc. (to stay under input count limits and make cross-file maintenance hopefully easier)As well as a few 🪦REMOVED 🪦 features:
determine_long_lived_branch()to changepushbehavior based on the branch name - if we enable the workflow on a branch we should run the same set of jobsrun_functional_testsplumbing - this has not been added to multi-arch CI [yet?]Comparison Metrics
📊 Feature Comparison
matrix_generator(295 lines)base_argsdictJobActionenummatrix_generator+main()should_skip_ci()select_targets()main()_determine_test_type()strategy.matrixbuild_configJSON per platformuse_prebuilt_artifactsdict[str, JobAction]onBuildRocmDecisionfromJSONrefs from YAML, assert against dataclass fieldssetup.ymlcouplingGITHUB_EVENT_PATHdirectly🔎 Code Complexity
Around the same number of statements but with more comments and structure:
configure_ci.py(main)configure_multi_arch_ci.py+ summaryconfigure_ci.py🧪 Test Coverage
Logic and dataclasses are tested step by step, inputs/outputs are pushed to the boundaries for easier testing.
main(), multi-arch paths mixed with single-archfrom_environ,write_outputs,main)Test Plan
configure()pipelinefromJSONusagepull_request, multi-arch CI should run by defaultpushit should run with the same behavior as beforepushevents after mergeTest Result
Expected jobs ran on https://github.com/ROCm/TheRock/actions/runs/23773060560?pr=4123, configure output markdown looks as expected.
Submission Checklist