[ci] amdgpu_family_matrix.py data from S3 bucket#4020
Conversation
Enable runtime configuration of CI runner labels without requiring PRs. Overrides are fetched from a public S3 bucket and merged sparsely into the family matrix during CI configuration. Key features: - Graceful fallback if S3 is unreachable (CI continues with static config) - Module-level caching (one fetch per process) - Environment variables for testing and disabling overrides - Comprehensive unit tests with mocked network calls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set THEROCK_DISABLE_RUNNER_OVERRIDES=true in setup.yml so TheRock uses static matrix values (source of truth) - Use str2bool for env var parsing instead of "1" check - Add generate_overrides_json() helper to create S3 JSON from matrix - Simplify test_runner_overrides.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
|
||
| # Public HTTPS URL (no auth needed for reads) | ||
| DEFAULT_OVERRIDE_URL = ( | ||
| "https://therock-ci-config.s3.amazonaws.com/therock-runner-config.json" |
There was a problem hiding this comment.
Is this in version control somewhere? It might be hard to track changes, or troubleshoot if things go awry otherwise.
If it were committed and tracked in TheRock repo, couldn't it just be requested from there? As an example, currently the pre/post cleanup processes script is sparsely checked out from TheRock repo in test workflows in rocm-systems and rocm-libraries without being pinned. Is there a particular reason to use S3 buckets?
There was a problem hiding this comment.
there can be version control on the bucket. the purpose of this is to:
- TheRock being source of truth (when we onboard new machines, we test on TheRock first then land, then make update to s3)
- If we unpinned SHA for this script in monorepos, machines may go wacky as all workflows use these machines immediately (we want TheRock to test first - similar to gfx110X windows, we found issues!)
- Once machines are tested on TheRock, we can just use this
generate_overrides_jsonfunction to copy exactly and paste into s3
There was a problem hiding this comment.
Thanks for the context!
There was a problem hiding this comment.
I don't think this is the right way. If you trigger a workflow, how do you trace back which therock-runner-config.json version have been pulled? This is most likely not reproducible. An alternative could be a TheRock-config repo which you checkout without pinning, so you always get the latest version but you know what you got as this was logged. You can get the same with extra logging but a git repo feels way more natural.
There was a problem hiding this comment.
that's fair as well, I wanted some seamless way to update runner names without much PR friction
However, an extra repo of TheRock-config would end up being the same scenario of TheRock with rocm-libraries/rocm-systems (as we want to test via TheRock first, then propogate to rocm-libraries/systems).
I'm okay with keeping the current format of amdgpu_family_matrix.py with additional PRs in other places, let me know! I can close this PR out
There was a problem hiding this comment.
Also note that the data schema used for this config is far from trivial, and has no built in versioning (which you could get from a format like protobuf, flatbuffers, a version field in JSON, etc.). Using a completely unversioned file without any pinning is just asking for forwards/backwards-compatibility issues as the infrastructure code that reads from the file is changed (and it is changing already today - there's a rewrite in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/new_amdgpu_family_matrix.py and @HereThereBeDragons has more changes coming).
There was a problem hiding this comment.
If the concern is wanting to test in TheRock without propagating immediately to rocm-systems and rocm-libraries, why not create a production branch in TheRock repo, that r-s and r-l can point to for checkout. This way it's controllable in TheRock, yet you could allow whatever gpu matrix setup be the latest in TheRock main branch.
This is how I implemented the ability for rocm-systems and rocm-libraries to always take the latest build_tools from TheRock (and only this folder with sparse checkout) since no ref is specified. The next checkout is the original pinned checkout used for the remainder of the job.
A live example of logs showing the latest ref being checked out if unspecified per @marbre:
https://github.com/ROCm/rocm-libraries/actions/runs/23351169868/job/67941706099#step:2:65
During the step Fetch 'build_tools' from repository
Determining the default branch
Retrieving the default branch name
Default branch 'main'
Fetching the repository
"C:\Program Files\Git\cmd\git.exe" -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --filter=blob:none --depth=1 origin +refs/heads/main:refs/remotes/origin/main
From https://github.com/ROCm/TheRock
* [new branch] main -> origin/main
Determining the checkout info
Setting up sparse checkout
"C:\Program Files\Git\cmd\git.exe" sparse-checkout set build_tools
Checking out the ref
"C:\Program Files\Git\cmd\git.exe" checkout --progress --force -B main refs/remotes/origin/main
Switched to a new branch 'main'
branch 'main' set up to track 'origin/main'.
"C:\Program Files\Git\cmd\git.exe" log -1 --format=%H
095cfd35cc0c87d277c884757402768d76c7bb3e
Versus the step Checkout Repository
Fetching the repository
"C:\Program Files\Git\cmd\git.exe" -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin 1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9
From https://github.com/ROCm/TheRock
* branch 1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9 -> FETCH_HEAD
Determining the checkout info
"C:\Program Files\Git\cmd\git.exe" sparse-checkout disable
"C:\Program Files\Git\cmd\git.exe" config --local --unset-all extensions.worktreeConfig
Checking out the ref
"C:\Program Files\Git\cmd\git.exe" checkout --progress --force 1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9
Warning: you are leaving 1 commit behind, not connected to
any of your branches:
f9cf9bb [ci] fixing hipsparse gtest filter (#4053)
If you want to keep it by creating a new branch, this may be a good time
to do so with:
git branch <new-branch-name> f9cf9bb
HEAD is now at 1408a82 [ci] disabling mi325 test runners (#3969)
"C:\Program Files\Git\cmd\git.exe" log -1 --format=%H
1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9There was a problem hiding this comment.
That's brittle too - we can't rename or move that cleanup_processes.ps1 file now without breaking CI in that repository. It's a small file, just copy it into the repository that needs it?
There was a problem hiding this comment.
That's definitely a limitation as there were moments where I did want to rename the script. I think at the time the trade off was for faster iteration since every repo would be running tests on runners shared by all repos. Would have to consider alternate implementations in the future.
There was a problem hiding this comment.
We pin the versions of dependencies precisely so we don't get stuck being unable to update either repository safely like that. It might not break today, but any unpinned dependency like that is just waiting to break in the future.
marbre
left a comment
There was a problem hiding this comment.
Please take a look at my review comment. I think this needs more discussion before this should go in.
|
|
||
| # Public HTTPS URL (no auth needed for reads) | ||
| DEFAULT_OVERRIDE_URL = ( | ||
| "https://therock-ci-config.s3.amazonaws.com/therock-runner-config.json" |
There was a problem hiding this comment.
I don't think this is the right way. If you trigger a workflow, how do you trace back which therock-runner-config.json version have been pulled? This is most likely not reproducible. An alternative could be a TheRock-config repo which you checkout without pinning, so you always get the latest version but you know what you got as this was logged. You can get the same with extra logging but a git repo feels way more natural.
|
closing to re-work |
|
#3653 this is the current rework of the new amdgpu family matrix, still waiting for final approval from @ScottTodd. |
Currently when
amdgpu_family_matrix.pyis updated, this requires a change in TheRock, rocm-libraries and rocm-systems. When doing runner updates (such as loss of mi325 / maintenance /etc), it takes time to land these changesWith the S3 JSON file, we are able to test in TheRock then propogate changes in the S3 file where other repos can instantly pick up updates.
Changes:
gpu_runner_s3_config.pyto pull data from s3 file and update matrix (optionally)therock-ci-configwith JSON dataA workflow_dispatch test with
THEROCK_DISABLE_RUNNER_OVERRIDES = falseso it gets s3 changes ( i made gfx94X get a runner calledhello-from-aws): https://github.com/ROCm/TheRock/actions/runs/23220869538/job/67493747775#step:4:327With this update, forked PRs and normal PRs will be able to pick up this change
This is included in multi-arch since multi-arch shares setup.yml and configure_ci.py