Skip to content

[ci] amdgpu_family_matrix.py data from S3 bucket#4020

Closed
geomin12 wants to merge 8 commits into
mainfrom
users/geomin12/amdgpu-matrix-file-management
Closed

[ci] amdgpu_family_matrix.py data from S3 bucket#4020
geomin12 wants to merge 8 commits into
mainfrom
users/geomin12/amdgpu-matrix-file-management

Conversation

@geomin12
Copy link
Copy Markdown
Contributor

@geomin12 geomin12 commented Mar 17, 2026

Currently when amdgpu_family_matrix.py is updated, this requires a change in TheRock, rocm-libraries and rocm-systems. When doing runner updates (such as loss of mi325 / maintenance /etc), it takes time to land these changes

With the S3 JSON file, we are able to test in TheRock then propogate changes in the S3 file where other repos can instantly pick up updates.

Changes:

  • Creating gpu_runner_s3_config.py to pull data from s3 file and update matrix (optionally)
  • updates to setup.yml to not pick up override (for TheRock)
  • function to json.dumps to easily update s3 json file
  • created new s3 bucket called therock-ci-config with JSON data

A workflow_dispatch test with THEROCK_DISABLE_RUNNER_OVERRIDES = false so it gets s3 changes ( i made gfx94X get a runner called hello-from-aws): https://github.com/ROCm/TheRock/actions/runs/23220869538/job/67493747775#step:4:327

With this update, forked PRs and normal PRs will be able to pick up this change

This is included in multi-arch since multi-arch shares setup.yml and configure_ci.py

geomin12 and others added 7 commits March 16, 2026 17:15
Enable runtime configuration of CI runner labels without requiring PRs.
Overrides are fetched from a public S3 bucket and merged sparsely into
the family matrix during CI configuration.

Key features:
- Graceful fallback if S3 is unreachable (CI continues with static config)
- Module-level caching (one fetch per process)
- Environment variables for testing and disabling overrides
- Comprehensive unit tests with mocked network calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Set THEROCK_DISABLE_RUNNER_OVERRIDES=true in setup.yml so TheRock
  uses static matrix values (source of truth)
- Use str2bool for env var parsing instead of "1" check
- Add generate_overrides_json() helper to create S3 JSON from matrix
- Simplify test_runner_overrides.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@geomin12 geomin12 marked this pull request as ready for review March 18, 2026 15:14

# Public HTTPS URL (no auth needed for reads)
DEFAULT_OVERRIDE_URL = (
"https://therock-ci-config.s3.amazonaws.com/therock-runner-config.json"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in version control somewhere? It might be hard to track changes, or troubleshoot if things go awry otherwise.

If it were committed and tracked in TheRock repo, couldn't it just be requested from there? As an example, currently the pre/post cleanup processes script is sparsely checked out from TheRock repo in test workflows in rocm-systems and rocm-libraries without being pinned. Is there a particular reason to use S3 buckets?

Copy link
Copy Markdown
Contributor Author

@geomin12 geomin12 Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there can be version control on the bucket. the purpose of this is to:

  • TheRock being source of truth (when we onboard new machines, we test on TheRock first then land, then make update to s3)
  • If we unpinned SHA for this script in monorepos, machines may go wacky as all workflows use these machines immediately (we want TheRock to test first - similar to gfx110X windows, we found issues!)
  • Once machines are tested on TheRock, we can just use this generate_overrides_json function to copy exactly and paste into s3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right way. If you trigger a workflow, how do you trace back which therock-runner-config.json version have been pulled? This is most likely not reproducible. An alternative could be a TheRock-config repo which you checkout without pinning, so you always get the latest version but you know what you got as this was logged. You can get the same with extra logging but a git repo feels way more natural.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's fair as well, I wanted some seamless way to update runner names without much PR friction

However, an extra repo of TheRock-config would end up being the same scenario of TheRock with rocm-libraries/rocm-systems (as we want to test via TheRock first, then propogate to rocm-libraries/systems).

I'm okay with keeping the current format of amdgpu_family_matrix.py with additional PRs in other places, let me know! I can close this PR out

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that the data schema used for this config is far from trivial, and has no built in versioning (which you could get from a format like protobuf, flatbuffers, a version field in JSON, etc.). Using a completely unversioned file without any pinning is just asking for forwards/backwards-compatibility issues as the infrastructure code that reads from the file is changed (and it is changing already today - there's a rewrite in https://github.com/ROCm/TheRock/blob/main/build_tools/github_actions/new_amdgpu_family_matrix.py and @HereThereBeDragons has more changes coming).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the concern is wanting to test in TheRock without propagating immediately to rocm-systems and rocm-libraries, why not create a production branch in TheRock repo, that r-s and r-l can point to for checkout. This way it's controllable in TheRock, yet you could allow whatever gpu matrix setup be the latest in TheRock main branch.

https://github.com/ROCm/rocm-libraries/blob/47d3429cea5a05d25ac54cb2eede19a983e501f2/.github/workflows/therock-test-component.yml#L54-L72

This is how I implemented the ability for rocm-systems and rocm-libraries to always take the latest build_tools from TheRock (and only this folder with sparse checkout) since no ref is specified. The next checkout is the original pinned checkout used for the remainder of the job.

A live example of logs showing the latest ref being checked out if unspecified per @marbre:
https://github.com/ROCm/rocm-libraries/actions/runs/23351169868/job/67941706099#step:2:65
During the step Fetch 'build_tools' from repository

Determining the default branch
  Retrieving the default branch name
  Default branch 'main'
Fetching the repository
  "C:\Program Files\Git\cmd\git.exe" -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --filter=blob:none --depth=1 origin +refs/heads/main:refs/remotes/origin/main
  From https://github.com/ROCm/TheRock
   * [new branch]      main       -> origin/main
Determining the checkout info
Setting up sparse checkout
  "C:\Program Files\Git\cmd\git.exe" sparse-checkout set build_tools
Checking out the ref
  "C:\Program Files\Git\cmd\git.exe" checkout --progress --force -B main refs/remotes/origin/main
  Switched to a new branch 'main'
  branch 'main' set up to track 'origin/main'.
"C:\Program Files\Git\cmd\git.exe" log -1 --format=%H
095cfd35cc0c87d277c884757402768d76c7bb3e

Versus the step Checkout Repository

Fetching the repository
  "C:\Program Files\Git\cmd\git.exe" -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin 1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9
  From https://github.com/ROCm/TheRock
   * branch            1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9 -> FETCH_HEAD
Determining the checkout info
"C:\Program Files\Git\cmd\git.exe" sparse-checkout disable
"C:\Program Files\Git\cmd\git.exe" config --local --unset-all extensions.worktreeConfig
Checking out the ref
  "C:\Program Files\Git\cmd\git.exe" checkout --progress --force 1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9
  Warning: you are leaving 1 commit behind, not connected to
  any of your branches:
  
    f9cf9bb [ci] fixing hipsparse gtest filter (#4053)
  
  If you want to keep it by creating a new branch, this may be a good time
  to do so with:
  
   git branch <new-branch-name> f9cf9bb
  
  HEAD is now at 1408a82 [ci] disabling mi325 test runners (#3969)
"C:\Program Files\Git\cmd\git.exe" log -1 --format=%H
1408a826dc2c3446a0295c0ef0c71c4d92f0f4f9

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's brittle too - we can't rename or move that cleanup_processes.ps1 file now without breaking CI in that repository. It's a small file, just copy it into the repository that needs it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's definitely a limitation as there were moments where I did want to rename the script. I think at the time the trade off was for faster iteration since every repo would be running tests on runners shared by all repos. Would have to consider alternate implementations in the future.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We pin the versions of dependencies precisely so we don't get stuck being unable to update either repository safely like that. It might not break today, but any unpinned dependency like that is just waiting to break in the future.

@geomin12 geomin12 requested a review from amd-justchen March 18, 2026 22:49
Copy link
Copy Markdown
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at my review comment. I think this needs more discussion before this should go in.


# Public HTTPS URL (no auth needed for reads)
DEFAULT_OVERRIDE_URL = (
"https://therock-ci-config.s3.amazonaws.com/therock-runner-config.json"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right way. If you trigger a workflow, how do you trace back which therock-runner-config.json version have been pulled? This is most likely not reproducible. An alternative could be a TheRock-config repo which you checkout without pinning, so you always get the latest version but you know what you got as this was logged. You can get the same with extra logging but a git repo feels way more natural.

@geomin12 geomin12 marked this pull request as draft March 20, 2026 03:40
@geomin12
Copy link
Copy Markdown
Contributor Author

closing to re-work

@geomin12 geomin12 closed this Mar 20, 2026
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Mar 20, 2026
@HereThereBeDragons
Copy link
Copy Markdown
Contributor

#3653 this is the current rework of the new amdgpu family matrix, still waiting for final approval from @ScottTodd.
we could think about expanding the GpuRunner class do also pull setting from remote to get the latest runner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants