Add gating mechanism to ensure PyTorch wheels pass tests before releasing#1110
Conversation
marbre
left a comment
There was a problem hiding this comment.
Some early feedback but I would like to take a closer look.
Please copy over the relevant parts from the issue description to the PR description as not all conclusions are implemented with this PR (e.g. ROCm releases are released also if PyTorch tests / builds fail) whereas when reading the the issue one might come to another conclusion.
marbre
left a comment
There was a problem hiding this comment.
The workflow right now uses
echo "torch_version=`cd ${{ env.PACKAGE_DIST_DIR }}; python -c 'import glob; print(glob.glob("torch-*.whl")[0].split("-")[1])'`" >> $GITHUB_OUTPUTin the Build PyTorch Wheels step to determine the exact torch package version. Instead of using {package}-*+${rocm_suffix} in the later upload step to copy from staging to release, the build step should rather be extended to explicitly name the torchvision and torchaudio versions as well. As part of #1100, @ScottTodd as authored build_tools/github_actions/write_torch_version.py which could be landed first and then used and extended here. With that you can make sure you install and test the correct versions.
Side note: torch_version is passed to setup.venv.py
while torchaudio and torchvision versions are not. I assume those are not tested anyway right now but we could add a TODO note to make sure versions for them get passed as well as soon as / if we add tests for those.
Landed that script in 1b403be.
Yes, those are not tested yet. |
7abb8e1 to
48ab8a9
Compare
There was a problem hiding this comment.
Are these changes for testing while the PR is still under development? If so, please call them out as such in the code, so it doesn't confuse reviewers.
The `${{ env.PACKAGE_DIST_DIR }}` value was not making its way to the
script correctly (see [logs
1](https://github.com/ROCm/TheRock/actions/runs/16538063884/job/46775446523),
[logs
2](https://github.com/ROCm/TheRock/actions/runs/16542014324/job/46784536594)).
We would look in `/__w/TheRock/TheRock/output/packages/dist` while the
files were in `/home/runner/_work/TheRock/TheRock/output/packages/dist`.
This also includes other changes discussed on
#1110 (comment).
Tested:
* Linux: https://github.com/ROCm/TheRock/actions/runs/16542818823.
```
2025-07-26T19:11:37.0162660Z Looking for wheels in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0163431Z Found files in that directory:
2025-07-26T19:11:37.0164351Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/torchaudio-2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl
2025-07-26T19:11:37.0165576Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/pytorch_triton_rocm-3.3.1-cp312-cp312-linux_x86_64.whl
2025-07-26T19:11:37.0166748Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/torch-2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl
2025-07-26T19:11:37.0167850Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/torchvision-0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl
2025-07-26T19:11:37.0168436Z
2025-07-26T19:11:37.0168632Z Looking for 'torch in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0169204Z Found wheel at
'torch-2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl'
2025-07-26T19:11:37.0169772Z Parsed version
'2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
2025-07-26T19:11:37.0170235Z Looking for 'torchaudio in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0170861Z Found wheel at
'torchaudio-2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl'
2025-07-26T19:11:37.0171469Z Parsed version
'2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
2025-07-26T19:11:37.0171936Z Looking for 'torchvision in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0172641Z Found wheel at
'torchvision-0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl'
2025-07-26T19:11:37.0173242Z Parsed version
'0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
2025-07-26T19:11:37.0173740Z Looking for 'pytorch_triton_rocm in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0174222Z Found wheel at
'pytorch_triton_rocm-3.3.1-cp312-cp312-linux_x86_64.whl'
2025-07-26T19:11:37.0174539Z Parsed version '3.3.1'
2025-07-26T19:11:37.0174675Z
2025-07-26T19:11:37.0174679Z
2025-07-26T19:11:37.0174762Z Setting github output:
2025-07-26T19:11:37.0175683Z {'torch_version':
'2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'torchaudio_version':
'2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'torchvision_version':
'0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'triton_version': '3.3.1'}
```
Note: tests failed with `ERROR: Could not find a version that satisfies
the requirement pytorch-triton-rocm==3.3.1 (from torch) (from versions:
none)`. I think the wheels are in the bucket but the index needs to be
populated? We also need to plumb through the new versions, as discussed
on #1110.
* Windows: https://github.com/ROCm/TheRock/actions/runs/16543042986
```
Looking for wheels in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
Found files in that directory:
C:\home\runner\_work\TheRock\TheRock\output\packages\dist\torch-2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl
C:\home\runner\_work\TheRock\TheRock\output\packages\dist\torchaudio-2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl
Looking for 'torch in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
Found wheel at
'torch-2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl'
Parsed version
'2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
Looking for 'torchaudio in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
Found wheel at
'torchaudio-2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl'
Parsed version
'2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
Looking for 'torchvision in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
WARNING: Found no 'torchvision' wheels matching 'torchvision-*.whl'
Looking for 'pytorch_triton_rocm in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
WARNING: Found no 'pytorch_triton_rocm' wheels matching
'pytorch_triton_rocm-*.whl'
Did not find torchvision (that's okay, is not currently built on
Windows)
Did not find triton (that's okay, is not currently built on Windows)
Setting github output:
{'torch_version':
'2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'torchaudio_version':
'2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'}
```
070aab3 to
3ffddcd
Compare
02c2475 to
c8b11dd
Compare
marbre
left a comment
There was a problem hiding this comment.
When building the torch packages, ROCm is pulled from the CloudFront URL and not via the one pointing to the staging index. Should we pull from the staging bucket instead?
The main issue that this will block releasing torch wheels for which there isn't yet a test runner, see https://github.com/ROCm/TheRock/actions/runs/16649247276/job/47118083365. If target to run doesn't return a test target, we should still upload.
|
@ScottTodd @marbre wanted to get inputs on "When building the torch packages, ROCm is pulled from the CloudFront URL and not via the one pointing to the staging index. Should we pull from the staging bucket instead?" We copy the ROCm to staging URL too, so I can trigger to build from staging index if needed, but dont see too much of an advantage here as we are uploading the built wheels only after gating with tests. Will this be of any added significance? |
|
I think we'll need to remove |
This reverts commit 30dc451.
…sult is skipped if test_runs_on is populated
…2 only if tests pass even for missing CI runners case
…els to v2 only if tests pass even for missing CI runners case" This reverts commit bc9d2097f640bd902903adfb8fe764a012b4c50b.
cc51a5b to
91e4976
Compare
… an override and utilizing the same to decide upload to v2
| python ./build_tools/third_party/s3_management/manage.py ${{ env.S3_STAGING_SUBDIR }}/${{ matrix.target_bundle.amdgpu_family }} | ||
|
|
||
| ## TODO: Restrict uploading to the non-staging S3 directory until ROCm sanity checks and all validation tests have successfully passed. | ||
| - name: Upload Releases to S3 |
There was a problem hiding this comment.
(future work)
Out of curiosity, is this "upload from local to cloud" slower or faster than the "copy from cloud to cloud" that the Copy PyTorch wheels from staging to release S3 step in .github/workflows/build_portable_linux_pytorch_wheels.yml?
I think we could script these uploads/copies so we aren't inlining as much code into .yml files, in which case we could have a "copy release from staging to tested" mode on the script that we can use for both pytorch wheels and rocm wheels.
There was a problem hiding this comment.
(future work)
Let's make sure these changes are carried over to the Windows release workflows too. Moving steps into scripts instead of inlined commands in yml will help with that.
| amdgpu_family_info_matrix = ( | ||
| amdgpu_family_info_matrix_presubmit | amdgpu_family_info_matrix_postsubmit | ||
| ) |
There was a problem hiding this comment.
Fine for now since this is following the existing code patterns, but I just edited some of this code in 1376958.
We should be able to replace this code pattern with amdgpu_family_info_matrix_all now, in case we add runners for targets that are in the "nightly" (previously "xfail") group.
|
@ScottTodd made the changes requested. I have triggered a run to verify the last PR changes https://github.com/ROCm/TheRock/actions/runs/17283708950 |
Add inputs added for build_portable_linux_pytorch_wheels.yml from #1110 Added required inputs for both the release_portable_linux_packages(Which runs build_portable_linux_pytorch_wheels.yml) and also release_portable_linux_pytorch_wheels. --------- Co-authored-by: arravikum <arravikum@amd.com>
Raising this PR to add the changes we added for Pytorch Gating in Linux through #1110 Also added the staging bucket mechanism added for Linux builds Below are the steps we want integrated as part of windows builds too: Build PyTorch wheels Upload the built wheels to the v2-staging (staging bucket) Run PyTorch tests using wheels from the staging bucket Only if tests pass, copy the validated wheels from the staging bucket to the release bucket If no runner is available: Promotion is blocked by default. Set bypass_tests_for_releases=true only for exceptional cases under amdgpu_family_matrix.py --------- Co-authored-by: arravikum <arravikum@amd.com>
## Motivation Follow-up to #1110 and #1382. Progress on #1072. This test run https://github.com/ROCm/TheRock/actions/runs/17662170140/job/50199734639#step:6:37 failed with ``` ++ Exec [C:\runner\_work\TheRock\TheRock]$ 'C:\runner\_work\TheRock\TheRock\.venv\Scripts\python.exe' -m pip install --index-url=https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151 torch==2.10.0a0+rocm7.0.0rc20250908 Looking in indexes: https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151 ERROR: Could not find a version that satisfies the requirement torch==2.10.0a0+rocm7.0.0rc20250908 (from versions: 2.7.0a0+rocm7.0.0.dev0.661b3907cf184e33f44256c24b88fc28a9251ec4, 2.7.0a0+rocm7.0.0.dev0.98ed4ad77f79822694ec01a36180ec3b95f4bd00, 2.7.0a0+rocm7.0.0.dev0.dea79b8f65819d046c7ec00a2b3ccdf5e98fbe5a, 2.7.0a0+rocm7.0.0.dev0.e0d25c8e8ca28b56c8155902c8f04e1767de4394, 2.7.0a0+rocm7.0.0.dev0.e96d36b9b628476463ef6cecee752f601052a4d2, 2.9.0a0+rocm7.0.0rc20250804, 2.9.0a0+rocmsdk20250819, 2.9.0a0+rocmsdk20250820, 2.9.0a0+rocmsdk20250821, 2.9.0a0+rocmsdk20250825) ERROR: No matching distribution found for torch==2.10.0a0+rocm7.0.0rc20250908 ``` ## Technical Details The build job only uploaded to v2-staging, so the test job should download from v2-staging, not v2. * https://d25kgig7rdsyks.cloudfront.net/v2-staging/gfx1151/torch/ uploaded to * https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151/torch/ attempted to download from The change here was made on Linux but was overlooked during the porting to Windows: https://github.com/ROCm/TheRock/blob/eb05061cb055b8626ec2e083e4eb87e90bd79f02/.github/workflows/build_portable_linux_pytorch_wheels.yml#L219-L227 The https://github.com/ROCm/TheRock/actions/runs/17585233778 test run mentioned at #1382 (comment) did not run any tests and did not exercise this code because we only have gfx1151 runners and that used gfx110X-dgpu. Those are easy mistakes to make. We could 1. Rename the `cloudfront_url` input to carry more meaning about what it represents, like `package_index_url` or `staging_package_index_url` 2. Bring up runners for more GPUs or change the default workflow_dispatch GPU type to one we have test runners for ## Test Plan Untested. ## Test Result Nope. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
This PR introduces a gating mechanism to ensure PyTorch wheels are tested before being published to the release bucket.
Implementation design and alternatives were explored and documented in #1072
Only the solution below was selected and implemented:
This ensures that only tested and verified wheels are published to the release index.