Skip to content

Add gating mechanism to ensure PyTorch wheels pass tests before releasing#1110

Merged
araravik-psd merged 57 commits into
mainfrom
users/arravikum/pytorch_test_gating
Aug 28, 2025
Merged

Add gating mechanism to ensure PyTorch wheels pass tests before releasing#1110
araravik-psd merged 57 commits into
mainfrom
users/arravikum/pytorch_test_gating

Conversation

@araravik-psd
Copy link
Copy Markdown
Contributor

@araravik-psd araravik-psd commented Jul 24, 2025

This PR introduces a gating mechanism to ensure PyTorch wheels are tested before being published to the release bucket.

Implementation design and alternatives were explored and documented in #1072
Only the solution below was selected and implemented:


  1. Build PyTorch wheels
  2. Upload the built wheels to the v2-staging (staging bucket)
  3. Run PyTorch tests using wheels from the staging bucket
  4. Only if tests pass, copy the validated wheels from the staging bucket to the release bucket

This ensures that only tested and verified wheels are published to the release index.

Copy link
Copy Markdown
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some early feedback but I would like to take a closer look.

Please copy over the relevant parts from the issue description to the PR description as not all conclusions are implemented with this PR (e.g. ROCm releases are released also if PyTorch tests / builds fail) whereas when reading the the issue one might come to another conclusion.

Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/release_portable_linux_packages.yml Outdated
Comment thread .github/workflows/release_portable_linux_packages.yml Outdated
Comment thread .github/workflows/release_portable_linux_packages.yml Outdated
Comment thread .github/workflows/release_portable_linux_packages.yml
Comment thread .github/workflows/release_portable_linux_packages.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread build_tools/third_party/s3_management/manage.py Outdated
Copy link
Copy Markdown
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow right now uses

echo "torch_version=`cd ${{ env.PACKAGE_DIST_DIR }}; python -c 'import glob; print(glob.glob("torch-*.whl")[0].split("-")[1])'`" >> $GITHUB_OUTPUT

in the Build PyTorch Wheels step to determine the exact torch package version. Instead of using {package}-*+${rocm_suffix} in the later upload step to copy from staging to release, the build step should rather be extended to explicitly name the torchvision and torchaudio versions as well. As part of #1100, @ScottTodd as authored build_tools/github_actions/write_torch_version.py which could be landed first and then used and extended here. With that you can make sure you install and test the correct versions.

Side note: torch_version is passed to setup.venv.py

--packages torch==${TORCH_VERSION} \

while torchaudio and torchvision versions are not. I assume those are not tested anyway right now but we could add a TODO note to make sure versions for them get passed as well as soon as / if we add tests for those.

Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
@ScottTodd
Copy link
Copy Markdown
Member

The workflow right now uses

echo "torch_version=`cd ${{ env.PACKAGE_DIST_DIR }}; python -c 'import glob; print(glob.glob("torch-*.whl")[0].split("-")[1])'`" >> $GITHUB_OUTPUT

in the Build PyTorch Wheels step to determine the exact torch package version. Instead of using {package}-*+${rocm_suffix} in the later upload step to copy from staging to release, the build step should rather be extended to explicitly name the torchvision and torchaudio versions as well. As part of #1100, @ScottTodd as authored build_tools/github_actions/write_torch_version.py which could be landed first and then used and extended here. With that you can make sure you install and test the correct versions.

Landed that script in 1b403be.

Side note: torch_version is passed to setup.venv.py

--packages torch==${TORCH_VERSION} \

while torchaudio and torchvision versions are not. I assume those are not tested anyway right now but we could add a TODO note to make sure versions for them get passed as well as soon as / if we add tests for those.

Yes, those are not tested yet.

@araravik-psd araravik-psd force-pushed the users/arravikum/pytorch_test_gating branch from 7abb8e1 to 48ab8a9 Compare July 25, 2025 01:48
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread build_tools/github_actions/write_torch_versions.py Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment on lines 117 to 140
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes for testing while the PR is still under development? If so, please call them out as such in the code, so it doesn't confuse reviewers.

ScottTodd added a commit that referenced this pull request Jul 26, 2025
The `${{ env.PACKAGE_DIST_DIR }}` value was not making its way to the
script correctly (see [logs
1](https://github.com/ROCm/TheRock/actions/runs/16538063884/job/46775446523),
[logs
2](https://github.com/ROCm/TheRock/actions/runs/16542014324/job/46784536594)).
We would look in `/__w/TheRock/TheRock/output/packages/dist` while the
files were in `/home/runner/_work/TheRock/TheRock/output/packages/dist`.

This also includes other changes discussed on
#1110 (comment).

Tested:

* Linux: https://github.com/ROCm/TheRock/actions/runs/16542818823.

    ```
2025-07-26T19:11:37.0162660Z Looking for wheels in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
    2025-07-26T19:11:37.0163431Z Found files in that directory:
2025-07-26T19:11:37.0164351Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/torchaudio-2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl
2025-07-26T19:11:37.0165576Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/pytorch_triton_rocm-3.3.1-cp312-cp312-linux_x86_64.whl
2025-07-26T19:11:37.0166748Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/torch-2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl
2025-07-26T19:11:37.0167850Z
/home/runner/_work/TheRock/TheRock/output/packages/dist/torchvision-0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl
    2025-07-26T19:11:37.0168436Z 
2025-07-26T19:11:37.0168632Z Looking for 'torch in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0169204Z Found wheel at
'torch-2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl'
2025-07-26T19:11:37.0169772Z Parsed version
'2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
2025-07-26T19:11:37.0170235Z Looking for 'torchaudio in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0170861Z Found wheel at
'torchaudio-2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl'
2025-07-26T19:11:37.0171469Z Parsed version
'2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
2025-07-26T19:11:37.0171936Z Looking for 'torchvision in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0172641Z Found wheel at
'torchvision-0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-linux_x86_64.whl'
2025-07-26T19:11:37.0173242Z Parsed version
'0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
2025-07-26T19:11:37.0173740Z Looking for 'pytorch_triton_rocm in
'/home/runner/_work/TheRock/TheRock/output/packages/dist'
2025-07-26T19:11:37.0174222Z Found wheel at
'pytorch_triton_rocm-3.3.1-cp312-cp312-linux_x86_64.whl'
    2025-07-26T19:11:37.0174539Z   Parsed version '3.3.1'
    2025-07-26T19:11:37.0174675Z 
    2025-07-26T19:11:37.0174679Z 
    2025-07-26T19:11:37.0174762Z Setting github output:
2025-07-26T19:11:37.0175683Z {'torch_version':
'2.7.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'torchaudio_version':
'2.7.1a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'torchvision_version':
'0.22.1+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'triton_version': '3.3.1'}
    ```

Note: tests failed with `ERROR: Could not find a version that satisfies
the requirement pytorch-triton-rocm==3.3.1 (from torch) (from versions:
none)`. I think the wheels are in the bucket but the index needs to be
populated? We also need to plumb through the new versions, as discussed
on #1110.

* Windows: https://github.com/ROCm/TheRock/actions/runs/16543042986
    
    ```
Looking for wheels in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
    Found files in that directory:

C:\home\runner\_work\TheRock\TheRock\output\packages\dist\torch-2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl

C:\home\runner\_work\TheRock\TheRock\output\packages\dist\torchaudio-2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl
Looking for 'torch in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
Found wheel at
'torch-2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl'
Parsed version
'2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
Looking for 'torchaudio in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
Found wheel at
'torchaudio-2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9-cp312-cp312-win_amd64.whl'
Parsed version
'2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'
Looking for 'torchvision in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
WARNING: Found no 'torchvision' wheels matching 'torchvision-*.whl'
Looking for 'pytorch_triton_rocm in
'C:\home\runner\_work\TheRock\TheRock\output\packages\dist'
WARNING: Found no 'pytorch_triton_rocm' wheels matching
'pytorch_triton_rocm-*.whl'
Did not find torchvision (that's okay, is not currently built on
Windows)
    Did not find triton (that's okay, is not currently built on Windows)
    Setting github output:
{'torch_version':
'2.9.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9',
'torchaudio_version':
'2.8.0a0+rocm7.0.0.dev0.9b07c81904cf711789f5c8c43919135d018f75a9'}
    ```
@araravik-psd araravik-psd force-pushed the users/arravikum/pytorch_test_gating branch from 070aab3 to 3ffddcd Compare July 29, 2025 15:02
araravik-psd pushed a commit that referenced this pull request Jul 30, 2025
@araravik-psd araravik-psd force-pushed the users/arravikum/pytorch_test_gating branch from 02c2475 to c8b11dd Compare July 30, 2025 14:05
@araravik-psd araravik-psd requested review from ScottTodd and marbre July 30, 2025 17:24
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/release_portable_linux_packages.yml
@araravik-psd araravik-psd requested a review from ScottTodd July 30, 2025 22:21
Copy link
Copy Markdown
Member

@ScottTodd ScottTodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, may want @marbre to take another look before merging though.

Copy link
Copy Markdown
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When building the torch packages, ROCm is pulled from the CloudFront URL and not via the one pointing to the staging index. Should we pull from the staging bucket instead?

The main issue that this will block releasing torch wheels for which there isn't yet a test runner, see https://github.com/ROCm/TheRock/actions/runs/16649247276/job/47118083365. If target to run doesn't return a test target, we should still upload.

Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
Comment thread .github/workflows/release_portable_linux_packages.yml Outdated
@araravik-psd
Copy link
Copy Markdown
Contributor Author

@ScottTodd @marbre wanted to get inputs on

#1110 (review)

"When building the torch packages, ROCm is pulled from the CloudFront URL and not via the one pointing to the staging index. Should we pull from the staging bucket instead?"

We copy the ROCm to staging URL too, so I can trigger to build from staging index if needed, but dont see too much of an advantage here as we are uploading the built wheels only after gating with tests. Will this be of any added significance?

@ScottTodd
Copy link
Copy Markdown
Member

I think we'll need to remove benc-uk/workflow-dispatch from the middle of these workflows so we can have a single workflow run that builds, tests, and then pushes. I'm not comfortable with partial pushes if tests on one architecture pass but tests on another fail. We can get that behavior in other ways, but the big disconnect there is making this quite error prone.

@araravik-psd araravik-psd force-pushed the users/arravikum/pytorch_test_gating branch from cc51a5b to 91e4976 Compare August 27, 2025 18:36
@ScottTodd ScottTodd self-requested a review August 27, 2025 22:32
Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
python ./build_tools/third_party/s3_management/manage.py ${{ env.S3_STAGING_SUBDIR }}/${{ matrix.target_bundle.amdgpu_family }}

## TODO: Restrict uploading to the non-staging S3 directory until ROCm sanity checks and all validation tests have successfully passed.
- name: Upload Releases to S3
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(future work)

Out of curiosity, is this "upload from local to cloud" slower or faster than the "copy from cloud to cloud" that the Copy PyTorch wheels from staging to release S3 step in .github/workflows/build_portable_linux_pytorch_wheels.yml?

I think we could script these uploads/copies so we aren't inlining as much code into .yml files, in which case we could have a "copy release from staging to tested" mode on the script that we can use for both pytorch wheels and rocm wheels.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(future work)

Let's make sure these changes are carried over to the Windows release workflows too. Moving steps into scripts instead of inlined commands in yml will help with that.

Comment thread build_tools/github_actions/amdgpu_family_matrix.py Outdated
Comment on lines +53 to +55
amdgpu_family_info_matrix = (
amdgpu_family_info_matrix_presubmit | amdgpu_family_info_matrix_postsubmit
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine for now since this is following the existing code patterns, but I just edited some of this code in 1376958.

We should be able to replace this code pattern with amdgpu_family_info_matrix_all now, in case we add runners for targets that are in the "nightly" (previously "xfail") group.

Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
@araravik-psd
Copy link
Copy Markdown
Contributor Author

araravik-psd commented Aug 28, 2025

@ScottTodd made the changes requested. I have triggered a run to verify the last PR changes

https://github.com/ROCm/TheRock/actions/runs/17283708950

Copy link
Copy Markdown
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build_tools_github_actions_configure_target_run_test failed in the CI. This needs to be addressed before we're able to land this.

Comment thread .github/workflows/build_portable_linux_pytorch_wheels.yml Outdated
@araravik-psd araravik-psd requested a review from marbre August 28, 2025 13:57
Copy link
Copy Markdown
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, thanks!

@araravik-psd araravik-psd merged commit fca250f into main Aug 28, 2025
8 checks passed
@araravik-psd araravik-psd deleted the users/arravikum/pytorch_test_gating branch August 28, 2025 14:04
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Aug 28, 2025
araravik-psd added a commit that referenced this pull request Aug 29, 2025
Add inputs added for build_portable_linux_pytorch_wheels.yml from #1110 

Added required inputs for both the release_portable_linux_packages(Which
runs build_portable_linux_pytorch_wheels.yml) and also
release_portable_linux_pytorch_wheels.

---------

Co-authored-by: arravikum <arravikum@amd.com>
araravik-psd added a commit that referenced this pull request Sep 9, 2025
Raising this PR to add the changes we added for Pytorch Gating in Linux
through #1110

Also added the staging bucket mechanism added for Linux builds

Below are the steps we want integrated as part of windows builds too:

Build PyTorch wheels
Upload the built wheels to the v2-staging (staging bucket)
Run PyTorch tests using wheels from the staging bucket
Only if tests pass, copy the validated wheels from the staging bucket to
the release bucket

If no runner is available: Promotion is blocked by default. Set
bypass_tests_for_releases=true only for exceptional cases under
amdgpu_family_matrix.py

---------

Co-authored-by: arravikum <arravikum@amd.com>
ScottTodd added a commit that referenced this pull request Sep 12, 2025
## Motivation

Follow-up to #1110 and
#1382. Progress on
#1072.

This test run
https://github.com/ROCm/TheRock/actions/runs/17662170140/job/50199734639#step:6:37
failed with
```
 ++ Exec [C:\runner\_work\TheRock\TheRock]$ 'C:\runner\_work\TheRock\TheRock\.venv\Scripts\python.exe' -m pip install --index-url=https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151 torch==2.10.0a0+rocm7.0.0rc20250908
Looking in indexes: https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151
ERROR: Could not find a version that satisfies the requirement torch==2.10.0a0+rocm7.0.0rc20250908 (from versions: 2.7.0a0+rocm7.0.0.dev0.661b3907cf184e33f44256c24b88fc28a9251ec4, 2.7.0a0+rocm7.0.0.dev0.98ed4ad77f79822694ec01a36180ec3b95f4bd00, 2.7.0a0+rocm7.0.0.dev0.dea79b8f65819d046c7ec00a2b3ccdf5e98fbe5a, 2.7.0a0+rocm7.0.0.dev0.e0d25c8e8ca28b56c8155902c8f04e1767de4394, 2.7.0a0+rocm7.0.0.dev0.e96d36b9b628476463ef6cecee752f601052a4d2, 2.9.0a0+rocm7.0.0rc20250804, 2.9.0a0+rocmsdk20250819, 2.9.0a0+rocmsdk20250820, 2.9.0a0+rocmsdk20250821, 2.9.0a0+rocmsdk20250825)
ERROR: No matching distribution found for torch==2.10.0a0+rocm7.0.0rc20250908
```

## Technical Details

The build job only uploaded to v2-staging, so the test job should
download from v2-staging, not v2.

* https://d25kgig7rdsyks.cloudfront.net/v2-staging/gfx1151/torch/
uploaded to
* https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151/torch/ attempted to
download from

The change here was made on Linux but was overlooked during the porting
to Windows:
https://github.com/ROCm/TheRock/blob/eb05061cb055b8626ec2e083e4eb87e90bd79f02/.github/workflows/build_portable_linux_pytorch_wheels.yml#L219-L227

The https://github.com/ROCm/TheRock/actions/runs/17585233778 test run
mentioned at
#1382 (comment) did
not run any tests and did not exercise this code because we only have
gfx1151 runners and that used gfx110X-dgpu.

Those are easy mistakes to make. We could
1. Rename the `cloudfront_url` input to carry more meaning about what it
represents, like `package_index_url` or `staging_package_index_url`
2. Bring up runners for more GPUs or change the default
workflow_dispatch GPU type to one we have test runners for

## Test Plan

Untested.

## Test Result

Nope.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants