Windows Pytorch builds gating with tests #1382
Merged
Conversation
ScottTodd
requested changes
Sep 3, 2025
…com:ROCm/TheRock into users/arravikum/gating_windows_torch_builds
Contributor
Author
Made the changes requested in the PR review comments. It would be great if you could re-review and land this if everything looks good. Passing run with gating: https://github.com/ROCm/TheRock/actions/runs/17585233778
marbre
reviewed
Sep 9, 2025
ScottTodd
reviewed
Sep 9, 2025
Member
ScottTodd
left a comment
Behavior looks about right, thanks for moving this into a script. A few comments about docs, then otherwise LGTM.
ScottTodd
approved these changes
Sep 9, 2025
araravik-psd
added a commit
that referenced
this pull request
Sep 10, 2025
…#1446)

Raising this PR to address the failure in the Release Windows Packages workflow, which recently gained an upload-to-staging S3 step as part of the merge of #1382.

Failure logs: [Release Windows packages · 28386af](https://github.com/ROCm/TheRock/actions/runs/17588163557/job/49961710451#step:17:74).

The fix has been validated with the workflow run below: https://github.com/ROCm/TheRock/actions/runs/17594149418/job/49982205211

Co-authored-by: arravikum <arravikum@amd.com>
1 task
ScottTodd
added a commit
that referenced
this pull request
Sep 12, 2025
## Motivation

Follow-up to #1110 and #1382. Progress on #1072.

This test run https://github.com/ROCm/TheRock/actions/runs/17662170140/job/50199734639#step:6:37 failed with

```
++ Exec [C:\runner\_work\TheRock\TheRock]$ 'C:\runner\_work\TheRock\TheRock\.venv\Scripts\python.exe' -m pip install --index-url=https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151 torch==2.10.0a0+rocm7.0.0rc20250908
Looking in indexes: https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151
ERROR: Could not find a version that satisfies the requirement torch==2.10.0a0+rocm7.0.0rc20250908 (from versions: 2.7.0a0+rocm7.0.0.dev0.661b3907cf184e33f44256c24b88fc28a9251ec4, 2.7.0a0+rocm7.0.0.dev0.98ed4ad77f79822694ec01a36180ec3b95f4bd00, 2.7.0a0+rocm7.0.0.dev0.dea79b8f65819d046c7ec00a2b3ccdf5e98fbe5a, 2.7.0a0+rocm7.0.0.dev0.e0d25c8e8ca28b56c8155902c8f04e1767de4394, 2.7.0a0+rocm7.0.0.dev0.e96d36b9b628476463ef6cecee752f601052a4d2, 2.9.0a0+rocm7.0.0rc20250804, 2.9.0a0+rocmsdk20250819, 2.9.0a0+rocmsdk20250820, 2.9.0a0+rocmsdk20250821, 2.9.0a0+rocmsdk20250825)
ERROR: No matching distribution found for torch==2.10.0a0+rocm7.0.0rc20250908
```

## Technical Details

The build job only uploaded to v2-staging, so the test job should download from v2-staging, not v2.

* https://d25kgig7rdsyks.cloudfront.net/v2-staging/gfx1151/torch/ uploaded to
* https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151/torch/ attempted to download from

The change here was made on Linux but was overlooked during the porting to Windows: https://github.com/ROCm/TheRock/blob/eb05061cb055b8626ec2e083e4eb87e90bd79f02/.github/workflows/build_portable_linux_pytorch_wheels.yml#L219-L227

The https://github.com/ROCm/TheRock/actions/runs/17585233778 test run mentioned at #1382 (comment) did not run any tests and did not exercise this code because we only have gfx1151 runners and that used gfx110X-dgpu.

Those are easy mistakes to make. We could

1. Rename the `cloudfront_url` input to carry more meaning about what it represents, like `package_index_url` or `staging_package_index_url`
2. Bring up runners for more GPUs or change the default workflow_dispatch GPU type to one we have test runners for

## Test Plan

Untested.

## Test Result

Nope.

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
ScottTodd
added a commit
that referenced
this pull request
Sep 17, 2025
Progress on #1072. More follow-up to #1382 for missed plumbing due to workflows being copy/pasted instead of reusing the same scripts.

This workflow run failed: https://github.com/ROCm/TheRock/actions/runs/17788093970/job/50562946451#step:6:35, since it was looking outside of the staging subdirectory:

```
++ Exec [C:\runner\_work\TheRock\TheRock]$ 'C:\runner\_work\TheRock\TheRock\.venv\Scripts\python.exe' -m pip install --index-url=https://rocm.nightlies.amd.com/gfx1151 torch==2.10.0a0+rocm7.0.0rc20250917
Looking in indexes: https://rocm.nightlies.amd.com/gfx1151
ERROR: Could not find a version that satisfies the requirement torch==2.10.0a0+rocm7.0.0rc20250917 (from versions: none)
```

The workflow that triggered it had this variable unset, so it appended empty string to the base URL: https://github.com/ROCm/TheRock/actions/runs/17786272473/job/50554458282#step:6:5

```
echo "cloudfront_url=${cloudfront_base_url}/v2" >> $GITHUB_OUTPUT
echo "cloudfront_staging_url=${cloudfront_base_url}/" >> $GITHUB_OUTPUT
```

Untested.
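One way to catch this class of failure earlier is to refuse to build an index URL from an empty subdirectory. The snippet below is only a sketch of that idea: `build_index_url` and the `example.invalid` base URL are hypothetical names for illustration, not code from TheRock.

```python
def build_index_url(base_url: str, subdir: str, amdgpu_family: str) -> str:
    """Join base URL, bucket subdirectory, and GPU family into a pip index URL.

    Hypothetical helper: raises instead of silently producing a URL that
    points outside the intended (e.g. staging) subdirectory.
    """
    cleaned_subdir = subdir.strip("/")
    if not cleaned_subdir:
        # An unset variable would otherwise yield "<base>/<family>".
        raise ValueError("subdir must not be empty (is the variable unset?)")
    if not amdgpu_family:
        raise ValueError("amdgpu_family must not be empty")
    return "/".join([base_url.rstrip("/"), cleaned_subdir, amdgpu_family])


if __name__ == "__main__":
    print(build_index_url("https://example.invalid", "v2-staging", "gfx1151"))
```

Failing loudly at URL construction would surface the unset variable in the step that produced it rather than in a later `pip install`.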
araravik-psd
added a commit
that referenced
this pull request
Sep 25, 2025
…eels step run when workflow is cancelled (#1579)

An issue was seen with the upload_pytorch_wheels gating check added through PR #1072 and #1382. Promotion from v2-staging to v2 should be blocked when the workflow is cancelled. Instead, the promote_wheels_based_on_policy.py helper script ends up promoting untested wheels from v2-staging to v2 if the upload flag is determined to be true here: https://github.com/ROCm/TheRock/blob/82d23dd64ad23e7ee9915240823375641661125d/build_tools/github_actions/promote_wheels_based_on_policy.py#L19-L44, falling through to the `# 4) Otherwise → upload=true` case.

This PR changes the workflows to handle cancelled runs for both Linux and Windows.

---------

Co-authored-by: arravikum <arravikum@amd.com>
Raising this PR to port to Windows the PyTorch gating changes made for Linux in #1110.
This also adds the staging bucket mechanism already used for Linux builds.
Below are the steps we want integrated into the Windows builds too:
1. Build PyTorch wheels
2. Upload the built wheels to v2-staging (the staging bucket)
3. Run PyTorch tests using wheels from the staging bucket
4. Only if tests pass, copy the validated wheels from the staging bucket to the release bucket

If no runner is available, promotion is blocked by default. Set `bypass_tests_for_releases=true` in amdgpu_family_matrix.py only for exceptional cases.
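The gating rule above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the logic of the actual helper script: `GateOutcome` and `should_promote` are hypothetical names, and the only identifier taken from this PR is `bypass_tests_for_releases`.

```python
from enum import Enum


class GateOutcome(Enum):
    """Possible results of the test job (hypothetical labels)."""
    PASSED = "success"
    FAILED = "failure"
    CANCELLED = "cancelled"
    NOT_RUN = "skipped"  # e.g. no test runner available for this GPU family


def should_promote(outcome: GateOutcome,
                   bypass_tests_for_releases: bool = False) -> bool:
    """Decide whether wheels may be copied from staging to release."""
    if outcome is GateOutcome.PASSED:
        return True
    if outcome is GateOutcome.NOT_RUN:
        # Blocked by default; the bypass flag is the explicit exception.
        return bypass_tests_for_releases
    # Failed or cancelled runs never promote, even with the bypass set.
    return False


if __name__ == "__main__":
    for outcome in GateOutcome:
        print(outcome.name, should_promote(outcome))
```

Listing the allowed cases and denying everything else keeps an unexpected state, such as a cancelled run, from falling through to promotion.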