
Windows Pytorch builds gating with tests #1382

Merged
araravik-psd merged 17 commits into main from users/arravikum/gating_windows_torch_builds
Sep 9, 2025

@araravik-psd
Contributor

This PR ports the PyTorch gating changes made for Linux in #1110 to the Windows builds, including the staging bucket mechanism added for the Linux builds.

The Windows builds should follow the same steps:

1. Build PyTorch wheels.
2. Upload the built wheels to the staging bucket (`v2-staging`).
3. Run PyTorch tests using wheels from the staging bucket.
4. Only if the tests pass, copy the validated wheels from the staging bucket to the release bucket.

If no runner is available, promotion is blocked by default. Set `bypass_tests_for_releases=true` in `amdgpu_family_matrix.py` only in exceptional cases.
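
The flow above can be sketched as a small Python driver. This is a minimal illustration, not the workflow's actual implementation: the function name `run_gated_release` and the four callables are hypothetical stand-ins for the workflow steps, and only the `bypass_tests_for_releases` flag name comes from the description.

```python
from typing import Callable


def run_gated_release(
    build: Callable[[], list[str]],
    upload_to_staging: Callable[[list[str]], None],
    run_tests: Callable[[], bool],
    promote_to_release: Callable[[list[str]], None],
    runner_available: bool,
    bypass_tests_for_releases: bool = False,
) -> str:
    """Hypothetical sketch of the build -> stage -> test -> promote gate."""
    wheels = build()            # 1. build the PyTorch wheels
    upload_to_staging(wheels)   # 2. upload them to the staging bucket

    if runner_available:
        # 3. test the staged wheels; 4. promote only if the tests pass
        if not run_tests():
            return "blocked: tests failed"
    elif not bypass_tests_for_releases:
        # No runner: promotion is blocked unless the bypass is set explicitly.
        return "blocked: no test runner"

    promote_to_release(wheels)
    return "promoted"
```

In this sketch the wheels always reach staging, but the release bucket is only written on the last line, so every early return leaves it untouched.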

@araravik-psd araravik-psd changed the title from "Users/arravikum/gating windows torch builds" to "Windows Pytorch builds gating with tests" Sep 3, 2025
@araravik-psd
Contributor Author

Made the changes requested in the PR review comments.

It would be great if this could be re-reviewed and landed if everything looks good.

Passing run with gating:

https://github.com/ROCm/TheRock/actions/runs/17585233778

Member

@ScottTodd ScottTodd left a comment

Behavior looks about right, thanks for moving into a script. A few comments about docs then otherwise LGTM

@araravik-psd araravik-psd merged commit 7bbed25 into main Sep 9, 2025
7 checks passed
@araravik-psd araravik-psd deleted the users/arravikum/gating_windows_torch_builds branch September 9, 2025 18:29
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Sep 9, 2025
araravik-psd added a commit that referenced this pull request Sep 10, 2025
…#1446)

This PR addresses a failure in the Release Windows Packages workflow,
which recently gained an upload-to-staging S3 step as part of #1382.

Failure logs:
[Release Windows packages ·
28386af](https://github.com/ROCm/TheRock/actions/runs/17588163557/job/49961710451#step:17:74).
 
Fix has been validated with the workflow run below:

https://github.com/ROCm/TheRock/actions/runs/17594149418/job/49982205211

Co-authored-by: arravikum <arravikum@amd.com>
ScottTodd added a commit that referenced this pull request Sep 12, 2025
## Motivation

Follow-up to #1110 and
#1382. Progress on
#1072.

This test run
https://github.com/ROCm/TheRock/actions/runs/17662170140/job/50199734639#step:6:37
failed with
```
 ++ Exec [C:\runner\_work\TheRock\TheRock]$ 'C:\runner\_work\TheRock\TheRock\.venv\Scripts\python.exe' -m pip install --index-url=https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151 torch==2.10.0a0+rocm7.0.0rc20250908
Looking in indexes: https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151
ERROR: Could not find a version that satisfies the requirement torch==2.10.0a0+rocm7.0.0rc20250908 (from versions: 2.7.0a0+rocm7.0.0.dev0.661b3907cf184e33f44256c24b88fc28a9251ec4, 2.7.0a0+rocm7.0.0.dev0.98ed4ad77f79822694ec01a36180ec3b95f4bd00, 2.7.0a0+rocm7.0.0.dev0.dea79b8f65819d046c7ec00a2b3ccdf5e98fbe5a, 2.7.0a0+rocm7.0.0.dev0.e0d25c8e8ca28b56c8155902c8f04e1767de4394, 2.7.0a0+rocm7.0.0.dev0.e96d36b9b628476463ef6cecee752f601052a4d2, 2.9.0a0+rocm7.0.0rc20250804, 2.9.0a0+rocmsdk20250819, 2.9.0a0+rocmsdk20250820, 2.9.0a0+rocmsdk20250821, 2.9.0a0+rocmsdk20250825)
ERROR: No matching distribution found for torch==2.10.0a0+rocm7.0.0rc20250908
```

## Technical Details

The build job only uploaded to v2-staging, so the test job should
download from v2-staging, not v2.

* Uploaded to: https://d25kgig7rdsyks.cloudfront.net/v2-staging/gfx1151/torch/
* Attempted to download from: https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151/torch/

The change here was made on Linux but was overlooked during the porting
to Windows:
https://github.com/ROCm/TheRock/blob/eb05061cb055b8626ec2e083e4eb87e90bd79f02/.github/workflows/build_portable_linux_pytorch_wheels.yml#L219-L227
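
To make the distinction concrete, here is a minimal sketch of composing the `pip install` arguments from an index subdirectory. The helper name `pip_install_args` and its parameters are hypothetical and not taken from the workflow; the host, GPU family, and version string come from the log above.

```python
def pip_install_args(
    base_url: str, index_subdir: str, gpu_family: str, requirement: str
) -> list[str]:
    """Hypothetical helper: build pip arguments for a given package index."""
    index_url = f"{base_url.rstrip('/')}/{index_subdir}/{gpu_family}"
    return ["-m", "pip", "install", f"--index-url={index_url}", requirement]


BASE = "https://d25kgig7rdsyks.cloudfront.net"
REQ = "torch==2.10.0a0+rocm7.0.0rc20250908"

# What the test job should use: the index the build job uploaded to.
staging_args = pip_install_args(BASE, "v2-staging", "gfx1151", REQ)

# What the failing run used: the release index, which does not contain
# the new wheel until it has been promoted.
release_args = pip_install_args(BASE, "v2", "gfx1151", REQ)
```

The two argument lists differ only in the index subdirectory, which is why the mismatch is easy to miss when porting a workflow.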

The https://github.com/ROCm/TheRock/actions/runs/17585233778 test run
mentioned at
#1382 (comment) did
not run any tests and did not exercise this code, because we only have
gfx1151 runners and that run used gfx110X-dgpu.

Those are easy mistakes to make. We could:
1. Rename the `cloudfront_url` input to carry more meaning about what it
represents, like `package_index_url` or `staging_package_index_url`
2. Bring up runners for more GPUs or change the default
workflow_dispatch GPU type to one we have test runners for

## Test Plan

Untested.

## Test Result

Nope.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
ScottTodd added a commit that referenced this pull request Sep 17, 2025
Progress on #1072. More follow-up
to #1382 for missed plumbing due to
workflows being copy/pasted instead of reusing the same scripts.

This workflow run failed:
https://github.com/ROCm/TheRock/actions/runs/17788093970/job/50562946451#step:6:35,
since it was looking outside of the staging subdirectory:
```
++ Exec [C:\runner\_work\TheRock\TheRock]$ 'C:\runner\_work\TheRock\TheRock\.venv\Scripts\python.exe' -m pip install --index-url=https://rocm.nightlies.amd.com/gfx1151 torch==2.10.0a0+rocm7.0.0rc20250917
Looking in indexes: https://rocm.nightlies.amd.com/gfx1151
ERROR: Could not find a version that satisfies the requirement torch==2.10.0a0+rocm7.0.0rc20250917 (from versions: none)
```

The workflow that triggered it had this variable unset, so an empty
string was appended to the base URL:
https://github.com/ROCm/TheRock/actions/runs/17786272473/job/50554458282#step:6:5
```
  echo "cloudfront_url=${cloudfront_base_url}/v2" >> $GITHUB_OUTPUT
  echo "cloudfront_staging_url=${cloudfront_base_url}/" >> $GITHUB_OUTPUT
```
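
One way to surface this kind of mistake earlier is to fail fast when the staging subdirectory is empty, rather than silently composing a URL that points at the index root. A minimal sketch, assuming a hypothetical helper `staging_index_url` that is not part of the repository:

```python
def staging_index_url(base_url: str, staging_subdir: str) -> str:
    """Hypothetical helper: join base URL and staging subdirectory.

    Rejects an empty subdirectory instead of returning "<base>/".
    """
    subdir = staging_subdir.strip().strip("/")
    if not subdir:
        # An unset variable expands to "", which would otherwise yield
        # a URL outside the staging subdirectory.
        raise ValueError(
            "staging subdirectory is empty; refusing to fall back to the index root"
        )
    return f"{base_url.rstrip('/')}/{subdir}"
```

With a check like this, the triggering workflow would fail at the step that computes the URL instead of in the downstream test job.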

Untested.
araravik-psd added a commit that referenced this pull request Sep 25, 2025
…eels step run when workflow is cancelled (#1579)

An issue was seen with the upload_pytorch_wheels gating check added
through PR #1072 and #1382.

Promotion from v2-staging to v2 should be blocked when the workflow is
cancelled. Instead, the promote_wheels_based_on_policy.py helper
script ends up promoting untested wheels from v2-staging to v2
whenever the upload flag is determined to be true here:

https://github.com/ROCm/TheRock/blob/82d23dd64ad23e7ee9915240823375641661125d/build_tools/github_actions/promote_wheels_based_on_policy.py#L19-L44
falling through to the `# 4) Otherwise → upload=true` branch.

This PR changes the workflows to handle cancellation on both Linux and
Windows.
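
The intended behavior can be sketched as a default-deny decision. This is an illustration only and does not reproduce the contents of promote_wheels_based_on_policy.py: the function name and parameters are hypothetical. The result strings (`success`, `failure`, `cancelled`, `skipped`) are the values GitHub Actions reports for a job's `result`.

```python
def should_promote(test_result: str, bypass_tests_for_releases: bool = False) -> bool:
    """Hypothetical default-deny policy for promoting v2-staging wheels to v2."""
    if test_result == "success":
        return True
    if test_result == "skipped":
        # Tests did not run; promote only with the explicit bypass.
        return bypass_tests_for_releases
    # "failure", "cancelled", and any unexpected value: never promote.
    return False
```

Ending with `return False` rather than `return True` means a state nobody anticipated, such as a cancelled run, cannot fall through to an upload.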

---------

Co-authored-by: arravikum <arravikum@amd.com>