Use cloudfront_staging_url for Windows pytorch testing workflow. #1469

Merged
ScottTodd merged 1 commit into main from users/scotttodd/torch-fix-staging-url
Sep 12, 2025

@ScottTodd
Member

Motivation

Follow-up to #1110 and #1382. Progress on #1072.

This test run https://github.com/ROCm/TheRock/actions/runs/17662170140/job/50199734639#step:6:37 failed with

 ++ Exec [C:\runner\_work\TheRock\TheRock]$ 'C:\runner\_work\TheRock\TheRock\.venv\Scripts\python.exe' -m pip install --index-url=https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151 torch==2.10.0a0+rocm7.0.0rc20250908
Looking in indexes: https://d25kgig7rdsyks.cloudfront.net/v2/gfx1151
ERROR: Could not find a version that satisfies the requirement torch==2.10.0a0+rocm7.0.0rc20250908 (from versions: 2.7.0a0+rocm7.0.0.dev0.661b3907cf184e33f44256c24b88fc28a9251ec4, 2.7.0a0+rocm7.0.0.dev0.98ed4ad77f79822694ec01a36180ec3b95f4bd00, 2.7.0a0+rocm7.0.0.dev0.dea79b8f65819d046c7ec00a2b3ccdf5e98fbe5a, 2.7.0a0+rocm7.0.0.dev0.e0d25c8e8ca28b56c8155902c8f04e1767de4394, 2.7.0a0+rocm7.0.0.dev0.e96d36b9b628476463ef6cecee752f601052a4d2, 2.9.0a0+rocm7.0.0rc20250804, 2.9.0a0+rocmsdk20250819, 2.9.0a0+rocmsdk20250820, 2.9.0a0+rocmsdk20250821, 2.9.0a0+rocmsdk20250825)
ERROR: No matching distribution found for torch==2.10.0a0+rocm7.0.0rc20250908

Technical Details

The build job only uploaded to v2-staging, so the test job should download from v2-staging, not v2.
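
To illustrate the mismatch, here is a minimal Python sketch of how the test job's pip command depends on which index it is pointed at. The `build_pip_install_command` helper and the `example.invalid` host are hypothetical; only the `<base>/<amdgpu_family>` layout and the `v2` vs. `v2-staging` names come from the log and description above.

```python
import sys


def build_pip_install_command(index_base_url, amdgpu_family, requirement,
                              python=sys.executable):
    """Return the argv list for installing `requirement` from a per-family index."""
    # The log above shows the family appended to the base URL, e.g. .../v2/gfx1151.
    index_url = f"{index_base_url.rstrip('/')}/{amdgpu_family}"
    return [python, "-m", "pip", "install", f"--index-url={index_url}", requirement]


# Hypothetical hosts: the real release and staging URLs are workflow inputs.
RELEASE_URL = "https://example.invalid/v2"
STAGING_URL = "https://example.invalid/v2-staging"

requirement = "torch==2.10.0a0+rocm7.0.0rc20250908"

# What the Windows test job effectively did: look in the release index,
# where the freshly built wheel had not been published.
wrong = build_pip_install_command(RELEASE_URL, "gfx1151", requirement)

# What it should do: look in the staging index that the build job uploaded to.
right = build_pip_install_command(STAGING_URL, "gfx1151", requirement)
```

Passing the returned list to `subprocess.run` would reproduce the install step; the only difference between the failing and the intended command is the base URL.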

The change here was made on Linux but was overlooked during the porting to Windows:

test_pytorch_wheels:
  name: Test | ${{ inputs.amdgpu_family }} | ${{ needs.generate_target_to_run.outputs.test_runs_on }}
  if: ${{ needs.generate_target_to_run.outputs.test_runs_on != '' }}
  needs: [build_pytorch_wheels, generate_target_to_run]
  uses: ./.github/workflows/test_pytorch_wheels.yml
  with:
    amdgpu_family: ${{ inputs.amdgpu_family }}
    test_runs_on: ${{ needs.generate_target_to_run.outputs.test_runs_on }}
    cloudfront_url: ${{ inputs.cloudfront_staging_url }}

The https://github.com/ROCm/TheRock/actions/runs/17585233778 test run mentioned at #1382 (comment) did not run any tests and did not exercise this code, because we only have gfx1151 runners and that run used gfx110X-dgpu.
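
The skipped tests follow from the `if:` guard in the job definition above: when no runner is mapped to the requested family, `test_runs_on` is empty and the test job never starts. A rough Python sketch of that gating, assuming a hypothetical family-to-runner table (the runner label is made up, and the real `generate_target_to_run` logic may differ):

```python
# Hypothetical mapping from GPU family to test runner label.
# The label is illustrative only and is not a real runner name.
TEST_RUNNERS = {"gfx1151": "example-windows-gfx1151-runner"}


def runner_label_for(amdgpu_family):
    """Return the runner label for a family, or '' when no runner exists."""
    return TEST_RUNNERS.get(amdgpu_family, "")


def should_run_tests(amdgpu_family):
    """Mirror the workflow guard: run tests only for a non-empty runner label."""
    return runner_label_for(amdgpu_family) != ""
```

Under these assumptions `should_run_tests("gfx110X-dgpu")` is false, so a run dispatched with that family builds wheels but silently skips the test job and never touches the index URL.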

Those are easy mistakes to make. We could

  1. Rename the cloudfront_url input to carry more meaning about what it represents, like package_index_url or staging_package_index_url
  2. Bring up runners for more GPUs or change the default workflow_dispatch GPU type to one we have test runners for

Test Plan

Untested.

Test Result

Nope.

Submission Checklist

Member

@marbre marbre left a comment

Those are easy mistakes to make. We could

  1. Rename the cloudfront_url input to carry more meaning about what it represents, like package_index_url or staging_package_index_url
  2. Bring up runners for more GPUs or change the default workflow_dispatch GPU type to one we have test runners for

Agree with both suggestions. The renaming in particular is low-hanging fruit.

@araravik-psd
Contributor

The change looks good.

I ran the workflow with the changes on both architectures: gfx1151 (we have runners for this) and gfx110 (we don't have runners). Unfortunately, I missed the test step on my run for gfx1151 below, as builds were failing for gfx1151 and did not proceed to the test step.

Run on gfx1151:

https://github.com/ROCm/TheRock/actions/runs/17590905752

Contributor

@araravik-psd araravik-psd left a comment

LGTM!

ScottTodd added a commit that referenced this pull request Sep 12, 2025
…#1437)

## Motivation

Fixes #1040, enabling aotriton for
flash attention in pytorch (if it works). This is expected to improve
performance in workloads like ComfyUI image generation by upwards of 60%
(e.g. 12.6 it/s to 20.0 it/s).

## Technical Details

Follow-up to #1432 and depends on
pytorch/pytorch#162330.

Note that support is experimental for some GPUs like gfx1100, so the
`TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` environment variable may be
needed to try aotriton on those systems.

## Test Plan

Trigger either
https://github.com/ROCm/TheRock/actions/workflows/build_windows_pytorch_wheels.yml
or
https://github.com/ROCm/TheRock/actions/workflows/release_windows_pytorch_wheels.yml
across the matrix of GPU families once that PyTorch PR is merged.

We're still going to need automated tests and documentation for this.
I'd like numerics tests running somewhere and documentation that shows
how to check which pytorch features are enabled in the wheels that a
user installs.

## Test Result

Test runs:

* https://github.com/ROCm/TheRock/actions/runs/17660396787 using this
branch and `7.0.0rc20250908` for gfx110X-dgpu
* ~~https://github.com/ROCm/TheRock/actions/runs/17660456285 using the
branch and `7.0.0rc20250908` for gfx1151~~
* https://github.com/ROCm/TheRock/actions/runs/17662170140 using the
branch and `7.0.0rc20250908` for gfx1151
* Tests not running should be fixed with
#1469

(may need to retrigger to pick up fixes for flaky checkouts)

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
@ScottTodd
Member Author

Unfortunately missed the test step on my run for gfx1151 below, as builds were failing for gfx1151 and did not proceed to the test step.

Run on gfx1151:

https://github.com/ROCm/TheRock/actions/runs/17590905752

ERROR: No matching distribution found for rocm==7.0.0rc20250904

Maybe you missed running https://github.com/ROCm/TheRock/actions/workflows/copy_release.yml for gfx1151?

@ScottTodd ScottTodd merged commit c937648 into main Sep 12, 2025
5 checks passed
@ScottTodd ScottTodd deleted the users/scotttodd/torch-fix-staging-url branch September 12, 2025 14:38
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Sep 12, 2025
ScottTodd added a commit that referenced this pull request Sep 15, 2025
## Motivation

Following up on some ideas from
#1469 to make workflows easier to
use.

## Details

* Set Windows workflow defaults to all gfx1151 (which we have test
machines for). Linux defaults to gfx94X-dcgpu
* Add some descriptions to workflow inputs
* Rename cloudfront_url to package_index_url (in one place, others may
want to change too)
* Pack job names with more information

## Test Plan

* copy_release.yml:
https://github.com/ROCm/TheRock/actions/runs/17686001057
* I have **not** tested running the other workflows. Could do that on
request (or revert this if it breaks workflows)