Merged
57 commits
90043d8
Made changes to split upload S3 with conditions
aravind-ravi1206 Jul 21, 2025
2830983
refactor changes
aravind-ravi1206 Jul 21, 2025
2db2495
Added copy to staging sub-dir and copy from staging to release steps
aravind-ravi1206 Jul 22, 2025
c7ed846
Added copy to staging sub-dir and copy from staging to release steps
aravind-ravi1206 Jul 22, 2025
25e4ffe
Updated directories for test
aravind-ravi1206 Jul 22, 2025
3c79406
Adding Staging sub-dir
aravind-ravi1206 Jul 22, 2025
a22814a
Adding staging prefixes to s3_management script
aravind-ravi1206 Jul 22, 2025
4f24452
fixed_typo
aravind-ravi1206 Jul 22, 2025
c7bc598
Made test execute from staging bucket
aravind-ravi1206 Jul 22, 2025
c40e61e
Made test execute from staging bucket
aravind-ravi1206 Jul 22, 2025
76a4f93
Changed the build index to staging bucket
aravind-ravi1206 Jul 22, 2025
c47dad2
Changed the build index to dev bucket
aravind-ravi1206 Jul 22, 2025
26acce7
Added test gating
aravind-ravi1206 Jul 22, 2025
d284128
Added repo checkout in upload_pytorch_wheels step
aravind-ravi1206 Jul 23, 2025
55e3938
Added staging sub-dirs and upload of python packages to staging
aravind-ravi1206 Jul 24, 2025
0f31e5c
Updated description for added staging attributes
aravind-ravi1206 Jul 24, 2025
9495af6
Fixed precommit issues
aravind-ravi1206 Jul 24, 2025
bfb8546
Added staging sub_dir variable in portable_linux_packages job
aravind-ravi1206 Jul 24, 2025
fd5c912
Addressed PR comments on #1110
aravind-ravi1206 Jul 24, 2025
c9d86fa
Fixed ordering of Staging index creation based on PR comments
aravind-ravi1206 Jul 24, 2025
81654c3
Fixed pre-commit failures
aravind-ravi1206 Jul 24, 2025
d700c99
Added PR comments changes
aravind-ravi1206 Jul 24, 2025
731d1d1
Added torchvision and torchaudio versions in write_torch_version.py a…
aravind-ravi1206 Jul 25, 2025
6a62bf4
Added ordering of non-staging and staging subdirs in manage.py
aravind-ravi1206 Jul 25, 2025
3d114d4
Explicit usage of bash as shell
aravind-ravi1206 Jul 25, 2025
7b0dcb9
Fixed typo in torchvideo to torchvision, split up cp_version output step
aravind-ravi1206 Jul 25, 2025
cb41903
Commenting out checkut of triton repo
aravind-ravi1206 Jul 25, 2025
1bbd27f
Adding locally tested file changes
aravind-ravi1206 Jul 25, 2025
2f05501
Reverting to pytorch/pytorch repo to verify build
aravind-ravi1206 Jul 25, 2025
16fba93
Reverting to pytorch/pytorch repo to verify build
aravind-ravi1206 Jul 25, 2025
d40c0c9
Adding changes after rebase, restoring torch checkouts to ROCm/pytorch
aravind-ravi1206 Jul 29, 2025
3cc50f8
Fixed pre-commit issues and trimmer write_torch_versions according to…
aravind-ravi1206 Jul 29, 2025
746e9ff
Added / in index_url to get proper index links
aravind-ravi1206 Jul 29, 2025
2248163
Adding quotes for index_url
aravind-ravi1206 Jul 29, 2025
b3ffedc
Reverted test_pytorch_changes and added GITHUB_OUTPUT usage from writ…
aravind-ravi1206 Jul 30, 2025
25bfa39
Added extraction of cp_version from env vars in job output
aravind-ravi1206 Jul 30, 2025
a88b0fb
Resolved PR review comments
aravind-ravi1206 Jul 30, 2025
db929bf
Resolved new PR review comments and added conditional upload if test …
aravind-ravi1206 Jul 31, 2025
bdff065
Modified if clause in 'Upload_pytorch_wheels' job
aravind-ravi1206 Jul 31, 2025
bc63a87
Resolved upload step conditional logic to remove always() inside if:
aravind-ravi1206 Jul 31, 2025
7192390
Adding test_pytorch_wheels in needs: of upload_pytorch_wheels job to …
aravind-ravi1206 Jul 31, 2025
ee26b7d
Add check-gate step output variable
aravind-ravi1206 Aug 1, 2025
f957f64
addressed PR reviews and converted upload flag to env variable
aravind-ravi1206 Aug 11, 2025
971d9f9
Removed check-gate id from workflow step
aravind-ravi1206 Aug 11, 2025
029ea02
Merged condition check for wheels upload and upload step
aravind-ravi1206 Aug 12, 2025
c0ed8f1
Revert "Merged condition check for wheels upload and upload step"
aravind-ravi1206 Aug 12, 2025
8d6b63e
Added needs.pytorch_build_wheels.results to upload check
aravind-ravi1206 Aug 12, 2025
1aa1473
Added a condition to skip upload even if needs.test_pytorch_wheels.re…
aravind-ravi1206 Aug 13, 2025
be258df
This is a dummy commit
aravind-ravi1206 Aug 21, 2025
cb3f485
Removing gating check step in job, enforcing promotion of wheels to v…
aravind-ravi1206 Aug 22, 2025
c4dc878
Revert "Removing gating check step in job, enforcing promotion of whe…
aravind-ravi1206 Aug 25, 2025
91e4976
Re-worked upload logic to be more readable
aravind-ravi1206 Aug 25, 2025
84b034e
Added upload_always flag for arch families that dont have a runner as…
aravind-ravi1206 Aug 27, 2025
d7d4e01
Adding always_upload in ouptuts for generate_target_to_run job
aravind-ravi1206 Aug 27, 2025
a68c3de
Addressing PR comments, changed 'always_upload' flag to 'bypass_tests…
aravind-ravi1206 Aug 28, 2025
2f8f638
Updated 'bypass_tests_for_releases' in amdgpu_family_matrix.py
aravind-ravi1206 Aug 28, 2025
576595c
Adding missed return '' in configure_target_run.py and updated job name
aravind-ravi1206 Aug 28, 2025
103 changes: 98 additions & 5 deletions .github/workflows/build_portable_linux_pytorch_wheels.yml
@@ -17,10 +17,18 @@ on:
description: S3 subdirectory, not including the GPU-family
required: true
type: string
s3_staging_subdir:
description: S3 staging subdirectory, not including the GPU-family
required: true
type: string
cloudfront_url:
description: CloudFront URL pointing to Python index
required: true
type: string
cloudfront_staging_url:
description: CloudFront base URL pointing to staging Python index
required: true
type: string
rocm_version:
description: ROCm version to pip install
type: string
@@ -55,10 +63,18 @@ on:
description: S3 subdirectory, not including the GPU-family
type: string
default: "v2"
s3_staging_subdir:
description: S3 staging subdirectory, not including the GPU-family
type: string
default: "v2-staging"
cloudfront_url:
description: CloudFront base URL pointing to Python index
type: string
default: "https://d25kgig7rdsyks.cloudfront.net/v2"
cloudfront_staging_url:
description: CloudFront base URL pointing to staging Python index
type: string
default: "https://d25kgig7rdsyks.cloudfront.net/v2-staging"
rocm_version:
description: ROCm version to pip install
type: string
@@ -89,7 +105,11 @@ jobs:
S3_BUCKET_PY: "therock-${{ inputs.release_type }}-python"
optional_build_prod_arguments: ""
outputs:
cp_version: ${{ env.cp_version }}
torch_version: ${{ steps.build-pytorch-wheels.outputs.torch_version }}
torchaudio_version: ${{ steps.build-pytorch-wheels.outputs.torchaudio_version }}
torchvision_version: ${{ steps.build-pytorch-wheels.outputs.torchvision_version }}
triton_version: ${{ steps.build-pytorch-wheels.outputs.triton_version }}
steps:
- name: Checkout
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
@@ -167,23 +187,24 @@ jobs:
run: |
python external-builds/pytorch/sanity_check_wheel.py ${{ env.PACKAGE_DIST_DIR }}/

- name: Upload wheels to S3
- name: Upload wheels to S3 staging
if: ${{ github.repository_owner == 'ROCm' }}
run: |
aws s3 cp ${{ env.PACKAGE_DIST_DIR }}/ s3://${{ env.S3_BUCKET_PY }}/${{ inputs.s3_subdir }}/${{ inputs.amdgpu_family }}/ \
aws s3 cp ${{ env.PACKAGE_DIST_DIR }}/ s3://${{ env.S3_BUCKET_PY }}/${{ inputs.s3_staging_subdir }}/${{ inputs.amdgpu_family }}/ \
--recursive --exclude "*" --include "*.whl"

- name: (Re-)Generate Python package release index
- name: (Re-)Generate Python package release index for staging
if: ${{ github.repository_owner == 'ROCm' }}
run: |
pip install boto3 packaging
python ./build_tools/third_party/s3_management/manage.py ${{ inputs.s3_subdir }}/${{ inputs.amdgpu_family }}
python ./build_tools/third_party/s3_management/manage.py ${{ inputs.s3_staging_subdir }}/${{ inputs.amdgpu_family }}

generate_target_to_run:
name: Generate target_to_run
runs-on: ubuntu-24.04
outputs:
test_runs_on: ${{ steps.configure.outputs.test-runs-on }}
bypass_tests_for_releases: ${{ steps.configure.outputs.bypass_tests_for_releases }}
steps:
- name: Checking out repository
uses: actions/checkout@08c6903cd8c0fde910a37f88322edcfb5dd907a8 # v5.0.0
@@ -203,7 +224,79 @@ jobs:
with:
amdgpu_family: ${{ inputs.amdgpu_family }}
test_runs_on: ${{ needs.generate_target_to_run.outputs.test_runs_on }}
cloudfront_url: ${{ inputs.cloudfront_url }}
cloudfront_url: ${{ inputs.cloudfront_staging_url }}
python_version: ${{ inputs.python_version }}
torch_version: ${{ needs.build_pytorch_wheels.outputs.torch_version }}
pytorch_version: ${{ inputs.pytorch_version }}

upload_pytorch_wheels:
name: Release PyTorch Wheels to S3
needs: [build_pytorch_wheels, generate_target_to_run, test_pytorch_wheels]
if: always()
runs-on: ubuntu-24.04
env:
S3_BUCKET_PY: "therock-${{ inputs.release_type }}-python"
CP_VERSION: "${{ needs.build_pytorch_wheels.outputs.cp_version }}"
TORCH_VERSION: "${{ needs.build_pytorch_wheels.outputs.torch_version }}"
TORCHAUDIO_VERSION: "${{ needs.build_pytorch_wheels.outputs.torchaudio_version }}"
TORCHVISION_VERSION: "${{ needs.build_pytorch_wheels.outputs.torchvision_version }}"
Comment thread
araravik-psd marked this conversation as resolved.
TRITON_VERSION: "${{ needs.build_pytorch_wheels.outputs.triton_version }}"

steps:
- name: Checkout
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

- name: Configure AWS Credentials
if: always()
uses: aws-actions/configure-aws-credentials@7474bc4690e29a8392af63c5b98e7449536d5c3a # v4.3.1
Comment on lines +246 to +251

nit: this if: always() condition isn't adding anything here. If "checkout" fails, there is no point in continuing. That code pattern got copy/pasted a few times because we wanted log uploads to complete even if builds failed, but what we should have instead is either:

  • a single "upload artifacts" step (or reusable workflow?) that bundles the AWS credentials setup with the upload scripts that need it.
  • credentials automatically configured on our self-hosted runners

with:
aws-region: us-east-2
role-to-assume: arn:aws:iam::692859939525:role/therock-${{ inputs.release_type }}-releases


- name: Determine upload flag
env:
BUILD_RESULT: ${{ needs.build_pytorch_wheels.result }}
TEST_RESULT: ${{ needs.test_pytorch_wheels.result }}
TEST_RUNS_ON: ${{ needs.generate_target_to_run.outputs.test_runs_on }}
BYPASS_TESTS_FOR_RELEASES: ${{ needs.generate_target_to_run.outputs.bypass_tests_for_releases }}
run: |
# 1) If the build failed → upload=false
if [[ "$BUILD_RESULT" != "success" ]]; then
echo "::warning::Build failed. Skipping upload."
echo "upload=false" >> "$GITHUB_ENV"

# 2) Else if there was a test runner AND tests failed or were skipped → upload=false
elif [[ -n "$TEST_RUNS_ON" && ( "$TEST_RESULT" == "failure" || "$TEST_RESULT" == "skipped" ) ]]; then
echo "::warning::Tests failed or were skipped (runner present). Skipping upload."
echo "upload=false" >> "$GITHUB_ENV"

# 3) Else if BYPASS_TESTS_FOR_RELEASES is not set and there was no test runner → upload=false
elif [[ -z "$BYPASS_TESTS_FOR_RELEASES" && -z "$TEST_RUNS_ON" ]]; then
echo "::warning::No test runner and BYPASS_TESTS_FOR_RELEASES not set. Skipping upload."
echo "upload=false" >> "$GITHUB_ENV"

# 4) Otherwise → upload=true
else
echo "upload=true" >> "$GITHUB_ENV"
fi

- name: Copy PyTorch wheels from staging to release S3
if: ${{ env.upload == 'true' }}
run: |
echo "Copying exact tested wheels to release S3 bucket..."
aws s3 cp \
s3://${S3_BUCKET_PY}/${{ inputs.s3_staging_subdir }}/${{ inputs.amdgpu_family }}/ \
s3://${S3_BUCKET_PY}/${{ inputs.s3_subdir }}/${{ inputs.amdgpu_family }}/ \
--recursive \
--exclude "*" \
--include "torch-${TORCH_VERSION}-${CP_VERSION}-linux_x86_64.whl" \
--include "torchaudio-${TORCHAUDIO_VERSION}-${CP_VERSION}-linux_x86_64.whl" \
--include "torchvision-${TORCHVISION_VERSION}-${CP_VERSION}-linux_x86_64.whl" \
--include "pytorch_triton_rocm-${TRITON_VERSION}-${CP_VERSION}-linux_x86_64.whl"

- name: (Re-)Generate Python package release index
if: ${{ env.upload == 'true' }}
run: |
pip install boto3 packaging
python ./build_tools/third_party/s3_management/manage.py ${{ inputs.s3_subdir }}/${{ inputs.amdgpu_family }}
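
The branch order of the "Determine upload flag" step above can be sketched in Python. This is a minimal sketch and not code from this PR: the function name `should_upload` and its `(bool, str)` return shape are assumptions made for illustration. The four inputs correspond to the `BUILD_RESULT`, `TEST_RESULT`, `TEST_RUNS_ON`, and `BYPASS_TESTS_FOR_RELEASES` environment variables of that step, all passed as strings the way GitHub Actions provides them.

```python
def should_upload(
    build_result: str,
    test_result: str,
    test_runs_on: str,
    bypass_tests_for_releases: str,
) -> tuple[bool, str]:
    """Hypothetical Python port of the shell gating logic.

    Returns (upload, reason); reason is empty when upload is True.
    """
    # 1) If the build failed, never upload.
    if build_result != "success":
        return False, "Build failed. Skipping upload."

    # 2) A test runner exists and the tests failed or were skipped.
    if test_runs_on and test_result in ("failure", "skipped"):
        return False, "Tests failed or were skipped (runner present). Skipping upload."

    # 3) No test runner, and the bypass flag is not set for this family.
    if not bypass_tests_for_releases and not test_runs_on:
        return False, "No test runner and BYPASS_TESTS_FOR_RELEASES not set. Skipping upload."

    # 4) Otherwise the wheels may be promoted from staging to release.
    return True, ""
```

The sketch mirrors the shell branches one to one, so any other test result with a runner present (for example `cancelled`) reaches the final branch, exactly as it does in the workflow step.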
37 changes: 31 additions & 6 deletions .github/workflows/release_portable_linux_packages.yml

(future work)

Let's make sure these changes are carried over to the Windows release workflows too. Moving steps into scripts instead of inlined commands in yml will help with that.

@@ -14,6 +14,10 @@ on:
description: "Subdirectory to push the Python packages"
type: string
default: "v2"
s3_staging_subdir:
description: "Staging subdirectory to push the Python packages"
type: string
default: "v2-staging"
# Trigger manually (typically to test the workflow or manually build a release [candidate])
workflow_dispatch:
inputs:
@@ -27,6 +31,10 @@ on:
description: "Subdirectory to push the Python packages"
type: string
default: "v2"
s3_staging_subdir:
description: "Staging subdirectory to push the Python packages"
type: string
default: "v2-staging"
families:
description: "Comma separated list of AMD GPU families, e.g. `gfx94X,gfx103x`"
type: string
@@ -44,6 +52,7 @@ jobs:
runs-on: ubuntu-24.04
env:
S3_SUBDIR: ${{ inputs.s3_subdir || 'v2' }}
S3_STAGING_SUBDIR: ${{ inputs.s3_staging_subdir || 'v2-staging' }}
release_type: ${{ inputs.release_type || 'nightly' }}
outputs:
version: ${{ steps.release_information.outputs.version }}
@@ -109,6 +118,7 @@ jobs:
S3_BUCKET_TAR: "therock-${{ needs.setup_metadata.outputs.release_type }}-tarball"
S3_BUCKET_PY: "therock-${{ needs.setup_metadata.outputs.release_type }}-python"
S3_SUBDIR: ${{ inputs.s3_subdir || 'v2' }}
S3_STAGING_SUBDIR: ${{ inputs.s3_staging_subdir || 'v2-staging' }}

steps:
- name: "Checking out repository"
@@ -147,6 +157,11 @@ jobs:
echo "Building ${{ env.DIST_ARCHIVE }}"
tar cfz "${{ env.DIST_ARCHIVE }}" .

- name: Setup Python
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: 3.12

- name: Build Python Packages
run: |
./build_tools/linux_portable_build.py \
@@ -171,6 +186,22 @@ jobs:
aws-region: us-east-2
role-to-assume: arn:aws:iam::692859939525:role/therock-${{ env.RELEASE_TYPE }}-releases

- name: Upload Releases to staging S3
if: ${{ github.repository_owner == 'ROCm' }}
run: |
aws s3 cp ${{ env.OUTPUT_DIR }}/packages/dist/ s3://${{ env.S3_BUCKET_PY }}/${{ env.S3_STAGING_SUBDIR }}/${{ matrix.target_bundle.amdgpu_family }}/ \
--recursive --no-follow-symlinks \
--exclude "*" \
--include "*.whl" \
--include "*.tar.gz"

- name: (Re-)Generate Python package release index for staging
if: ${{ github.repository_owner == 'ROCm' }}
Comment thread
araravik-psd marked this conversation as resolved.
run: |
pip install boto3 packaging
python ./build_tools/third_party/s3_management/manage.py ${{ env.S3_STAGING_SUBDIR }}/${{ matrix.target_bundle.amdgpu_family }}

## TODO: Restrict uploading to the non-staging S3 directory until ROCm sanity checks and all validation tests have successfully passed.
- name: Upload Releases to S3

(future work)

Out of curiosity, is this "upload from local to cloud" slower or faster than the "copy from cloud to cloud" that the Copy PyTorch wheels from staging to release S3 step in .github/workflows/build_portable_linux_pytorch_wheels.yml uses?

I think we could script these uploads/copies so we aren't inlining as much code into .yml files, in which case we could have a "copy release from staging to tested" mode on the script that we can use for both pytorch wheels and rocm wheels.
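
One possible shape for such a script, as a rough sketch only: the helper names `build_s3_cp_command`, `upload_command`, and `promote_command` are hypothetical and do not exist in the repository. The sketch only builds the `aws s3 cp` argument list, using the same flags the workflow steps in this PR already pass (`--recursive`, `--no-follow-symlinks`, `--exclude`, `--include`), so both the "upload from local" and "copy from staging" modes share one code path.

```python
def build_s3_cp_command(source, dest, include_patterns, extra_flags=()):
    """Return the argv for a recursive `aws s3 cp` limited to include_patterns."""
    cmd = ["aws", "s3", "cp", source, dest, "--recursive", *extra_flags]
    # Exclude everything first, then re-include only the requested patterns.
    cmd += ["--exclude", "*"]
    for pattern in include_patterns:
        cmd += ["--include", pattern]
    return cmd


def upload_command(local_dir, bucket, subdir, family):
    """Local directory -> S3 (the 'upload from local to cloud' mode)."""
    source = local_dir.rstrip("/") + "/"
    dest = f"s3://{bucket}/{subdir}/{family}/"
    return build_s3_cp_command(
        source, dest, ["*.whl", "*.tar.gz"], extra_flags=["--no-follow-symlinks"]
    )


def promote_command(bucket, staging_subdir, release_subdir, family, wheel_names):
    """Staging prefix -> release prefix (the 'copy from cloud to cloud' mode)."""
    source = f"s3://{bucket}/{staging_subdir}/{family}/"
    dest = f"s3://{bucket}/{release_subdir}/{family}/"
    # Only the exact wheel file names that were tested are copied.
    return build_s3_cp_command(source, dest, list(wheel_names))
```

A caller could hand the returned list to `subprocess.run(cmd, check=True)`. Keeping command construction separate from execution would let the same functions be unit tested and reused for both the PyTorch wheels and the ROCm wheels.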

if: ${{ github.repository_owner == 'ROCm' }}
run: |
@@ -181,12 +212,6 @@ jobs:
--include "*.whl" \
--include "*.tar.gz"

- name: Setup Python
if: ${{ github.repository_owner == 'ROCm' }}
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: 3.12

- name: (Re-)Generate Python package release index
if: ${{ github.repository_owner == 'ROCm' }}
run: |
5 changes: 5 additions & 0 deletions build_tools/github_actions/amdgpu_family_matrix.py
@@ -13,10 +13,12 @@
"linux": {
"test-runs-on": "",
"family": "gfx110X-dgpu",
"bypass_tests_for_releases": True,
},
"windows": {
"test-runs-on": "",
"family": "gfx110X-dgpu",
"bypass_tests_for_releases": True,
},
},
}
@@ -32,6 +34,7 @@
"linux": {
"test-runs-on": "",
"family": "gfx1151",
"bypass_tests_for_releases": True,
},
"windows": {
"test-runs-on": "windows-strix-halo-gpu-rocm",
@@ -42,10 +45,12 @@
"linux": {
"test-runs-on": "", # removed due to machine issues, label is "linux-rx9070-gpu-rocm"
"family": "gfx120X-all",
"bypass_tests_for_releases": True,
},
"windows": {
"test-runs-on": "",
"family": "gfx120X-all",
"bypass_tests_for_releases": True,
},
},
}
36 changes: 36 additions & 0 deletions build_tools/github_actions/configure_target_run.py
@@ -46,14 +46,50 @@ def get_runner_label(target: str, platform: str) -> str:
if test_runs_on_machine:
print(f" Found runner: '{test_runs_on_machine}'")
return test_runs_on_machine
return ""


def get_upload_label(target: str, platform: str) -> str:
print(f"Searching for a runner for target '{target}' on platform '{platform}'")
amdgpu_family_info_matrix = (
amdgpu_family_info_matrix_presubmit | amdgpu_family_info_matrix_postsubmit
)
Comment on lines +54 to +56

Fine for now since this is following the existing code patterns, but I just edited some of this code in 1376958.

We should be able to replace this code pattern with amdgpu_family_info_matrix_all now, in case we add runners for targets that are in the "nightly" (previously "xfail") group.
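
A sketch of what that replacement might look like, under stated assumptions: the dictionary below is stand-in data with made-up outer keys and a made-up runner label, and the real `amdgpu_family_info_matrix_all` in amdgpu_family_matrix.py may be shaped differently. The function name `get_bypass_flag` is also hypothetical. The matching rules are copied from `get_upload_label` in this diff.

```python
# Stand-in data for illustration only. The real matrix is defined in
# build_tools/github_actions/amdgpu_family_matrix.py.
amdgpu_family_info_matrix_all = {
    "gfx110x": {
        "linux": {
            "test-runs-on": "",
            "family": "gfx110X-dgpu",
            "bypass_tests_for_releases": True,
        },
    },
    "gfx94x": {
        "linux": {
            "test-runs-on": "example-runner-label",  # hypothetical label
            "family": "gfx94X-dcgpu",
        },
    },
}


def get_bypass_flag(target, platform, matrix=amdgpu_family_info_matrix_all):
    """Return True if a matching family/platform entry sets bypass_tests_for_releases."""
    for key, info_for_key in matrix.items():
        platform_for_key = info_for_key.get(platform)
        if not platform_for_key:
            # Some AMDGPU families are only supported on certain platforms.
            continue
        # Accept either the inner "family" or the outer key, as the diff does.
        family_for_platform = platform_for_key.get("family")
        if target != family_for_platform and key not in target.lower():
            continue
        if platform_for_key.get("bypass_tests_for_releases"):
            return True
    return False
```

Passing the matrix as a parameter means the lookup no longer needs to merge the presubmit and postsubmit dictionaries itself, and returning a plain `bool` avoids mixing `True` and `""` as return values.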

for key, info_for_key in amdgpu_family_info_matrix.items():
print(f"Checking key '{key}' with info:\n {info_for_key}")
platform_for_key = info_for_key.get(platform)

if not platform_for_key:
# Some AMDGPU families are only supported on certain platforms.
print(f" Skipping since this entry has no platform '{platform}'")
continue

# Check against both the inner "family" and the outer "key". If neither
# match then skip. Workflows are expected to use the inner "family"
# but manually triggered runs may use the outer "key" instead, so we'll
# be a bit lenient here.
# This needs a rework, see https://github.com/ROCm/TheRock/issues/1097.
family_for_platform = platform_for_key.get("family")
if target != family_for_platform and key not in target.lower():
print(
f" Skipping since the target '{target}' does not match the family '{family_for_platform}'"
)
continue

# If there is no test machine available and bypass_tests_for_releases flag is True for GPU family and platform, output bypass_tests_for_releases as True
bypass_tests_for_releases = platform_for_key.get("bypass_tests_for_releases")
if bypass_tests_for_releases:
print(f" bypass_tests_for_releases: True")
return bypass_tests_for_releases
return ""


def main(target: str, platform: str):
runner_label = get_runner_label(target, platform)
if runner_label:
gha_set_output({"test-runs-on": runner_label})
upload_label = get_upload_label(target, platform)
if upload_label:
gha_set_output({"bypass_tests_for_releases": upload_label})


if __name__ == "__main__":
3 changes: 0 additions & 3 deletions build_tools/github_actions/write_torch_versions.py
@@ -11,7 +11,6 @@
import os
import glob
import platform

from github_actions_utils import *


@@ -100,10 +99,8 @@ def main(argv: list[str]):
help="Path where wheels are located",
)
args = p.parse_args(argv)

if not args.dist_dir.exists():
raise FileNotFoundError(f"Dist dir '{args.dist_dir}' does not exist")

all_versions = get_all_wheel_versions(args.dist_dir)
_log("")
gha_set_output(all_versions)
5 changes: 5 additions & 0 deletions build_tools/third_party/s3_management/manage.py
@@ -40,6 +40,11 @@
"v2/gfx120X-all",
"v2/gfx94X-dcgpu",
"v2/gfx950-dcgpu",
"v2-staging/gfx110X-dgpu",
"v2-staging/gfx1151",
"v2-staging/gfx120X-all",
"v2-staging/gfx94X-dcgpu",
"v2-staging/gfx950-dcgpu"
]

CUSTOM_PREFIX = getenv('CUSTOM_PREFIX')