Add sccache support for PyTorch builds #3171
…ncatenation instead of Path() for relative links
… symlink handling and path resolution.
…ce CMake launcher setup for ROCm builds
lld doesn't work with mixed GCC/Clang builds - Triton uses GCC which doesn't support -fuse-ld=/path/to/lld syntax. Only Clang supports full path linker specification.
…pport in sccache;
… error handling and improve binary management.
…ws handling for ROCm builds; remove HIP compiler launcher due to compatibility issues.
ScottTodd
left a comment
Cool, I think this is heading in a good direction.
    - name: Configure AWS Credentials for sccache
      if: ${{ github.repository_owner == 'ROCm' }}
      uses: aws-actions/configure-aws-credentials@61815dcd50bd041e203e49132bacad1fd04d2708 # v5.1.1
      with:
        aws-region: us-east-2
        role-to-assume: arn:aws:iam::692859939525:role/therock-${{ inputs.release_type }}
These are the same roles as we use for uploading release files (python packages, artifacts). Do we want a separate role for using sccache?
Is the therock-pytorch-sccache bucket public read but private write, or private for both?
cc @marbre
The bucket is private and blocks all public access; it is only accessible through these roles: role-to-assume: arn:aws:iam::692859939525:role/therock-${{ inputs.release_type }}
Using the existing release role because:
The sccache operations (S3 GetObject/PutObject) are a subset of what the arn:aws:iam::692859939525:role/therock-${{ inputs.release_type }} roles already have.
Need @marbre and @amd-shiraz to weigh in here (and perhaps @amd-justchen too, given his prior work on the ccache server we use for building ROCm).
We need a clear policy written down for how cache buckets and access are handled. On other projects we've made these decisions:
- CI cache buckets are world readable so developers can benefit from the CI cache
- workflows running on `schedule` or `push` can read and write to the cache
- workflows running on `pull_request` can only read from the cache
- (what about `workflow_dispatch`?)
I'd like to apply the same policies for PyTorch and ROCm builds, so we aren't dealing with an explosion of different settings when we also enable caching for JAX and other projects.
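A minimal Python sketch of the policy listed above, written as a lookup from workflow trigger to cache access. The names `CacheAccess` and `cache_access_for_event` are hypothetical and do not exist in TheRock; `workflow_dispatch` is left undecided here because the comment leaves it open.

```python
from enum import Enum


class CacheAccess(Enum):
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"


# Events whose runs are trusted to populate the cache.
WRITE_EVENTS = {"schedule", "push"}
# Events that may only consume cached objects.
READ_EVENTS = {"pull_request"}


def cache_access_for_event(event_name: str) -> CacheAccess:
    """Map a GitHub Actions event name to the cache access it should get."""
    if event_name in WRITE_EVENTS:
        return CacheAccess.READ_WRITE
    if event_name in READ_EVENTS:
        return CacheAccess.READ_ONLY
    # workflow_dispatch (and anything else) has no agreed policy yet,
    # so this sketch refuses to guess instead of picking a default.
    raise ValueError(f"No cache policy defined for event: {event_name!r}")
```

Keeping the mapping in one place would let PyTorch, ROCm, and JAX builds share the same rule rather than each workflow encoding its own.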
Note that I also have #3303 open which creates a new workflow for building pytorch on CI. That will be the main place that a build cache will be needed. Having a cache for dev or nightly release builds is more of a nice-to-have given the reduced job frequency and lower bar for build cache integrity.
Also another reminder: keep PRs small and focused. This takes weeks to review because the change does multiple things at once, and each related piece has open design questions.
PR sequence:
- Add sccache to dockerfiles
- Set workflows to use new dockerfiles
- Add sccache support to build scripts
- Have workflows use the new sccache support
Each of those would have significantly shorter review turnaround time.
    - name: Install sccache
      run: |
        pip install sccache
        sccache --version
Let's get sccache into our build dockerfile, similar to ccache:
TheRock/dockerfiles/build_manylinux_x86_64.Dockerfile
Lines 27 to 30 in 9e9c726
- https://github.com/ROCm/TheRock/blob/main/dockerfiles/install_ccache.sh
- https://github.com/ROCm/TheRock/blob/main/dockerfiles/README.md#updating-images-used-by-github-actions-workflows
I don't trust a pip install in this workflow prior to the two steps below that select a python version and put that python version on PATH.
I agree we should get this into our base image
Done. Added sccache installation to the Docker image via dockerfiles/install_sccache.sh (similar pattern to ccache installation).
    S3_BUCKET_PY: "therock-${{ inputs.release_type }}-python"
    optional_build_prod_arguments: ""
    # sccache configuration for ROCm compiler caching with S3 backend
    SCCACHE_BUCKET: therock-pytorch-sccache
We may want separate cache buckets (or namespaces) for dev, nightly, and stable releases.
Updated to use separate buckets per environment:
- therock-dev-pytorch-sccache
- therock-nightly-pytorch-sccache
- therock-prerelease-pytorch-sccache
Each environment's IAM role (therock-dev, therock-nightly, therock-prerelease) has access only to its corresponding bucket.
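As a rough illustration, the per-environment bucket naming above could be combined with the sccache environment variables named in the PR description. The helper `sccache_s3_env` is hypothetical, and the region is an assumption taken from the `aws-region` value in the workflow excerpt; the real workflows set these values in YAML.

```python
RELEASE_TYPES = ("dev", "nightly", "prerelease")


def sccache_s3_env(release_type: str, os_name: str, arch: str) -> dict[str, str]:
    """Build the sccache S3 environment for one release type and platform."""
    if release_type not in RELEASE_TYPES:
        raise ValueError(f"Unknown release type: {release_type!r}")
    return {
        # One bucket per environment, matching the names in the comment above.
        "SCCACHE_BUCKET": f"therock-{release_type}-pytorch-sccache",
        # Assumption: the bucket lives in the region used for AWS credentials.
        "SCCACHE_REGION": "us-east-2",
        # Cache keys are namespaced by OS and architecture.
        "SCCACHE_S3_KEY_PREFIX": f"{os_name}/{arch}",
        "SCCACHE_S3_SERVER_SIDE_ENCRYPTION": "true",
        "SCCACHE_LOG": "warn",
    }
```

For example, `sccache_s3_env("nightly", "linux", "<arch>")` selects the `therock-nightly-pytorch-sccache` bucket, where `<arch>` stands for whatever architecture string the workflow passes.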
Check with @marbre for these bucket configurations and role settings. This TheRock repository will retain access to the "dev" role but nightly and prerelease are moving to https://github.com/ROCm/rockrel.
@marbre
Each environment's IAM role (therock-dev, therock-nightly, therock-prerelease) has access only to its corresponding bucket.
- therock-dev-pytorch-sccache
- therock-nightly-pytorch-sccache
- therock-prerelease-pytorch-sccache
Attached the dev role policy screenshot. Do we need any changes here?
    except Exception as e:
        print(f"ERROR: sccache setup failed: {e}")
        print("Falling back to ccache for host code compilation...")
        args.use_sccache = False
        args.use_ccache = True
        env["CMAKE_C_COMPILER_LAUNCHER"] = "ccache"
        env["CMAKE_CXX_COMPILER_LAUNCHER"] = "ccache"
        try:
            run_command(["ccache", "--zero-stats"], cwd=tempfile.gettempdir())
        except Exception as ccache_error:
            print(f"WARNING: ccache fallback also failed: {ccache_error}")
            print("Continuing without compiler caching...")
            args.use_ccache = False
The diffs in this file are difficult to review due to the changes to indentation to accommodate more exception handling. It might help to first pull some of these sections into functions in one PR/commit and then have another PR/commit wrap with sccache setup.
    except Exception as e:
        print(f"ERROR: sccache setup failed: {e}")
        print("Falling back to ccache for host code compilation...")
        args.use_sccache = False
        args.use_ccache = True
I'd rather we respect the user's choice here and hard fail instead of falling back to something the user didn't request.
- If `--use-sccache` is set but sccache couldn't be set up for some reason, fail.
- Same for `--use-ccache`.
Can we proceed without any cache settings if we don't find sccache, instead of falling back to ccache?
Always prefer visible errors - fail fast: https://github.com/ROCm/TheRock/blob/main/docs/development/style_guides/python_style_guide.md#fail-fast-behavior. We don't want to discover that we've been running for months without a functional cache due to an environment configuration issue that trips the fallback path.
Done. Changed to hard fail - now raises RuntimeError.
To address Akash's comment about building without a cache, I introduced a `cache_type` input for both the Linux and Windows workflows that specifies the compiler cache type (sccache, ccache, or none).
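The fail-fast behavior agreed on here might look roughly like the sketch below. The function name `compiler_launcher_env` and the injectable `which` parameter are hypothetical; the actual `build_prod_wheels.py` logic is not shown in this thread.

```python
import shutil


def compiler_launcher_env(cache_type: str, which=shutil.which) -> dict[str, str]:
    """Return CMake launcher settings for the requested cache type.

    Raises instead of silently falling back when the requested tool is missing.
    """
    if cache_type == "none":
        # Explicit opt-out: build without any compiler cache.
        return {}
    if cache_type not in ("sccache", "ccache"):
        raise ValueError(f"Unsupported cache type: {cache_type!r}")
    path = which(cache_type)
    if path is None:
        # Respect the user's choice: a missing tool is an error, not a fallback.
        raise RuntimeError(f"{cache_type} was requested but was not found on PATH")
    return {
        "CMAKE_C_COMPILER_LAUNCHER": path,
        "CMAKE_CXX_COMPILER_LAUNCHER": path,
    }
```

Building without a cache then has to be an explicit `none` choice, so a broken environment surfaces as a failed job rather than months of uncached builds.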
     build_p.add_argument(
         "--use-ccache",
         action=argparse.BooleanOptionalAction,
    -    help="Use ccache as the compiler launcher",
    +    help="Use ccache as the compiler launcher (for host code only)",
     )
     build_p.add_argument(
         "--use-sccache",
         action=argparse.BooleanOptionalAction,
         help="Use sccache with ROCm compiler wrapping (comprehensive caching for HIP code)",
     )
Let's make --use-ccache and --use-sccache mutually exclusive.
https://docs.python.org/3/library/argparse.html#mutual-exclusion
Done. Updated to use argparse.add_mutually_exclusive_group()
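For reference, a small self-contained sketch of the mutual-exclusion pattern from the argparse documentation, applied to the two flags discussed here. The parser name and help strings are illustrative only and are not copied from the merged change.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create a parser where --use-ccache and --use-sccache cannot be combined."""
    parser = argparse.ArgumentParser(prog="build_prod_wheels_sketch")
    cache_group = parser.add_mutually_exclusive_group()
    cache_group.add_argument(
        "--use-ccache",
        action=argparse.BooleanOptionalAction,
        help="Use ccache as the compiler launcher",
    )
    cache_group.add_argument(
        "--use-sccache",
        action=argparse.BooleanOptionalAction,
        help="Use sccache as the compiler launcher",
    )
    return parser
```

Passing both flags makes argparse print a usage error and exit, so the conflict is caught before any build work starts.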
    def main():
        parser = argparse.ArgumentParser(
            description="Setup sccache to wrap ROCm compilers for PyTorch builds"
        )
I don't see any references to "torch" in this file outside of comments. Can we either
- Use scripts provided by pytorch itself
- Move this to `build_tools/` and share with multiple project builds. We can model the file after https://github.com/ROCm/TheRock/blob/main/build_tools/setup_ccache.py
    - name: Report sccache stats
      if: ${{ !cancelled() }}
      run: |
        echo "sccache Stats:"
        echo "--------------"
        sccache --show-stats || true
This is okay for now, but relating to my other comment about making the sccache setup script more generic (and not specific to pytorch), we have a common pattern for "setup cache" and "report cache stats".
See how build_tools/health_status.py is run here:
TheRock/.github/workflows/build_portable_linux_artifacts.yml
Lines 95 to 106 in 9e9c726
We could add sccache to the env check, like
TheRock/build_tools/hack/env_check/check_tools.py
Lines 328 to 332 in 9e9c726
TheRock/build_tools/hack/env_check/find_tools.py
Lines 191 to 195 in 9e9c726
TheRock/build_tools/hack/env_check/device.py
Lines 421 to 450 in 9e9c726
(quite a lot of boilerplate that way though...)
Then, on the post-build side of the workflows, we have this code now that could be moved to a similar script:
TheRock/.github/workflows/build_portable_linux_artifacts.yml
Lines 132 to 154 in 9e9c726
Agreed. Keeping the inline approach for now to limit scope. Created #3189 to track adding sccache to env_check tooling and unifying cache stats reporting as a follow-up.
    def install_sccache() -> Path:
        """Install sccache if not available."""
        sccache_path = find_sccache()
        if sccache_path:
            print(f"Found sccache at: {sccache_path}")
            return sccache_path

        print("sccache not found, attempting to install...")

        if is_windows:
            # Try cargo install
            try:
                subprocess.check_call(["cargo", "install", "sccache"])
                sccache_path = Path.home() / ".cargo" / "bin" / "sccache.exe"
                if sccache_path.exists():
                    return sccache_path
            except (subprocess.CalledProcessError, FileNotFoundError):
                pass

            raise RuntimeError(
                "Could not install sccache. Please install it manually:\n"
                " choco install sccache\n"
                " or: cargo install sccache"
            )
        else:
            # Try pip install (sccache is available on PyPI)
            try:
                subprocess.check_call([sys.executable, "-m", "pip", "install", "sccache"])
                sccache_path = find_sccache()
                if sccache_path:
                    return sccache_path
            except subprocess.CalledProcessError:
                pass

            # Try cargo install as fallback
            try:
                subprocess.check_call(["cargo", "install", "sccache"])
                sccache_path = Path.home() / ".cargo" / "bin" / "sccache"
                if sccache_path.exists():
                    return sccache_path
            except (subprocess.CalledProcessError, FileNotFoundError):
                pass

            raise RuntimeError(
                "Could not install sccache. Please install it manually:\n"
                " pip install sccache\n"
                " or: cargo install sccache"
            )
I don't think this script should do any installing on its own. Our other scripts don't do that, and we should have
- Predictable tool installs in our base build environments
- Scripts that fail if the environment is not configured as expected
Done. Removed the install_sccache() function. The script now:
- Uses `find_sccache()` to locate the binary
- Fails with `RuntimeError` if sccache is not found
sccache is now pre-installed via:
- Linux: Docker image (install_sccache.sh)
- Windows: choco install sccache in workflow
Only a 17% improvement of build time on Windows is interesting 🤔
In my local builds back in August I was able to get from 40-60 minutes down to 6 minutes with ccache.
What I tried:
- CMAKE_HIP_COMPILER_LAUNCHER=sccache → "Compiler not supported" error
- HIP_CLANG_LAUNCHER=sccache → No improvement
- Wrapper scripts (like Linux) → Doesn't work on Windows due to toolchain differences
Do you remember the ccache configuration from August? Specifically:
- Any special environment variables or flags?
- Local cache or remote storage?
Local cache with just the --use-ccache option to this script, no extra tuning or settings. I didn't run detailed experiments at the time, but I posted as a footnote on pytorch/pytorch#159520 (comment)
By the way, on my machine with ccache, through those build scripts I'm seeing about 40-60 minutes for a cold cache build, 6 minutes on a clean build with 95.80% cache hits, and 1 minute on a rebuild (existing build directory + warm cache).
- Simplify stats output: just use sccache --show-stats
- Make --use-ccache and --use-sccache mutually exclusive
- Remove parse_sccache_stats and print_sccache_stats function
… specify the compiler cache type (sccache, ccache, or none).
…Torch wheels to a user-specific version
8aa2314 to
9c95c75
… to include release type
…ndows to improve consistency
…Torch wheels to include sccache and Add TODO for SHA pinning after merge
…Torch wheels to a specific SHA and refine TODO for future updates
    -   image: ghcr.io/rocm/therock_build_manylinux_x86_64@sha256:db2b63f938941dde2abc80b734e64b45b9995a282896d513a0f3525d4591d6cb
    +   # TODO(follow-up PR): Update SHA to main image after Dockerfile changes merge
    +   image: ghcr.io/rocm/therock_build_manylinux_x86_64@sha256:6e7d49caefd37cdda93487bafde973a683f372d517ca7e5bbb4232ebdcfaca30
Sequence these changes to the dockerfile as their own PRs, following these instructions: https://github.com/ROCm/TheRock/tree/main/dockerfiles#updating-images-used-by-github-actions-workflows
(I only have time to review PRs that are "ready", and this can't be ready by design - it could be marked as draft until the sequence of changes lands)
(see my other comment) Move the Dockerfile changes to their own PR and land them first
…3303)

## Motivation

Progress on #3291.

This adds a new `build_portable_linux_pytorch_wheels_ci.yml` workflow forked from [`build_portable_linux_pytorch_wheels.yml`](https://github.com/ROCm/TheRock/blob/main/.github/workflows/build_portable_linux_pytorch_wheels.yml). This new workflow is run as part of our CI pipeline and will help catch when changes to ROCm break PyTorch source builds. Future work will expand this to also build other packages, upload the built packages to S3, and run tests.

This workflow code would have caught the build break reported at #3042.

## Technical Details

> [!NOTE]
> See #3291 and https://github.com/ScottTodd/claude-rocm-workspace/blob/main/tasks/active/pytorch-ci.md for other design considerations.

I'm starting with a narrow scope here to provide _some_ value without blowing our budget or delaying while we refactor related workflows and infrastructure code (e.g. moving index page generation server-side, generating commit manifests at the _start_ of workflows instead of computing them after the fact and plumbing them through partway through the jobs)

Specifics:

* Linux only (as a start)
* Non-configurable, always runs (as a start)
* Included for all GPU architectures where `expect_pytorch_failure` is not set
* Python 3.12 (not full matrix)
* PyTorch release/2.10 branch (not full matrix)
* Only builds 'torch', not 'torchaudio', 'torchvision', 'triton', or other packages
* Does not upload packages yet
* Does not run tests yet (beyond package sanity checks that `import torch` works on the build machine)

The build jobs add about 30 minutes of CI time per GPU architecture, and we are not currently using ccache or sccache (#3171 will change that)

## Test Plan

* Tested on a known-broken commit (4497f66)
  * https://github.com/ROCm/TheRock/actions/runs/21768200125/job/62810358116 (failed as expected)
* Test on a known-working commit (a001047)
  * https://github.com/ROCm/TheRock/actions/runs/21768071862/job/62813030260 (passed as expected)
* CI jobs on this PR itself, e.g. https://github.com/ROCm/TheRock/actions/runs/21846117572/job/63050058601?pr=3303

```
[41](https://github.com/ROCm/TheRock/actions/runs/21846117572/job/63049474316?pr=3303#step:11:78642) Found built wheel: /__w/TheRock/TheRock/external-builds/pytorch/pytorch/dist/torch-2.10.0+devrocm7.12.0.dev0.09ac57fcd4e7258046fff2824dc0614384cb1c85-cp312-cp312-linux_x86_64.whl
++ Copy /__w/TheRock/TheRock/external-builds/pytorch/pytorch/dist/torch-2.10.0+devrocm7.12.0.dev0.09ac57fcd4e7258046fff2824dc0614384cb1c85-cp312-cp312-linux_x86_64.whl -> /home/runner/_work/TheRock/TheRock/output/packages/dist
+++ Installing built torch:
++ Exec [/tmp]$ /opt/python/cp312-cp312/bin/python -m pip install /__w/TheRock/TheRock/external-builds/pytorch/pytorch/dist/torch-2.10.0+devrocm7.12.0.dev0.09ac57fcd4e7258046fff2824dc0614384cb1c85-cp312-cp312-linux_x86_64.whl
+++ Sanity checking installed torch (unavailable is okay on CPU machines):
++ Capture [/tmp]$ /opt/python/cp312-cp312/bin/python -c 'import torch; print(torch.cuda.is_available())'
Sanity check output: False
--- Not build pytorch-audio (no --pytorch-audio-dir)
--- Not build pytorch-vision (no --pytorch-vision-dir)
--- Not build apex (no --apex-dir)
--- Builds all completed
```

```
Valid wheel: torch-2.10.0+devrocm7.12.0.dev0.09ac57fcd4e7258046fff2824dc0614384cb1c85-cp312-cp312-linux_x86_64.whl (222812153 bytes)
```

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

---------

Co-authored-by: Claude <noreply@anthropic.com>
## Motivation

Preparatory refactor for sccache integration ([PR #3171](#3171 (comment))). Addresses [reviewer feedback](#3171 (comment)) on `build_prod_wheels.py` being hard to review due to a single large `do_build()` function.

## Technical Details

- Extract core build steps (env setup, Triton, PyTorch, Audio, Vision, Apex, ccache stats) from `do_build()` into new `_do_build_wheels_core()` helper.
- `do_build()` now handles only setup/orchestration and delegates to the helper.
- Replace two redundant `get_rocm_path("root")` calls with the `rocm_dir` parameter.
- **Pure refactor**: no new args, no sccache logic, no behavioral changes.

## Test Result

No functional changes: refactored code follows the same execution path as before.

- https://github.com/ROCm/TheRock/actions/runs/21945223080

After dedicated `_setup_common_build_env()` function:

- https://github.com/ROCm/TheRock/actions/runs/22062404175

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Motivation

Add sccache support to PyTorch wheel builds for S3-backed distributed caching. Script placed in `build_tools/` per [reviewer feedback](#3171 (comment)), modeled after `build_tools/setup_ccache.py`.

Part of sccache PR sequence: [#3369](#3369) → [#3389](#3389) → **this** → workflow wiring.

## Technical Details

- **New: `build_tools/setup_sccache_rocm.py`**: generic sccache ROCm helper (CLI + importable):
  - `find_sccache()`: locate binary; hard fail if missing
  - `setup_rocm_sccache()`: wrap clang/clang++ with sccache stubs (Linux only)
  - `restore_rocm_compilers()`: undo wrapping
- **Modified: `external-builds/pytorch/build_prod_wheels.py`**:
  - `--use-ccache` / `--use-sccache` mutually exclusive args
  - Both hard-fail with `RuntimeError` if the requested cache tool is not found ([per review](#3171 (comment))), with no silent fallback
  - Added explicit ccache availability check (previously would fail with an unclear subprocess error)
  - sccache: wrap compilers → set CMAKE launchers → `try`/`finally` around build for guaranteed compiler restore + stats
  - Moved ccache stats reporting into `finally` block for consistent reporting on both success and failure

## Test Result

No workflow changes: sccache wired but not yet invoked by CI (next PR adds `cache_type` input + AWS config).

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
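To make the "wrap compilers, build, always restore" flow described above more concrete, here is a hedged sketch. It is not the code in `setup_sccache_rocm.py`: the `.real` suffix, the shell stub contents, and all three function names are assumptions chosen for illustration, and symlinked compilers get no special handling.

```python
import os
import shlex
import stat
from pathlib import Path

REAL_SUFFIX = ".real"  # hypothetical name for the moved-aside original


def wrap_compiler(compiler: Path, sccache: Path) -> None:
    """Move the compiler aside and leave a stub that calls it through sccache."""
    real = compiler.with_name(compiler.name + REAL_SUFFIX)
    if real.exists():
        raise RuntimeError(f"{compiler} already appears to be wrapped")
    compiler.rename(real)
    compiler.write_text(
        "#!/bin/sh\n"
        f'exec {shlex.quote(str(sccache))} {shlex.quote(str(real))} "$@"\n'
    )
    mode = compiler.stat().st_mode
    compiler.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


def restore_compiler(compiler: Path) -> None:
    """Put the original compiler back if a moved-aside copy exists."""
    real = compiler.with_name(compiler.name + REAL_SUFFIX)
    if real.exists():
        os.replace(real, compiler)


def build_with_wrapped_compilers(compilers, sccache, build) -> None:
    """Run build() with wrapped compilers, restoring them even on failure."""
    wrapped = []
    try:
        for compiler in compilers:
            wrap_compiler(compiler, sccache)
            wrapped.append(compiler)
        build()
    finally:
        # Runs on success and on failure, so the SDK is never left wrapped.
        for compiler in wrapped:
            restore_compiler(compiler)
```

The `finally` block is the important part: a failed build must not leave the ROCm SDK pointing at stubs, and the same block is a natural place for stats reporting.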
## Summary

Adds `sccache` with S3 remote storage to all four PyTorch wheel build workflows, significantly reducing build times through distributed compiler caching.

**PR sequence:** #3369 → #3306 → #3389 → #3482 → **this** → #3189 ([based on Reviewer's Feedback](#3171 (comment)))

## How It Works

| | Linux | Windows |
|---|---|---|
| **Host C/C++** | CMake compiler launchers | CMake compiler launchers |
| **HIP device code** | Wraps ROCm `clang`/`clang++` with sccache | Not supported |
| **Cleanup** | Restores original compilers via try/finally | N/A |

Cache is stored in the `therock-<workflow>-pytorch-sccache` S3 bucket, keyed by `<os>/<arch>/` prefix.

## S3 Cache Configuration

Each workflow uses a dedicated S3 bucket and IAM role, keyed by `<os>/<arch>/` prefix:

| Workflow | S3 Bucket | IAM Role |
|----------|-----------|----------|
| Linux CI | `therock-ci-pytorch-sccache` | `therock-ci` |
| Windows CI | `therock-ci-pytorch-sccache` | `therock-ci` |
| Linux Release | `therock-{release_type}-pytorch-sccache` | `therock-{release_type}` |
| Windows Release | `therock-{release_type}-pytorch-sccache` | `therock-{release_type}` |

Where `release_type` is one of: `dev`, `nightly`, `prerelease`.

## Impact

| Platform | Cold → Warm | Improvement |
|----------|------------|-------------|
| Linux | ~70m → ~37m | **~49%** |
| Windows | ~42m → ~26m | **~38%** |

Windows is lower: sccache cannot wrap HIP device compilation on Windows, only host C/C++ via CMAKE launchers.

## Tests

### Linux:

- [Linux (Cache Population)](https://github.com/ROCm/TheRock/actions/runs/22226347964/job/64293924748) - 70 mins
- [Linux (Cache Hit)](https://github.com/ROCm/TheRock/actions/runs/22231743387/job/64312966557) - 37 mins

### Windows:

- [Windows (Cache Population)](https://github.com/ROCm/TheRock/actions/runs/22219252671/job/64280583887) - 42 mins
- [Windows (Cache Hit)](https://github.com/ROCm/TheRock/actions/runs/22223608689/job/64284721704) - 26 mins

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

> Forks: S3 caching is only active for ROCm-owned runs. Fork users can set cache_type to ccache or none, or leave the default; sccache will work locally without S3 access.

---------
Summary
Adds `sccache` with AWS S3 remote storage for PyTorch wheel builds, significantly reducing build times through distributed compiler caching.
Key Features
- `therock-pytorch-sccache` bucket
- `linux/<arch>/` and `windows/<arch>/` prefixes
- `clang`/`clang++` in ROCm SDK with sccache for HIP compilation caching
- `CMAKE_C_COMPILER_LAUNCHER` and `CMAKE_CXX_COMPILER_LAUNCHER` for host code caching
How It Works
Linux
- `sccache` binary
- `clang`/`clang++` with sccache wrapper scripts
Windows
- `sccache.exe`
Configuration
Environment variables (set in workflow):
- `SCCACHE_BUCKET`: S3 bucket name
- `SCCACHE_REGION`: AWS region
- `SCCACHE_S3_KEY_PREFIX`: Cache key prefix (os/arch)
- `SCCACHE_S3_SERVER_SIDE_ENCRYPTION`: Enabled
- `SCCACHE_LOG`: Set to `warn` for error/warning visibility
Files Changed
- `.github/workflows/build_portable_linux_pytorch_wheels.yml` - Linux workflow with sccache config
- `.github/workflows/build_windows_pytorch_wheels.yml` - Windows workflow with sccache config
- `external-builds/pytorch/build_prod_wheels.py` - Build script with sccache integration
- `external-builds/pytorch/setup_sccache_rocm.py` - New module for sccache setup and compiler wrapping
Testing
Known Limitations
https://github.com/ROCm/TheRock/actions/runs/21508193319
Run 1 ( Cache Population )
Run 2 ( Cache Hit )
Linux PyTorch Build Times
Linux Average: ~48% improvement
Windows PyTorch Build Times
Windows Average: ~17% improvement (release builds only)
Summary: Build Time Improvements
Times vary based on cache hit rate and code changes
Submission Checklist