Refactor do_build in build_prod_wheels.py for reviewability #3389
    "PYTORCH_ROCM_ARCH": pytorch_rocm_arch,
    "USE_KINETO": os.environ.get("USE_KINETO", "ON" if not is_windows else "OFF"),
FYI, I don't see either of these variables referenced in https://github.com/pytorch/vision/ or https://github.com/pytorch/audio/, but including them in the env used for all builds is probably harmless.
We could try moving some of these env vars out of "common" and into do_build_pytorch around here:
TheRock/external-builds/pytorch/build_prod_wheels.py
Lines 742 to 747 in 0c6d06a
(fine to keep this PR as a pure code move, without changing behavior, then attempt that later)
Agreed — both are pytorch-only but PYTORCH_ROCM_ARCH is read from env in do_build_pytorch() for flash attention checks, so moving it needs a param change. Keeping as-is for pure code move; tracked in #3425.
    env: dict[str, str] = {
        "PYTHONUTF8": "1",  # Some build files use utf8 characters, force IO encoding
        "CMAKE_PREFIX_PATH": str(cmake_prefix),
        "ROCM_HOME": str(rocm_dir),
        "ROCM_PATH": str(rocm_dir),
The line is pretty blurry here for what env setup goes in do_build() or _do_build_wheels_core(). Maybe we could add a dedicated setup_common_build_env() function that would contain all of this code?
TheRock/external-builds/pytorch/build_prod_wheels.py
Lines 392 to 478 in 0c6d06a
Updated. Extracted all env construction into a dedicated _setup_common_build_env() function.
…able configuration for wheel builds.
Code changes seem fine but I want to wait until #3432 clears and we get a clean CI run before approving. See failures at https://github.com/ROCm/TheRock/actions/runs/22062019834/job/63760343692?pr=3389.
PyTorch builds are working again after 8940752. Can you sync the branch on this PR so checks run again with the latest code? (Or I can click the "update branch" button myself)
ScottTodd left a comment:
So far so good on pytorch builds at https://github.com/ROCm/TheRock/actions/runs/22112997797/job/63922273929?pr=3389
## Motivation

Preparatory refactor for sccache integration (PR #3171). Addresses reviewer feedback (#3171) on `build_prod_wheels.py` being hard to review due to a single large `do_build()` function.

## Technical Details

- Extract core build steps (env setup, Triton, PyTorch, Audio, Vision, Apex, ccache stats) from `do_build()` into new `_do_build_wheels_core()` helper.
- `do_build()` now handles only setup/orchestration and delegates to the helper.
- Replace two redundant `get_rocm_path("root")` calls with the `rocm_dir` parameter.
- **Pure refactor**: no new args, no sccache logic, no behavioral changes.

## Test Result

No functional changes; refactored code follows the same execution path as before.

- https://github.com/ROCm/TheRock/actions/runs/21945223080

After dedicated `_setup_common_build_env()` function:

- https://github.com/ROCm/TheRock/actions/runs/22062404175

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
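The split described in this PR can be pictured with a toy skeleton. This is only a sketch of the shape: the step list mirrors the one named in the description, but the signatures, the `steps_run` log, and the string step names are hypothetical stand-ins for the real build calls.

```python
from pathlib import Path


def _do_build_wheels_core(rocm_dir: Path, steps_run: list[str]) -> None:
    """Run the core build steps in order (stand-ins that only record the step)."""
    steps_run.append(f"set up common build env for {rocm_dir}")
    for project in ("triton", "pytorch", "audio", "vision", "apex"):
        steps_run.append(f"build {project}")
    steps_run.append("report ccache stats")


def do_build(rocm_root: str) -> list[str]:
    """Handle setup/orchestration only, then delegate to the core helper."""
    # Resolve the ROCm directory once and pass it down as a parameter,
    # rather than looking it up again inside the build steps.
    rocm_dir = Path(rocm_root)
    steps_run: list[str] = []
    _do_build_wheels_core(rocm_dir, steps_run)
    return steps_run
```

Passing `rocm_dir` down explicitly is the same idea as replacing the redundant `get_rocm_path("root")` calls with a parameter.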
## Motivation

Add sccache support to PyTorch wheel builds for S3-backed distributed caching. Script placed in `build_tools/` per reviewer feedback (#3171), modeled after `build_tools/setup_ccache.py`.

Part of sccache PR sequence: #3369 → #3389 → **this** → workflow wiring.

## Technical Details

- **New: `build_tools/setup_sccache_rocm.py`**: generic sccache ROCm helper (CLI + importable):
  - `find_sccache()`: locate binary; hard fail if missing
  - `setup_rocm_sccache()`: wrap clang/clang++ with sccache stubs (Linux only)
  - `restore_rocm_compilers()`: undo wrapping
- **Modified: `external-builds/pytorch/build_prod_wheels.py`**:
  - `--use-ccache` / `--use-sccache` mutually exclusive args
  - Both hard-fail with `RuntimeError` if the requested cache tool is not found (per review in #3171); no silent fallback
  - Added explicit ccache availability check (previously would fail with an unclear subprocess error)
  - sccache: wrap compilers → set CMAKE launchers → `try`/`finally` around build for guaranteed compiler restore + stats
  - Moved ccache stats reporting into `finally` block for consistent reporting on both success and failure

## Test Result

No workflow changes. sccache is wired but not yet invoked by CI (next PR adds `cache_type` input + AWS config).

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
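The control flow listed above (mutually exclusive flags, hard failure when the tool is missing, and `try`/`finally` around the build) might look roughly like the following. This is a sketch under assumptions: `parse_cache_args`, `require_tool`, and `run_build` are hypothetical names, the callbacks stand in for the real wrapping and stats code, and only the flag names and the `RuntimeError` behavior come from the PR description.

```python
import argparse
import shutil


def parse_cache_args(argv: list[str]) -> argparse.Namespace:
    """Parse the two cache flags; argparse rejects passing both at once."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--use-ccache", action="store_true")
    group.add_argument("--use-sccache", action="store_true")
    return parser.parse_args(argv)


def require_tool(name: str, which=shutil.which) -> str:
    """Return the tool's path, or raise instead of silently falling back."""
    path = which(name)
    if path is None:
        raise RuntimeError(f"{name} was requested but was not found on PATH")
    return path


def run_build(use_sccache: bool, build, wrap, restore, report_stats) -> None:
    """Wrap compilers if requested, then always restore and report stats."""
    if use_sccache:
        wrap()
    try:
        build()
    finally:
        # Runs on success and on failure, so wrapped compilers are never
        # left behind and stats are reported either way.
        if use_sccache:
            restore()
        report_stats()
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it is `shutil.which`.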
## Summary

Adds `sccache` with S3 remote storage to all four PyTorch wheel build workflows, significantly reducing build times through distributed compiler caching.

**PR sequence:** #3369 → #3306 → #3389 → #3482 → **this** → #3189 (based on reviewer feedback in #3171)

## How It Works

| | Linux | Windows |
|---|---|---|
| **Host C/C++** | CMake compiler launchers | CMake compiler launchers |
| **HIP device code** | Wraps ROCm `clang`/`clang++` with sccache | Not supported |
| **Cleanup** | Restores original compilers via try/finally | N/A |

Cache is stored in the `therock-<workflow>-pytorch-sccache` S3 bucket, keyed by `<os>/<arch>/` prefix.

## S3 Cache Configuration

Each workflow uses a dedicated S3 bucket and IAM role, keyed by `<os>/<arch>/` prefix:

| Workflow | S3 Bucket | IAM Role |
|----------|-----------|----------|
| Linux CI | `therock-ci-pytorch-sccache` | `therock-ci` |
| Windows CI | `therock-ci-pytorch-sccache` | `therock-ci` |
| Linux Release | `therock-{release_type}-pytorch-sccache` | `therock-{release_type}` |
| Windows Release | `therock-{release_type}-pytorch-sccache` | `therock-{release_type}` |

Where `release_type` is one of: `dev`, `nightly`, `prerelease`.

## Impact

| Platform | Cold → Warm | Improvement |
|----------|------------|-------------|
| Linux | ~70m → ~37m | **~49%** |
| Windows | ~42m → ~26m | **~38%** |

Windows is lower: sccache cannot wrap HIP device compilation on Windows, only host C/C++ via CMAKE launchers.
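The naming scheme in the S3 configuration table can be written out as a small helper. This is a sketch for illustration only: `sccache_s3_settings` is a hypothetical function, the example `os_name` and `arch` values are assumptions, and how the workflows actually pass these values to sccache is not shown in the PR text.

```python
# "ci" for the CI workflows, otherwise one of the release types from the table.
VALID_WORKFLOWS = ("ci", "dev", "nightly", "prerelease")


def sccache_s3_settings(workflow: str, os_name: str, arch: str) -> dict[str, str]:
    """Derive bucket, IAM role, and key prefix from the documented patterns."""
    if workflow not in VALID_WORKFLOWS:
        raise ValueError(f"unknown workflow {workflow!r}; expected one of {VALID_WORKFLOWS}")
    return {
        "bucket": f"therock-{workflow}-pytorch-sccache",
        "iam_role": f"therock-{workflow}",
        "key_prefix": f"{os_name}/{arch}/",
    }


# Example with assumed OS and architecture names.
print(sccache_s3_settings("nightly", "linux", "gfx942"))
```

Because Linux and Windows CI share one bucket, the `<os>/<arch>/` prefix is what keeps their cache entries apart.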
## Tests

### Linux:

- [Linux (Cache Population)](https://github.com/ROCm/TheRock/actions/runs/22226347964/job/64293924748) - 70 mins
- [Linux (Cache Hit)](https://github.com/ROCm/TheRock/actions/runs/22231743387/job/64312966557) - 37 mins

### Windows:

- [Windows (Cache Population)](https://github.com/ROCm/TheRock/actions/runs/22219252671/job/64280583887) - 42 mins
- [Windows (Cache Hit)](https://github.com/ROCm/TheRock/actions/runs/22223608689/job/64284721704) - 26 mins

## Submission Checklist

- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

> Forks: S3 caching is only active for ROCm-owned runs. Fork users can set cache_type to ccache or none, or leave the default. sccache will work locally without S3 access.
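For reference, the improvement figures can be recomputed from the job durations listed above. This is a small worked calculation, assuming the rounded minute values are the inputs. With those values Linux comes out at about 47%, slightly under the ~49% quoted in the Impact table, which presumably used unrounded durations; Windows matches at about 38%.

```python
def improvement_percent(cold_minutes: float, warm_minutes: float) -> float:
    """Share of build time saved by a warm cache, relative to a cold build."""
    return (cold_minutes - warm_minutes) / cold_minutes * 100.0


for platform_name, cold, warm in (("Linux", 70, 37), ("Windows", 42, 26)):
    print(f"{platform_name}: {improvement_percent(cold, warm):.1f}% faster")
# Linux: 47.1% faster
# Windows: 38.1% faster
```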