Use 128GB runners for gfx1151 pytorch CI on Windows#4613
Conversation
5cca2db to
d98d5d7
Compare
d98d5d7 to
cb396d3
Compare
Signed-off-by: zichguan-amd <zichuan.guan@amd.com>
cb396d3 to
372864b
Compare
There was a problem hiding this comment.
Pull request overview
Updates GitHub Actions runner selection so Windows gfx1151 PyTorch wheel CI uses 128GB runners to avoid OOM-related flakiness stemming from a known driver memory carveout issue (#3724).
Changes:
- Added a
pytorch-ci-test-runs-onrunner label override for Windowsgfx1151in the AMDGPU family matrix. - Updated
configure_target_run.get_runner_label()to select the PyTorch-specific runner only forgfx1151/Windows when running from a*pytorch_wheels*workflow. - Added a unit test covering the PyTorch-workflow runner override behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| build_tools/github_actions/tests/configure_target_run_test.py | Adds test coverage for PyTorch-workflow-specific runner selection on Windows gfx1151. |
| build_tools/github_actions/configure_target_run.py | Adds workflow detection and conditional runner override logic for Windows gfx1151 PyTorch wheel workflows. |
| build_tools/github_actions/amdgpu_family_matrix.py | Introduces pytorch-ci-test-runs-on runner label override for Windows gfx1151. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| test_runs_on_machine = platform_for_key.get("test-runs-on") | ||
| # `pytorch-ci-test-runs-on` is used only for Windows gfx1151 when the workflow | ||
| # is a `*pytorch_wheels*.yml` job; all other families use `test-runs-on`. | ||
| use_pytorch_ci_windows_gfx1151 = ( |
There was a problem hiding this comment.
this is too restrictive. the label pytorch-ci-test-runs-on should work on any platform and arch when some pytorch workflow is run
There was a problem hiding this comment.
Currently only gfx1151 on Windows faces system config issue, I could generalize this to all targets, so we don't need to check for platform or arch. Tho I'm not sure if we want to commit to have the extra label for the long run.
There was a problem hiding this comment.
The check now is platform & target agnostic. Use dedicated label as long as there is a project-specific runs-on label and the workflow asks for it, otherwise falls back to default runs-on label so we don't need to add the pytorch specific label to all arches
There was a problem hiding this comment.
thanks. code wise it is ok for me now
| def is_pytorch_wheel_workflow() -> bool: | ||
| """True when this process runs from a *pytorch_wheels*.yml GitHub Actions workflow. | ||
|
|
||
| Matches the workflow file path in ``GITHUB_WORKFLOW_REF`` (stable) | ||
| """ | ||
| ref = os.getenv("GITHUB_WORKFLOW_REF", "") | ||
| return "pytorch_wheels" in ref.replace("\\", "/").lower() |
There was a problem hiding this comment.
Checking the workflow ref against a fixed name like "pytorch_wheels" is too brittle. We could change the workflow name to "pytorch_packages" and this would break silently.
How about adding an argument to this script (or an environment variable) to make it explicit? This code would change:
TheRock/.github/workflows/build_windows_pytorch_wheels.yml
Lines 365 to 370 in 407bf1b
- name: Generating target to run
id: configure
env:
TARGET: ${{ inputs.amdgpu_family }}
PLATFORM: "windows"
+ TEST_PROJECT_NAME: "pytorch"
run: python ./build_tools/github_actions/configure_target_run.py - name: Generating target to run
id: configure
env:
TARGET: ${{ inputs.amdgpu_family }}
PLATFORM: "windows"
- run: python ./build_tools/github_actions/configure_target_run.py
+ run: python ./build_tools/github_actions/configure_target_run.py \
+ --test-project-name=pytorchThere was a problem hiding this comment.
Added command line argument
| "test-runs-on": "windows-gfx1151-gpu-rocm", | ||
| "pytorch-ci-test-runs-on": "windows-strix-halo-gpu-rocm-128gb", |
There was a problem hiding this comment.
@geomin12 / @amd-shiraz / @amd-justchen
What's our spread of test runners for this windows-gfx1151-gpu-rocm label?
We should either:
- Have all machines using the same specific runner label have the same system configuration
- Have all workflows requesting the same generic runner label pass tests
RIght now windows-gfx1151-gpu-rocm seems to include runners with multiple different system configurations and the tests are not passing.
There was a problem hiding this comment.
I think they are still including all of the machines from 16, 32, 64, 128gb of total RAM. There was a point where I started adding runner labels for minimum amount of RAM for tests to select. Plumbing needs to be in place for that though, @geomin12 thoughts?
There was a problem hiding this comment.
From my experience there are 128gb models and 64gb models, all configured to the maximum carveout sizes (96gb and 48gb iirc).
There was a problem hiding this comment.
any update here? does the runner label exist?
code wise the pr looks good now. but do not know what the runner status is
There was a problem hiding this comment.
The label windows-strix-halo-gpu-rocm-128gb exists and currently has 6 runners. Do we want something else?
There was a problem hiding this comment.
@geomin12 / @amd-shiraz / @amd-justchen
any opinion? from my side it looks good to merge.
50f8cec to
f642345
Compare
Signed-off-by: zichguan-amd <zichuan.guan@amd.com>
f642345 to
14971ff
Compare
Motivation
Fixes #3724. Due to non-power-of-2 memory carveout issue, run pytorch CI only on 128GB runners. 64GB runners fail way too many tests with the driver issue. The driver issue should be fixed in the April Adrenalin release, this PR fixes the CI until we update the runners with new drivers.
Technical Details
Add
pytorch-ci-test-runs-onfield inamdgpu_family_matrix.pyfor gfx1151 Windows only.In
configure_target_run.py, add checks for gfx1151 target running any workflows containingpytorch_wheelsin the name to use the newpytorch-ci-test-runs-onfield instead. Other workflows and target archs should still use thetest-runs-onfield.Test Plan
Release workflow: https://github.com/ROCm/TheRock/actions/runs/24581100431 fails with other errors
Successful run: https://github.com/ROCm/TheRock/actions/runs/24679533598/job/72189231710
Test Result
CI is still flaky for gfx1151 on Windows.
Submission Checklist