Use 128GB runners for gfx1151 pytorch CI on Windows#4613

Open: zichguan-amd wants to merge 2 commits into main from users/zichguan/pytorch-carveout

zichguan-amd (Contributor) commented Apr 16, 2026

Motivation

Fixes #3724. Because of the non-power-of-2 memory carveout issue, run PyTorch CI only on 128GB runners; 64GB runners fail too many tests with this driver issue. The driver issue should be fixed in the April Adrenalin release. This PR fixes the CI until we update the runners with the new drivers.

Technical Details

Add a pytorch-ci-test-runs-on field to amdgpu_family_matrix.py, for gfx1151 on Windows only.

In configure_target_run.py, add a check so that a gfx1151 target running any workflow with pytorch_wheels in its name uses the new pytorch-ci-test-runs-on field. Other workflows and target architectures still use the test-runs-on field.

Test Plan

Release workflow: https://github.com/ROCm/TheRock/actions/runs/24581100431 fails with other errors
Successful run: https://github.com/ROCm/TheRock/actions/runs/24679533598/job/72189231710

Test Result

CI is still flaky for gfx1151 on Windows.

Submission Checklist

zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from 5cca2db to d98d5d7 on April 16, 2026 21:25
zichguan-amd changed the title from "Use 128GB runners for gfx1151 on Windows" to "Use 128GB runners for gfx1151 pytorch CI on Windows" on Apr 16, 2026
zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from d98d5d7 to cb396d3 on April 16, 2026 21:32
Signed-off-by: zichguan-amd <zichuan.guan@amd.com>
zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from cb396d3 to 372864b on April 16, 2026 21:34
zichguan-amd marked this pull request as ready for review on April 21, 2026 18:31
zichguan-amd requested review from HereThereBeDragons, ScottTodd and Copilot, and removed the request for HereThereBeDragons and Copilot, on April 23, 2026 15:38
Contributor

Copilot AI left a comment

Pull request overview

Updates GitHub Actions runner selection so Windows gfx1151 PyTorch wheel CI uses 128GB runners to avoid OOM-related flakiness stemming from a known driver memory carveout issue (#3724).

Changes:

  • Added a pytorch-ci-test-runs-on runner label override for Windows gfx1151 in the AMDGPU family matrix.
  • Updated configure_target_run.get_runner_label() to select the PyTorch-specific runner only for gfx1151/Windows when running from a *pytorch_wheels* workflow.
  • Added a unit test covering the PyTorch-workflow runner override behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| build_tools/github_actions/tests/configure_target_run_test.py | Adds test coverage for PyTorch-workflow-specific runner selection on Windows gfx1151. |
| build_tools/github_actions/configure_target_run.py | Adds workflow detection and conditional runner override logic for Windows gfx1151 PyTorch wheel workflows. |
| build_tools/github_actions/amdgpu_family_matrix.py | Introduces pytorch-ci-test-runs-on runner label override for Windows gfx1151. |

Comment thread build_tools/github_actions/configure_target_run.py Outdated
Comment thread build_tools/github_actions/amdgpu_family_matrix.py
Comment thread build_tools/github_actions/tests/configure_target_run_test.py
test_runs_on_machine = platform_for_key.get("test-runs-on")
# `pytorch-ci-test-runs-on` is used only for Windows gfx1151 when the workflow
# is a `*pytorch_wheels*.yml` job; all other families use `test-runs-on`.
use_pytorch_ci_windows_gfx1151 = (
Contributor

This is too restrictive. The label pytorch-ci-test-runs-on should work on any platform and arch whenever a PyTorch workflow is run.

Contributor Author

Currently only gfx1151 on Windows faces this system config issue. I could generalize this to all targets so we don't need to check for platform or arch, though I'm not sure we want to commit to having the extra label in the long run.

Contributor Author

The check is now platform and target agnostic. It uses the dedicated label when a project-specific runs-on label exists and the workflow asks for it; otherwise it falls back to the default runs-on label, so we don't need to add the PyTorch-specific label to all arches.
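
The fallback described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function name `resolve_runner_label`, the `test_project_name` parameter, and the `<project>-ci-test-runs-on` key pattern are assumptions; only the two label values and the `test-runs-on` / `pytorch-ci-test-runs-on` keys are taken from the diff quoted in this conversation.

```python
from typing import Optional

# Hypothetical excerpt of one platform entry in the family matrix. The two
# key/value pairs are quoted in this PR; the surrounding structure is assumed.
PLATFORM_INFO = {
    "test-runs-on": "windows-gfx1151-gpu-rocm",
    "pytorch-ci-test-runs-on": "windows-strix-halo-gpu-rocm-128gb",
}


def resolve_runner_label(
    platform_info: dict, test_project_name: Optional[str] = None
) -> Optional[str]:
    """Pick the project-specific runner label if present, else the default."""
    default_label = platform_info.get("test-runs-on")
    if not test_project_name:
        # The workflow did not ask for a project-specific runner.
        return default_label
    # Assumed key pattern: "<project>-ci-test-runs-on".
    project_key = f"{test_project_name}-ci-test-runs-on"
    # Fall back to the default when the entry has no project-specific label.
    return platform_info.get(project_key) or default_label


print(resolve_runner_label(PLATFORM_INFO, "pytorch"))
print(resolve_runner_label(PLATFORM_INFO))
print(resolve_runner_label({"test-runs-on": "some-default-label"}, "pytorch"))
```

With this shape, only entries that need a dedicated runner have to carry the extra key; every other family keeps resolving to its default label without any platform or arch check.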

Contributor

Thanks. Code-wise it looks OK to me now.

Comment on lines +18 to +24
def is_pytorch_wheel_workflow() -> bool:
    """True when this process runs from a *pytorch_wheels*.yml GitHub Actions workflow.

    Matches the workflow file path in ``GITHUB_WORKFLOW_REF`` (stable)
    """
    ref = os.getenv("GITHUB_WORKFLOW_REF", "")
    return "pytorch_wheels" in ref.replace("\\", "/").lower()
Member

Checking the workflow ref against a fixed name like "pytorch_wheels" is too brittle. We could change the workflow name to "pytorch_packages" and this would break silently.

How about adding an argument to this script (or an environment variable) to make it explicit? This code would change:

      - name: Generating target to run
        id: configure
        env:
          TARGET: ${{ inputs.amdgpu_family }}
          PLATFORM: "windows"
        run: python ./build_tools/github_actions/configure_target_run.py

      - name: Generating target to run
        id: configure
        env:
          TARGET: ${{ inputs.amdgpu_family }}
          PLATFORM: "windows"
+         TEST_PROJECT_NAME: "pytorch"
        run: python ./build_tools/github_actions/configure_target_run.py

      - name: Generating target to run
        id: configure
        env:
          TARGET: ${{ inputs.amdgpu_family }}
          PLATFORM: "windows"
-       run: python ./build_tools/github_actions/configure_target_run.py
+       run: python ./build_tools/github_actions/configure_target_run.py \
+           --test-project-name=pytorch
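
The brittleness concern in this comment can be illustrated with a small sketch. It reuses the substring test from the snippet under review but takes the ref as a parameter, so it runs outside GitHub Actions. The workflow file names below are hypothetical; only the "pytorch_wheels" and "pytorch_packages" substrings come from the discussion. The ref strings follow the documented `owner/repo/.github/workflows/file.yml@ref` shape of `GITHUB_WORKFLOW_REF`.

```python
def matches_pytorch_wheels(workflow_ref: str) -> bool:
    """Same substring test as the reviewed snippet, with the ref passed in."""
    return "pytorch_wheels" in workflow_ref.replace("\\", "/").lower()


# Hypothetical workflow file names, used only for illustration.
current_ref = "ROCm/TheRock/.github/workflows/test_pytorch_wheels.yml@refs/heads/main"
renamed_ref = "ROCm/TheRock/.github/workflows/test_pytorch_packages.yml@refs/heads/main"

print(matches_pytorch_wheels(current_ref))  # True
print(matches_pytorch_wheels(renamed_ref))  # False: no error, the override is just skipped
```

After such a rename nothing fails loudly: the job would fall back to the default runner label, which is the silent breakage the comment warns about.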

Contributor Author

Added command line argument
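
A minimal sketch of what such an argument might look like, assuming `argparse` and the `--test-project-name` flag from the suggestion above. The `parse_args` helper, the help text, and the fallback to a `TEST_PROJECT_NAME` environment variable are assumptions for illustration and are not confirmed by this conversation.

```python
import argparse
import os


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the (assumed) explicit project selector for runner-label lookup."""
    parser = argparse.ArgumentParser(
        description="Configure the target run (illustrative sketch only)."
    )
    parser.add_argument(
        "--test-project-name",
        # Assumption: fall back to the environment variable from the other
        # suggested variant, so either form of the workflow change works.
        default=os.getenv("TEST_PROJECT_NAME", ""),
        help="Project whose dedicated test runner label should be used, e.g. 'pytorch'.",
    )
    return parser.parse_args(argv)


args = parse_args(["--test-project-name=pytorch"])
print(args.test_project_name)  # pytorch
```

Making the project explicit keeps the runner choice independent of workflow file names, so renaming a workflow no longer changes which runners it lands on.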

Comment on lines 124 to +125
"test-runs-on": "windows-gfx1151-gpu-rocm",
"pytorch-ci-test-runs-on": "windows-strix-halo-gpu-rocm-128gb",
Member

@geomin12 / @amd-shiraz / @amd-justchen

What's our spread of test runners for this windows-gfx1151-gpu-rocm label?

We should either:

  1. Have all machines using the same specific runner label have the same system configuration
  2. Have all workflows requesting the same generic runner label pass tests

Right now windows-gfx1151-gpu-rocm seems to include runners with multiple different system configurations and the tests are not passing.

Contributor

amd-justchen, Apr 29, 2026

I think the label still includes all of the machines, with 16, 32, 64, or 128GB of total RAM. At one point I started adding runner labels for the minimum amount of RAM, for tests to select on. Plumbing needs to be in place for that though. @geomin12, thoughts?

Contributor Author

In my experience there are 128GB models and 64GB models, all configured to their maximum carveout sizes (96GB and 48GB, IIRC).

Contributor

Any update here? Does the runner label exist?
Code-wise the PR looks good now, but I do not know what the runner status is.

Contributor Author

The label windows-strix-halo-gpu-rocm-128gb exists and currently has 6 runners. Do we want something else?

Contributor

@geomin12 / @amd-shiraz / @amd-justchen
Any opinion? From my side it looks good to merge.

zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from 50f8cec to f642345 on April 29, 2026 18:23
Signed-off-by: zichguan-amd <zichuan.guan@amd.com>
Labels

None yet

Projects

Status: TODO

Development

Successfully merging this pull request may close these issues.

[Issue] Out-of-memory (OOM) errors during PyTorch tests on ROCm with HIP on Windows - gfx1151

5 participants