Use 128GB runners for gfx1151 pytorch CI on Windows by zichguan-amd · Pull Request #4613 · ROCm/TheRock

zichguan-amd · 2026-04-16T18:50:10Z

Motivation

Fixes #3724. Due to non-power-of-2 memory carveout issue, run pytorch CI only on 128GB runners. 64GB runners fail way too many tests with the driver issue. The driver issue should be fixed in the April Adrenalin release, this PR fixes the CI until we update the runners with new drivers.

Technical Details

Add pytorch-ci-test-runs-on field in amdgpu_family_matrix.py for gfx1151 Windows only.

In configure_target_run.py, add checks for gfx1151 target running any workflows containing pytorch_wheels in the name to use the new pytorch-ci-test-runs-on field instead. Other workflows and target archs should still use the test-runs-on field.

Test Plan

Release workflow: https://github.com/ROCm/TheRock/actions/runs/24581100431 fails with other errors
Successful run: https://github.com/ROCm/TheRock/actions/runs/24679533598/job/72189231710

Test Result

CI is still flaky for gfx1151 on Windows.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Signed-off-by: zichguan-amd <zichuan.guan@amd.com>

Copilot

Pull request overview

Updates GitHub Actions runner selection so Windows gfx1151 PyTorch wheel CI uses 128GB runners to avoid OOM-related flakiness stemming from a known driver memory carveout issue (#3724).

Changes:

Added a pytorch-ci-test-runs-on runner label override for Windows gfx1151 in the AMDGPU family matrix.
Updated configure_target_run.get_runner_label() to select the PyTorch-specific runner only for gfx1151/Windows when running from a *pytorch_wheels* workflow.
Added a unit test covering the PyTorch-workflow runner override behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
build_tools/github_actions/tests/configure_target_run_test.py	Adds test coverage for PyTorch-workflow-specific runner selection on Windows gfx1151.
build_tools/github_actions/configure_target_run.py	Adds workflow detection and conditional runner override logic for Windows gfx1151 PyTorch wheel workflows.
build_tools/github_actions/amdgpu_family_matrix.py	Introduces `pytorch-ci-test-runs-on` runner label override for Windows gfx1151.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

HereThereBeDragons · 2026-04-23T16:20:48Z

-        test_runs_on_machine = platform_for_key.get("test-runs-on")
+        # `pytorch-ci-test-runs-on` is used only for Windows gfx1151 when the workflow
+        # is a `*pytorch_wheels*.yml` job; all other families use `test-runs-on`.
+        use_pytorch_ci_windows_gfx1151 = (


this is too restrictive. the label pytorch-ci-test-runs-on should work on any platform and arch when some pytorch workflow is run

Currently only gfx1151 on Windows faces system config issue, I could generalize this to all targets, so we don't need to check for platform or arch. Tho I'm not sure if we want to commit to have the extra label for the long run.

The check now is platform & target agnostic. Use dedicated label as long as there is a project-specific runs-on label and the workflow asks for it, otherwise falls back to default runs-on label so we don't need to add the pytorch specific label to all arches

thanks. code wise it is ok for me now

ScottTodd · 2026-04-27T18:35:57Z

+def is_pytorch_wheel_workflow() -> bool:
+    """True when this process runs from a *pytorch_wheels*.yml GitHub Actions workflow.
+
+    Matches the workflow file path in ``GITHUB_WORKFLOW_REF`` (stable)
+    """
+    ref = os.getenv("GITHUB_WORKFLOW_REF", "")
+    return "pytorch_wheels" in ref.replace("\\", "/").lower()


Checking the workflow ref against a fixed name like "pytorch_wheels" is too brittle. We could change the workflow name to "pytorch_packages" and this would break silently.

How about adding an argument to this script (or an environment variable) to make it explicit? This code would change:

TheRock/.github/workflows/build_windows_pytorch_wheels.yml

Lines 365 to 370 in 407bf1b

- name: Generating target to run

id: configure

env:

TARGET: ${{ inputs.amdgpu_family }}

PLATFORM: "windows"

run: python ./build_tools/github_actions/configure_target_run.py

- name: Generating target to run id: configure env: TARGET: ${{ inputs.amdgpu_family }} PLATFORM: "windows" + TEST_PROJECT_NAME: "pytorch" run: python ./build_tools/github_actions/configure_target_run.py

- name: Generating target to run id: configure env: TARGET: ${{ inputs.amdgpu_family }} PLATFORM: "windows" - run: python ./build_tools/github_actions/configure_target_run.py + run: python ./build_tools/github_actions/configure_target_run.py \ + --test-project-name=pytorch

Added command line argument

ScottTodd · 2026-04-27T18:39:16Z

            "test-runs-on": "windows-gfx1151-gpu-rocm",
+            "pytorch-ci-test-runs-on": "windows-strix-halo-gpu-rocm-128gb",


@geomin12 / @amd-shiraz / @amd-justchen

What's our spread of test runners for this windows-gfx1151-gpu-rocm label?

We should either:

Have all machines using the same specific runner label have the same system configuration

Have all workflows requesting the same generic runner label pass tests

RIght now windows-gfx1151-gpu-rocm seems to include runners with multiple different system configurations and the tests are not passing.

I think they are still including all of the machines from 16, 32, 64, 128gb of total RAM. There was a point where I started adding runner labels for minimum amount of RAM for tests to select. Plumbing needs to be in place for that though, @geomin12 thoughts?

From my experience there are 128gb models and 64gb models, all configured to the maximum carveout sizes (96gb and 48gb iirc).

any update here? does the runner label exist?
code wise the pr looks good now. but do not know what the runner status is

The label windows-strix-halo-gpu-rocm-128gb exists and currently has 6 runners. Do we want something else?

@geomin12 / @amd-shiraz / @amd-justchen
any opinion? from my side it looks good to merge.

Signed-off-by: zichguan-amd <zichuan.guan@amd.com>

github-project-automation Bot added this to TheRock Triage Apr 16, 2026

github-project-automation Bot moved this to TODO in TheRock Triage Apr 16, 2026

zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from 5cca2db to d98d5d7 Compare April 16, 2026 21:25

zichguan-amd changed the title ~~Use 128GB runners for gfx1151 on Windows~~ Use 128GB runners for gfx1151 pytorch CI on Windows Apr 16, 2026

zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from d98d5d7 to cb396d3 Compare April 16, 2026 21:32

Use 128GB runners for gfx1151 pytorch CI on Windows

372864b

Signed-off-by: zichguan-amd <zichuan.guan@amd.com>

zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from cb396d3 to 372864b Compare April 16, 2026 21:34

zichguan-amd marked this pull request as ready for review April 21, 2026 18:31

zichguan-amd requested review from HereThereBeDragons, ScottTodd and Copilot and removed request for HereThereBeDragons and Copilot April 23, 2026 15:38

Copilot started reviewing on behalf of zichguan-amd April 23, 2026 15:40 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Comment thread build_tools/github_actions/configure_target_run.py Outdated

Comment thread build_tools/github_actions/amdgpu_family_matrix.py

Comment thread build_tools/github_actions/tests/configure_target_run_test.py

HereThereBeDragons reviewed Apr 23, 2026

View reviewed changes

ScottTodd reviewed Apr 27, 2026

View reviewed changes

zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from 50f8cec to f642345 Compare April 29, 2026 18:23

Use workflow provided argument to determine runner label

14971ff

Signed-off-by: zichguan-amd <zichuan.guan@amd.com>

zichguan-amd force-pushed the users/zichguan/pytorch-carveout branch from f642345 to 14971ff Compare April 29, 2026 18:45

HereThereBeDragons mentioned this pull request Apr 29, 2026

Refactor new_amdgpu_family_matrix to use data classes instead of dictionary #3653

Open

	- name: Generating target to run
	id: configure
	env:
	TARGET: ${{ inputs.amdgpu_family }}
	PLATFORM: "windows"
	run: python ./build_tools/github_actions/configure_target_run.py

		"test-runs-on": "windows-gfx1151-gpu-rocm",
		"pytorch-ci-test-runs-on": "windows-strix-halo-gpu-rocm-128gb",

Conversation

zichguan-amd commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amd-justchen Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

zichguan-amd commented Apr 16, 2026 •

edited

Loading

amd-justchen Apr 29, 2026 •

edited

Loading