Add timeouts to test_rocm_wheels.yml (#3233)

Merged: ScottTodd merged 1 commit into main from users/scotttodd/python-packages-test-timeouts on Feb 3, 2026

@ScottTodd (Member)

Motivation

Progress on #1559. When adding Python package tests to CI workflows in #3182, we noticed that this job hit the 6 hour timeout on the linux-strix-halo-gpu-rocm-5 runner: https://github.com/ROCm/TheRock/actions/runs/21533668625/job/62060067637?pr=3182.

Technical Details

I chose a 30 minute timeout for the overall job to catch slow network issues, and a 5 minute timeout for just the test step. The step timeout gives the packages enough time to initialize while still catching any hung subprocesses.

Note that we currently skip tests on Linux gfx1151 here:

  • "gfx1151": {
        "linux": {
            "test-runs-on": "linux-gfx1151-gpu-rocm",
            "test-runs-on-kernel": {
                "oem": "linux-strix-halo-gpu-rocm-oem",
            },
            "family": "gfx1151",
            "bypass_tests_for_releases": True,
            "build_variants": ["release"],
            "sanity_check_only_for_family": True,
        },
  • AMDGPU_FAMILIES = os.getenv("AMDGPU_FAMILIES")
    # TODO(#2964): Remove gfx950-dcgpu once amdsmi static does not timeout
    unsupported_amdsmi_families = ["gfx1151", "gfx950-dcgpu"]

We may want to similarly filter Linux gfx1151 from parts of rocm-sdk test.
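
For illustration only, here is a minimal sketch of what such a filter could look like. It is not part of this PR. The names `LINUX_SKIP_FAMILIES`, `should_skip_family`, and `ExamplePackageTest` are hypothetical, and the sketch assumes `AMDGPU_FAMILIES` holds either a single family name or a comma-separated list.

```python
import os
import sys
import unittest
from typing import Optional

# Hypothetical skip list for illustration; only "gfx1151" comes from the
# description above, and the real list would live wherever the project decides.
LINUX_SKIP_FAMILIES = frozenset({"gfx1151"})


def should_skip_family(amdgpu_families: Optional[str], platform: str) -> bool:
    """Return True when running on Linux and a listed family is requested.

    Assumes AMDGPU_FAMILIES holds one family name or a comma-separated list.
    """
    if not platform.startswith("linux") or not amdgpu_families:
        return False
    requested = {name.strip() for name in amdgpu_families.split(",")}
    return not requested.isdisjoint(LINUX_SKIP_FAMILIES)


@unittest.skipIf(
    should_skip_family(os.getenv("AMDGPU_FAMILIES"), sys.platform),
    "skipped on Linux gfx1151 (hypothetical filter)",
)
class ExamplePackageTest(unittest.TestCase):
    def test_placeholder(self):
        # Stand-in for a real rocm-sdk test body.
        self.assertTrue(True)
```

Keeping the decision in a small pure function means the skip condition can be checked without a GPU runner. Whether the filter belongs at the class, test, or workflow level is an open question.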

Test Plan and Results

Triggered a test run with ROCm version 7.12.0a20260202: https://github.com/ROCm/TheRock/actions/runs/21644961000

All tests passed in 1 minute, though those runs used the linux-strix-halo-gpu-rocm-7 runner rather than the linux-strix-halo-gpu-rocm-5 runner that hit the original timeout.

Submission Checklist

Prevent hung test jobs from running until the default 6-hour timeout,
as we've observed on gfx1151 Linux runners.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ScottTodd ScottTodd merged commit 3761265 into main Feb 3, 2026
10 of 12 checks passed
@ScottTodd ScottTodd deleted the users/scotttodd/python-packages-test-timeouts branch February 3, 2026 22:08
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Feb 3, 2026