[ci] Adjust expect_failure and expect_pytorch_failure logic #4500

Draft
ScottTodd wants to merge 5 commits into ROCm:main from ScottTodd:multi-arch-expect-failure

@ScottTodd (Member) commented Apr 13, 2026

Motivation

Handling a few related issues that will complicate multi-arch release pipeline work (#3334):

1. expect_failure plumbing

expect_failure was optionally set on a build variant (e.g. release/asan/tsan). As of 3c528c0, it is no longer set anywhere.

We were passing expect_failure all the way through to multi_arch_build_portable_linux.yml, where stages are built, but it was ignored there. Unlike the job it replaces (shown below), that workflow has no single job that could be marked continue-on-error:

jobs:
  build_portable_linux_artifacts:
    name: Build (xfail ${{ inputs.expect_failure }})
    # azure-linux-scale-rocm are used for regular CI builds
    # azure-linux-scale-rocm-heavy are used for CI builds that require more resources (ex: ASAN and TSAN builds)
    runs-on: ${{ (contains(inputs.build_variant_label, 'asan') || contains(inputs.build_variant_label, 'tsan')) && 'azure-linux-scale-rocm-heavy-ramdisk' || 'azure-linux-scale-rocm' }}
    continue-on-error: ${{ inputs.expect_failure }}

Solution: stop handling expect_failure altogether. We can add it back later as needed (somehow).

2. expect_pytorch_failure plumbing

expect_pytorch_failure was being read from the build variant, but it is actually defined on a GPU family. The family entry looks like this:

"windows": {
    "test-runs-on": "",
    "family": "gfx900",
    "fetch-gfx-targets": [],
    "build_variants": ["release"],
    "expect_pytorch_failure": True,
},

while the lookup went through the variant config:

expect_pytorch_failure = variant_config.get("expect_pytorch_failure", False)

In multi-arch CI we currently build for all GPU families in a single workflow, with a matrix in build_pytorch_wheels_per_family. We could do two layers of filtering here: one for "should this workflow build pytorch at all?" and another for "does the pytorch build for this GPU family work?"

Solution: correct the configuration plumbing and add a TODO to handle this in the matrix somehow.
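
The two layers of filtering described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `FAMILY_MATRIX`, `split_pytorch_families`, `should_build_pytorch`, and the `gfx_other` family are hypothetical, and the matrix shape (family name, then platform, then config) is assumed from the quoted "windows" entry. The point it shows is that `expect_pytorch_failure` is read from the per-family entry rather than from the build variant.

```python
# Hypothetical family matrix, shaped like the "windows" entry quoted above.
# The real matrix in the repository may be organized differently.
FAMILY_MATRIX = {
    "gfx906": {
        "windows": {
            "family": "gfx906",
            "build_variants": ["release"],
            "expect_pytorch_failure": True,
        },
        "linux": {
            "family": "gfx906",
            "build_variants": ["release"],
        },
    },
    # Placeholder family name, used only for this example.
    "gfx_other": {
        "windows": {"family": "gfx_other", "build_variants": ["release"]},
        "linux": {"family": "gfx_other", "build_variants": ["release"]},
    },
}


def split_pytorch_families(matrix, platform):
    """Layer 2: per family, is the PyTorch build expected to work?"""
    buildable, expected_failures = [], []
    for platforms in matrix.values():
        family_config = platforms.get(platform)
        if family_config is None:
            continue  # This family has no entry for the platform.
        # Read the flag from the family entry, not from the build variant.
        if family_config.get("expect_pytorch_failure", False):
            expected_failures.append(family_config["family"])
        else:
            buildable.append(family_config["family"])
    return buildable, expected_failures


def should_build_pytorch(matrix, platform):
    """Layer 1: should this workflow build PyTorch at all?"""
    buildable, _ = split_pytorch_families(matrix, platform)
    return bool(buildable)


if __name__ == "__main__":
    print(split_pytorch_families(FAMILY_MATRIX, "windows"))
    print(should_build_pytorch(FAMILY_MATRIX, "windows"))
```

One option would be to generate the build_pytorch_wheels_per_family matrix from the first list only and keep the second list for reporting; whether that fits the real workflow is left open here.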

Technical Details

I included a few other loosely related / incidental changes here (could move to a separate PR on request):

  • Reordered arguments to expand_build_configs
  • Folded TestDualLabelRunnerSelection tests into TestExpandBuildConfigs where they belong (and can reuse helper functions)

Test Plan

  • Updated unit tests
  • Watch CI runs, including with a target like gfx906 that fails to build pytorch (would fail before, should still fail)

Submission Checklist

ScottTodd added a commit that referenced this pull request Apr 14, 2026
## Motivation

These inputs are unused and do not belong in build jobs. Removing them
will help with:
* #3336
* #3334

## Technical Details

These other inputs are also candidates for cleanup:

input | notes
-- | --
`expect_failure` | See #4500
`artifact_group` | Previously used here, may need to line up with `build_variant_suffix`: https://github.com/ROCm/TheRock/blob/15558f4240876c7b4eb667f20182db4e3673e4e6/.github/workflows/build_portable_linux_artifacts.yml#L172-L177
`build_variant_label` | Previously used here, may be useful later: https://github.com/ROCm/TheRock/blob/15558f4240876c7b4eb667f20182db4e3673e4e6/.github/workflows/build_portable_linux_artifacts.yml#L66 (but see also #4415)
`build_variant_suffix` | Partially handled: https://github.com/ROCm/TheRock/blob/15558f4240876c7b4eb667f20182db4e3673e4e6/build_tools/github_actions/configure_multi_arch_ci.py#L862-L869 and https://github.com/ROCm/TheRock/blob/15558f4240876c7b4eb667f20182db4e3673e4e6/.github/workflows/multi_arch_ci_linux.yml#L120-L126

## Test Plan

* CI run should include expected build/test jobs

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

@ScottTodd (Member, Author) commented

Test Plan

  • Watch CI runs, including with a target like gfx906 that fails to build pytorch (would fail before, should still fail)

Uh, it succeeded: https://github.com/ROCm/TheRock/actions/runs/24469430596/job/71558513727?pr=4500 (maybe #1927 is fixed now?)

# TODO(#1927): Resolve error generating file `torch_hip_generated_int4mm.hip.obj`, to enable PyTorch builds
"windows": {
    "test-runs-on": "",
    "family": "gfx906",
    "fetch-gfx-targets": [],
    "build_variants": ["release"],
    "expect_pytorch_failure": True,
},

This PR also has merge conflicts now. I'll probably need to split it into smaller PRs.

ScottTodd added a commit that referenced this pull request Apr 29, 2026
## Motivation

This dev release in rockrel:
https://github.com/ROCm/rockrel/actions/runs/25100364100/attempts/1 hit
errors
`Multi-Arch Release
Error when evaluating 'strategy' for job 'test_components'.
30fc752
(Line: 176, Col: 21): Error from function 'fromJSON': empty input,
30fc752
(Line: 176, Col: 21): Unexpected value ''`

## Technical Details

I think we missed this in CI workflows since multi_arch_ci_linux.yml has
an `expect_failure` condition:
https://github.com/ROCm/TheRock/blob/88425ee26eb1259292089c723432e4594e3bbb20/.github/workflows/multi_arch_ci_linux.yml#L99-L103

I did not add that condition to multi_arch_release_linux.yml:
https://github.com/ROCm/TheRock/blob/7161bc7968a7bae56be9ea4658b6261831d14d8e/.github/workflows/multi_arch_release_linux.yml#L117-L120

I'd like to remove that `expect_failure` entirely since it isn't
actually working right now in multi-arch CI, see
#4500

## Test Plan

We'll need to run a release workflow until it reaches the test step,
which takes hours. Might as well merge and test via an actual dev
release.

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Co-authored-by: Claude <noreply@anthropic.com>