Skip to content

[ci] Creating dual label feature for mi325#4381

Merged
geomin12 merged 4 commits into
mainfrom
users/geomin12/dual-label
Apr 8, 2026
Merged

[ci] Creating dual label feature for mi325#4381
geomin12 merged 4 commits into
mainfrom
users/geomin12/dual-label

Conversation

@geomin12
Copy link
Copy Markdown
Contributor

@geomin12 geomin12 commented Apr 7, 2026

As we are bringing up mi325 machines from both vultr and cirrascale, unfortunately, we cannot provide the same label for both 1-gpu mi325 machines. Due to this limitation, we are adding this "alternative" runner selection until we get a full dedicated set

geomin12 and others added 4 commits April 6, 2026 11:43
Implements weighted random selection to distribute CI load across multiple
runner pools with different capacities. For gfx94x, this distributes jobs
80/20 between the primary pool (136 runners) and CCS pool (32 runners),
proportional to their capacity.

Changes:
- Add test-runs-on-alternate and test-runs-on-alternate-weight fields to
  amdgpu_family_matrix.py for gfx94x configuration
- Implement random selection logic in both configure_multi_arch_ci.py
  and configure_ci.py (legacy single-arch pipeline)
- Add comprehensive unit tests for dual-label selection behavior
- Update documentation to explain the load balancing approach

The feature is extensible - any family can add dual-label support by
setting the alternate fields without code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@amd-justchen amd-justchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if things pass! Also.. Maybe we can also use this to test 20% of workloads on EKS vs AKS?

@geomin12 geomin12 merged commit 97bb80e into main Apr 8, 2026
137 of 147 checks passed
@geomin12 geomin12 deleted the users/geomin12/dual-label branch April 8, 2026 15:12
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Apr 8, 2026
@AmosLewis
Copy link
Copy Markdown
Contributor

I notice there are failure in the checks, but the patch merged, are we going to xfail these check?

Linux::release / Test gfx94X-dcgpu / Test rocblas / Test rocblas (shard 1/1) (gfx94X-dcgpu)
Linux::release / Test gfx94X-dcgpu / Test rccl / Test rccl (shard 1/1) (gfx94X-dcgpu)
Linux::release / Test gfx1151 / Test Sanity Check / Test sanity (shard 1/1) (gfx1151)
Linux::release / Build PyTorch | gfx94X-dcgpu / Build PyTorch | gfx94X-dcgpu | torch release/2.10 |
Linux::release / Build PyTorch | gfx110X-all / Build PyTorch | gfx110X-all | torch release/2.10 |
Linux::release / Build PyTorch | gfx1151 / Build PyTorch | gfx1151 | torch release/2.10 |
Linux::release / Build PyTorch | gfx120X-all / Build PyTorch | gfx120X-all | torch release/2.10 | 

Windows::release / Test gfx110X-all / Test hipblas / Test hipblas (shard 1/1) (gfx110X-all)
Windows::release / Test gfx1151 / Test hipblas / Test hipblas (shard 1/1) (gfx1151)

@ScottTodd
Copy link
Copy Markdown
Member

unfortunately, we cannot provide the same label for both 1-gpu mi325 machines

Why? This feels like a self-inflicted limitation... runners can have multiple labels and we can choose to use either a generic label or a specific label depending on the workflow / repository / etc.

@geomin12
Copy link
Copy Markdown
Contributor Author

geomin12 commented Apr 8, 2026

unfortunately, we cannot provide the same label for both 1-gpu mi325 machines

Why? This feels like a self-inflicted limitation... runners can have multiple labels and we can choose to use either a generic label or a specific label depending on the workflow / repository / etc.

limitation from OSSCI :( since it is two clusters! i tried to get it to be the same label but apparently it cannot be done

@ScottTodd
Copy link
Copy Markdown
Member

Bummer... is there a feature request issue somewhere we can track it?

I see a few potential (hah) issues with probabilistic label selection:

  • Re-running a workflow run will use the same label. If runners for that label are offline (e.g. already migrated away to a different label) or are saturated, that run will be stuck.
  • There is no way to explicitly choose which label to use, either on the frontend (pull requests, push events, etc.) or the backend (adding new runners to either pool won't adjust the probability in code)

@geomin12
Copy link
Copy Markdown
Contributor Author

geomin12 commented Apr 8, 2026

Bummer... is there a feature request issue somewhere we can track it?

I see a few potential (hah) issues with probabilistic label selection:

  • Re-running a workflow run will use the same label. If runners for that label are offline (e.g. already migrated away to a different label) or are saturated, that run will be stuck.
  • There is no way to explicitly choose which label to use, either on the frontend (pull requests, push events, etc.) or the backend (adding new runners to either pool won't adjust the probability in code)

i'll open a tix!

rahulc-gh pushed a commit that referenced this pull request Apr 9, 2026
As we are bringing up mi325 machines from both `vultr` and `cirrascale`,
unfortunately, we cannot provide the same label for both 1-gpu mi325
machines. Due to this limitation, we are adding this "alternative"
runner selection until we get a full dedicated set

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +900 to +910
# ---------------------------------------------------------------------------
# Dual-label runner selection
# ---------------------------------------------------------------------------


class TestDualLabelRunnerSelection(unittest.TestCase):
"""Test weighted random selection of dual-label runner configurations."""

def test_gfx94x_has_dual_label_config(self):
"""Verify gfx94x has the dual-label configuration."""
from amdgpu_family_matrix import get_all_families_for_trigger_types
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This small bit of logic doesn't need its own test class, it should just go in TestExpandBuildConfigs. I'm moving it there in #4500

geomin12 added a commit that referenced this pull request Apr 21, 2026
Reverting #4381 as we can use same label now across CSPs (worked with
OSSCI)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants