[ci] Creating dual label feature for mi325 #4381
Implements weighted random selection to distribute CI load across multiple runner pools with different capacities. For gfx94x, this distributes jobs 80/20 between the primary pool (136 runners) and CCS pool (32 runners), proportional to their capacity.

Changes:
- Add `test-runs-on-alternate` and `test-runs-on-alternate-weight` fields to `amdgpu_family_matrix.py` for gfx94x configuration
- Implement random selection logic in both `configure_multi_arch_ci.py` and `configure_ci.py` (legacy single-arch pipeline)
- Add comprehensive unit tests for dual-label selection behavior
- Update documentation to explain the load balancing approach

The feature is extensible - any family can add dual-label support by setting the alternate fields without code changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
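For readers skimming the description, the weighted selection could look roughly like the sketch below. It is not the code from this PR: the function name `select_test_runner`, the label strings, and the representation of the weight as a probability in [0, 1] are all assumptions made for illustration. Only the three field names come from the description. The 0.2 weight mirrors the stated 80/20 split, which is close to the capacity ratio 32 / (136 + 32) ≈ 0.19.

```python
import random

# Hypothetical family entry. The field names follow the PR description;
# the label strings and the 0.2 weight are illustrative assumptions.
GFX94X_INFO = {
    "test-runs-on": "primary-pool-label",
    "test-runs-on-alternate": "alternate-pool-label",
    "test-runs-on-alternate-weight": 0.2,
}


def select_test_runner(info, rng=random):
    """Return the runner label for one job.

    Picks the alternate label with probability equal to the alternate
    weight; otherwise falls back to the primary label. Families without
    the alternate fields always get the primary label.
    """
    primary = info["test-runs-on"]
    alternate = info.get("test-runs-on-alternate")
    weight = info.get("test-runs-on-alternate-weight", 0.0)
    if alternate and rng.random() < weight:
        return alternate
    return primary
```

Passing a seeded `random.Random` instance as `rng` keeps the choice reproducible in tests, while production calls can rely on the module-level generator.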
amd-justchen
left a comment
LGTM if things pass! Also, maybe we can use this to test 20% of workloads on EKS vs. AKS?
I notice there are failures in the checks, but the patch was merged. Are we going to xfail these checks?
Why? This feels like a self-inflicted limitation... runners can have multiple labels and we can choose to use either a generic label or a specific label depending on the workflow / repository / etc.
Limitation from OSSCI :( since it is two clusters! I tried to get it to be the same label, but apparently it cannot be done.
Bummer... is there a feature request issue somewhere we can track it? I see a few potential (hah) issues with probabilistic label selection:
I'll open a ticket!
As we are bringing up mi325 machines from both `vultr` and `cirrascale`, unfortunately, we cannot provide the same label for both 1-gpu mi325 machines. Due to this limitation, we are adding this "alternative" runner selection until we get a full dedicated set.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
    # ---------------------------------------------------------------------------
    # Dual-label runner selection
    # ---------------------------------------------------------------------------


    class TestDualLabelRunnerSelection(unittest.TestCase):
        """Test weighted random selection of dual-label runner configurations."""

        def test_gfx94x_has_dual_label_config(self):
            """Verify gfx94x has the dual-label configuration."""
            from amdgpu_family_matrix import get_all_families_for_trigger_types
This small bit of logic doesn't need its own test class, it should just go in TestExpandBuildConfigs. I'm moving it there in #4500
Reverting #4381, as we can now use the same label across CSPs (worked with OSSCI).
As we are bringing up mi325 machines from both `vultr` and `cirrascale`, unfortunately, we cannot provide the same label for both 1-gpu mi325 machines. Due to this limitation, we are adding this "alternative" runner selection until we get a full dedicated set.