# CI Test Matrix Guidelines

## Overview

This document describes the principles and practices for determining the CI test matrix configuration in `.github/workflows/` test jobs. These guidelines balance comprehensive coverage against limited GPU resources to maximize confidence in code quality while keeping CI time and cost manageable.

## Core Constraint

**Limited GPU resources are the primary constraint that forces us to keep the CI matrix limited.** GPU runners are a shared, finite resource across all RAPIDS projects. Every matrix entry that requires a GPU runner consumes time on this shared pool. We must strive for maximum coverage within the GPU resource budget.

## Matrix Dimensions

Test matrices span multiple dimensions (a sample entry combining them is sketched after this list):

- **CPU Architecture** (`ARCH`): `amd64`, `arm64`
- **CUDA Version** (`CUDA_VER`): e.g., `12.2.2`, `12.9.1`, `13.0.1`
- **Python Version** (`PY_VER`): e.g., `3.10`, `3.11`, `3.12`, `3.13`
- **GPU Architecture** (`GPU`): `l4`, `a100`, `h100`
- **Driver Version** (`DRIVER`): `earliest`, `latest`
- **Linux Distribution** (`LINUX_VER`): `rockylinux8`, `ubuntu22.04`, `ubuntu24.04`
- **Dependencies** (`DEPENDENCIES`): `oldest`, `latest`
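
The sketch below shows how a single matrix entry might combine these dimensions. The key names mirror the variables listed above; the specific values are illustrative rather than a required configuration.

```yaml
# One hypothetical test-matrix entry combining every dimension listed above.
- ARCH: 'amd64'            # CPU architecture
  CUDA_VER: '12.9.1'       # CUDA version
  PY_VER: '3.12'           # Python version
  GPU: 'l4'                # GPU architecture / runner type
  DRIVER: 'latest'         # driver version bucket
  LINUX_VER: 'ubuntu24.04' # Linux distribution
  DEPENDENCIES: 'latest'   # dependency pinning strategy
```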

## Two-Tier Matrix Strategy

### Pull Request (PR) Matrix
**Goal**: Fast feedback with focused coverage

- **Minimal size** to provide quick CI results
- Try to keep the matrix size constant when adding new versions
- e.g. if adding a new CUDA or new Python, rearrange existing jobs to maximize coverage while keeping the number of jobs constant
- Focus on (see the sketch after this list):
- Endpoints
- Latest versions of everything (newest Python, CUDA, driver, dependencies)
- Earliest supported versions of everything (oldest Python, CUDA, driver, dependencies)
- "Off-diagonal" elements
- We want to test combinations like "oldest CUDA, newest Python" and vice versa
- Use different matrices for conda and wheel CI jobs -- we often use wheel jobs to hit some of these "edge" configurations
- Broad coverage
- Both CPU architectures (`amd64` and `arm64`)
- Cover all possible GPU architectures, while respecting the relative pool sizes of the runners
- Allocate fewer CI jobs to the pools with fewer runners
- Cover oldest and latest drivers, but make sure the driver/CUDA versions are supported by the desired operating system
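
Putting these rules together, a PR matrix might look something like the hypothetical sketch below. The specific combinations are illustrative, not prescriptive; the real matrices live in the `shared-workflows` workflow files.

```yaml
# Hypothetical PR matrix: endpoints plus off-diagonal combinations,
# spread across both CPU architectures and GPU pools.
- { ARCH: 'amd64', CUDA_VER: '12.2.2', PY_VER: '3.10', GPU: 'l4', DRIVER: 'earliest', LINUX_VER: 'rockylinux8', DEPENDENCIES: 'oldest' }  # oldest everything
- { ARCH: 'arm64', CUDA_VER: '13.0.1', PY_VER: '3.13', GPU: 'l4', DRIVER: 'latest', LINUX_VER: 'ubuntu24.04', DEPENDENCIES: 'latest' }    # newest everything
- { ARCH: 'amd64', CUDA_VER: '12.2.2', PY_VER: '3.13', GPU: 'a100', DRIVER: 'latest', LINUX_VER: 'ubuntu22.04', DEPENDENCIES: 'latest' }  # off-diagonal: oldest CUDA, newest Python
- { ARCH: 'arm64', CUDA_VER: '13.0.1', PY_VER: '3.10', GPU: 'h100', DRIVER: 'latest', LINUX_VER: 'ubuntu22.04', DEPENDENCIES: 'latest' }  # off-diagonal: newest CUDA, oldest Python
```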

### Nightly Matrix
**Goal**: Comprehensive coverage across all supported configurations

- **Expanded matrix** to catch edge cases and combinations not tested in PRs
- We have a weak preference for nightlies to be a *superset* of the PR matrix, meaning the nightly matrix contains every PR entry plus additional ones
- Otherwise, the same rules as for the PR matrix apply: hit the endpoints and off-diagonal elements, and shoot for broad coverage (a short sketch follows this list)
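
As a hypothetical illustration of the superset idea, the nightly matrix repeats the PR entries unchanged and appends combinations that are too expensive to run on every PR:

```yaml
# Entries carried over verbatim from the PR matrix ...
- { ARCH: 'amd64', CUDA_VER: '12.2.2', PY_VER: '3.10', GPU: 'l4', DRIVER: 'earliest', LINUX_VER: 'rockylinux8', DEPENDENCIES: 'oldest' }
# ... plus nightly-only additions, e.g. intermediate Python/CUDA versions
# or scarcer GPU architectures.
- { ARCH: 'amd64', CUDA_VER: '12.9.1', PY_VER: '3.11', GPU: 'h100', DRIVER: 'latest', LINUX_VER: 'ubuntu22.04', DEPENDENCIES: 'latest' }
```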

## Coverage Priorities

When trading off coverage for resource utilization, prioritize in this order:

### 1. CPU Architecture Coverage
**Goal**: Every PR and nightly build must test both `amd64` and `arm64`

- Both architectures must be represented in PR tests
- Failures on either architecture are equally important

### 2. CUDA Version Coverage
**Goal**: Test the minimum supported version, the latest stable version, and one intermediate version; a worked example follows the list below

- **Previous major, minimum supported version**
- **Latest major.minor version**
- **Latest major, earliest minor version**
- e.g. if 13.1 is the latest, use 13.0
- If this is the same as the latest major.minor, use the latest minor of the previous major (e.g. if 13.0 is the latest, use 12.9)
- If resources allow, also test the latest minor of the previous major
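
A worked example, using the CUDA versions listed in the Matrix Dimensions section (assume `12.2.2` through `13.0.1` is the supported range):

```yaml
# CUDA versions selected under the rules above (illustrative).
- CUDA_VER: '12.2.2'  # previous major, minimum supported version
- CUDA_VER: '13.0.1'  # latest major.minor
- CUDA_VER: '12.9.1'  # the latest major's earliest minor is 13.0, which is already the
                      # latest major.minor, so use the latest minor of the previous major
```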

### 3. Driver Version Coverage
**Goal**: Validate compatibility across the driver support range

- **Earliest supported driver**: Always test it with the oldest CUDA version (typically in both PR and nightly)
- **Latest driver**: Test with all CUDA versions (PR and nightly)
- These combinations catch driver compatibility and forward-compatibility issues
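
A minimal sketch of these pairings (versions illustrative):

```yaml
# Driver/CUDA pairings implied by the rules above.
- { CUDA_VER: '12.2.2', DRIVER: 'earliest' }  # earliest supported driver with the oldest CUDA
- { CUDA_VER: '12.2.2', DRIVER: 'latest' }    # the latest driver runs against every CUDA version
- { CUDA_VER: '13.0.1', DRIVER: 'latest' }
```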

### 4. Python Version Coverage
**Goal**: Test oldest and newest, sample intermediate versions in nightly

- **Oldest supported Python**
- **Newest supported Python**
- **Intermediate versions**: lower priority for coverage than oldest/newest; sprinkle these through the matrix

### 5. GPU Architecture Coverage
Different GPU families should be distributed across the matrix to validate portability.
Make sure to test on CUDA versions new enough to support that hardware.

### 6. Dependency Version Coverage
**Goal**: Validate against both oldest and latest dependencies

- **Oldest dependencies**: Use in at least one job in the matrix
- **Latest dependencies**: Use in most jobs
- It is up to each repository to use `rapids-dependency-file-generator` and `dependencies.yaml` to define its oldest supported dependencies; the matrix entry only sets an environment variable that the CI scripts interpret
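
For illustration, the fragment below sketches how a repository's `dependencies.yaml` might pin floor versions under a `dependencies: "oldest"` matrix entry, with a `null` matrix as the fallback. The package names and pins are placeholders, and the authoritative schema is the one defined by `rapids-dependency-file-generator`.

```yaml
# Hypothetical dependencies.yaml fragment (package pins are placeholders).
dependencies:
  test_python:
    specific:
      - output_types: [conda, requirements]
        matrices:
          - matrix: {dependencies: "oldest"}
            packages:
              - numpy==1.26.*  # floor version, selected when DEPENDENCIES=oldest
          - matrix: null       # fallback used by all other jobs
            packages:
              - numpy>=1.26
```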

### 7. Linux Distribution Coverage
**Goal**: Validate across enterprise and modern distributions

- Use operating systems with a mixture of glibc versions from oldest to newest
- Linux distribution diversity is secondary to other factors

## Rollout Strategy for Major Changes

### Adding new versions to the matrix

When making matrix changes (new CUDA/Python/OS versions, runner type changes):
1. **Create a long-lived branch**: Create a feature branch in `shared-workflows`.
> [!IMPORTANT]
> You must use the `rapidsai/shared-workflows` repo and not a fork.
> For security reasons, referencing a fork from downstream repositories will prevent the actions from running.
2. **Add the new versions**: Modify build/test matrices to use the new version in some jobs
3. **Update RAPIDS repos**: Update projects one-by-one to use `@feature-branch` in their `.github/workflows/*.yaml` files (see the sketch below the list).
4. **Merge and switch back**: Merge feature branch, then update projects back to `@main`
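
For step 3, a downstream repository temporarily points its reusable-workflow references at the feature branch. A hedged sketch, where the workflow file and branch names are placeholders:

```yaml
jobs:
  conda-python-tests:
    # During the rollout, track the shared-workflows feature branch ...
    uses: rapidsai/shared-workflows/.github/workflows/conda-python-tests.yaml@feature-branch
    secrets: inherit
    # ... then switch back to the default branch after the merge (step 4), e.g.:
    # uses: rapidsai/shared-workflows/.github/workflows/conda-python-tests.yaml@main
```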

This allows incremental matrix expansion across RAPIDS and provides rollback capability if issues arise.

See examples in:
- [PR #413 (Add CUDA 13.0)](https://github.com/rapidsai/shared-workflows/pull/413)
- [PR #412 (Add conda CUDA 13 workflows)](https://github.com/rapidsai/shared-workflows/pull/412)

### Modifying or removing versions from the matrix

When modifying or removing matrix elements, it may not be necessary to do the full rollout procedure above.
That process is really only needed when adding new builds that don't yet exist and must be created in RAPIDS dependency order.

1. **Announce deprecation**: Publish a RAPIDS Support Notice if needed (e.g., [RSN 54](https://docs.rapids.ai/notices/rsn0054/))
2. **Update the matrix**: Modify/remove build and test matrices
3. **Validate one repo**: Open a test PR downstream in a representative project such as rmm or cudf.
> [!NOTE]
> This ensures the workflow syntax is valid, the requested CI runners are operational, and the requested CI images exist
4. **Merge matrix changes**: Merge the workflow changes and close the test PR

See examples in:
- [PR #431 (Drop CUDA 12.0)](https://github.com/rapidsai/shared-workflows/pull/431)