ensure 'torch' CUDA wheels are installed in CI #2279

rapids-bot[bot] merged 9 commits into rapidsai:release/26.04
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
```yaml
# avoid pulling in 'torch' in places like DLFW builds that prefer to install it other ways
- matrix:
    no_pytorch: "true"
  packages:
```
This follows the pattern @trxcllnt has been introducing across RAPIDS: rapidsai/cugraph-gnn#421
Think rmm never needed patches for DLFW and so was missed in that round of PRs, because its `depends_on_pytorch` group doesn't end up in `test_python` or similar commonly-used lists.
/ok to test
Walkthrough: Adds a new CI script to download CUDA-specific PyTorch wheels, updates CI test scripts to use an environment-driven PIP constraint and to download/use CUDA wheels for PyTorch tests, and restructures …
Actionable comments posted: 1
🧹 Nitpick comments (2)

ci/download-torch-wheels.sh (1)

Lines 27-40: Keep the generated constraint file inside `TORCH_WHEEL_DIR`.

Writing `torch-constraints.txt` to `./` leaves shared state in the working tree even though the caller already gives this helper a per-run temp directory. Keeping the file under `${TORCH_WHEEL_DIR}` makes the whole download step self-contained and avoids cross-run collisions.

♻️ Suggested refactor

```diff
+TORCH_CONSTRAINTS="${TORCH_WHEEL_DIR}/torch-constraints.txt"
+
 rapids-dependency-file-generator \
   --output requirements \
   --file-key "torch_only" \
   --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION};dependencies=${RAPIDS_DEPENDENCIES};require_gpu_pytorch=true" \
-| tee ./torch-constraints.txt
+| tee "${TORCH_CONSTRAINTS}"

 rapids-pip-retry download \
   --isolated \
   --prefer-binary \
   --no-deps \
   -d "${TORCH_WHEEL_DIR}" \
   --constraint "${PIP_CONSTRAINT}" \
-  --constraint ./torch-constraints.txt \
+  --constraint "${TORCH_CONSTRAINTS}" \
   'torch'
```

dependencies.yaml (1)

Lines 401-409: Mirror the oldest/latest split on the conda path, or explain why conda can use relaxed version constraints.

The caller passes `dependencies=${RAPIDS_DEPENDENCIES}` to the generator for PyTorch conda packages, but the conda matrices ignore this selector. The requirements (wheel) path pins specific PyTorch versions for `dependencies=oldest` (e.g., `torch==2.9.0+cu129`), whereas the conda path always uses relaxed constraints (`pytorch-gpu>=2.9`). Either add the same oldest/latest branching to the conda matrices, or document why conda can rely on the solver to handle any PyTorch >=2.9 version safely while wheels cannot.
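If the conda path were to mirror the wheel path's oldest/latest split, the matrices might look roughly like the sketch below. This is illustrative only: the pinned version and the surrounding structure are assumptions, not copied from the repo's actual `dependencies.yaml`.

```yaml
# hypothetical sketch: mirroring the wheel path's oldest/latest split on
# the conda path; pinned versions below are illustrative, not verified
- output_types: conda
  matrices:
    - matrix:
        dependencies: "oldest"
        require_gpu_pytorch: "true"
      packages:
        - pytorch-gpu==2.9.0
    - matrix:
        require_gpu_pytorch: "true"
      packages:
        - pytorch-gpu>=2.9
```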
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `ci/test_wheel_integrations.sh`:

- Around lines 40-45: The CI skip log is out of sync with the gate that checks `CUDA_MAJOR`/`CUDA_MINOR`. Update the skip message (printed around line 66) to match the actual gate: indicate "CUDA 12.9+ (for 12.x) or 13.0" instead of "12.6-12.9 or 13.0" so it accurately reflects the condition in the `if` that checks `CUDA_MAJOR` and `CUDA_MINOR`.
📒 Files selected for processing (5)

- ci/download-torch-wheels.sh
- ci/test_python_integrations.sh
- ci/test_wheel.sh
- ci/test_wheel_integrations.sh
- dependencies.yaml
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:

In `ci/test_wheel_integrations.sh`:

- Around lines 40-45: The top comment stating "requires CUDA 12.8+" is now inaccurate relative to the conditional that permits CUDA 12.9+ for 12.x and only 13.0 for 13.x. Update that comment above the gating if-block (the block checking `CUDA_MAJOR` and `CUDA_MINOR`) to accurately describe the new policy (require CUDA 12.9+ on 12.x and CUDA 13.0 on 13.x), so future triage matches the condition in the `{ [ "${CUDA_MAJOR}" -eq 12 ] && [ "${CUDA_MINOR}" -ge 9 ]; } || { [ "${CUDA_MAJOR}" -eq 13 ] && [ "${CUDA_MINOR}" -le 0 ]; }` check.
📒 Files selected for processing (1)

- ci/test_wheel_integrations.sh
```shell
  -v \
  "${PIP_INSTALL_SHARED_ARGS[@]}" \
  -r test-pytorch-requirements.txt \
  "${TORCH_WHEEL_DIR}"/torch-*.whl
```
It looks to me like this is working and pulling in what we want!
CUDA 12.2.2, Python 3.11, arm64, ubuntu22.04, a100, latest-driver, latest-deps
RAPIDS logger » [03/06/26 19:41:00]
┌──────────────────────────────────────────────────────────────────────────┐
| Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 12.2.2) |
└──────────────────────────────────────────────────────────────────────────┘
CUDA 12.9.1, Python 3.11, amd64, ubuntu22.04, l4, latest-driver, oldest-deps
Successfully installed ... torch-2.9.0+cu129 ...
CUDA 12.9.1, Python 3.14, amd64, ubuntu24.04, h100, latest-driver, latest-deps
Successfully installed ... torch-2.10.0+cu129 ...
CUDA 13.0.2, Python 3.12, amd64, ubuntu24.04, l4, latest-driver, latest-deps
Successfully installed ... torch-2.10.0+cu130 ...
CUDA 13.0.2, Python 3.12, arm64, rockylinux8, l4, latest-driver, latest-deps
Successfully installed ... torch-2.10.0+cu130 ...
CUDA 13.1.1, Python 3.13, amd64, rockylinux8, rtxpro6000, latest-driver, latest-deps
RAPIDS logger » [03/06/26 19:35:46]
┌──────────────────────────────────────────────────────────────────────────┐
| Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 13.1.1) |
└──────────────────────────────────────────────────────────────────────────┘
CUDA 13.1.1, Python 3.14, amd64, ubuntu24.04, rtxpro6000, latest-driver, latest-deps
RAPIDS logger » [03/06/26 19:34:37]
┌──────────────────────────────────────────────────────────────────────────┐
| Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 13.1.1) |
└──────────────────────────────────────────────────────────────────────────┘
CUDA 13.1.1, Python 3.14, arm64, ubuntu24.04, l4, latest-driver, latest-deps
RAPIDS logger » [03/06/26 19:36:06]
┌──────────────────────────────────────────────────────────────────────────┐
| Skipping PyTorch tests (requires CUDA 12.9+ or 13.0, found 13.1.1) |
└──────────────────────────────────────────────────────────────────────────┘
Why can we support 12.9+ (with the "plus") but not 13.0+? Apologies if this has been covered elsewhere in the discussions leading up to this point.
I see PyTorch indices for cu126, cu128, and cu130 so I'm confused by why we have 12.9 running tests but not 13.1.
This log message wording and condition were wrong / imprecise.

If a hypothetical CTK 12.10 came out tomorrow, none of this configuration would automatically work for it. It more precisely should read:

requires CUDA 12.9 or 13.0

And the condition guarding it should be:

```shell
if \
  { [ "${CUDA_MAJOR}" -eq 12 ] && [ "${CUDA_MINOR}" -eq 9 ]; } \
  || { [ "${CUDA_MAJOR}" -eq 13 ] && [ "${CUDA_MINOR}" -eq 0 ]; }; \
```

I'll fix that here.
> I see PyTorch indices for cu126, cu128, and cu130 so I'm confused by why we have 12.9 running tests but not 13.1.
There is a cu129 index too, and it's being used in this PR.

```
--extra-index-url=https://download.pytorch.org/whl/cu129
```

I didn't go back further than 12.9 because:

- there aren't compatible `torch` wheels built against CUDA 12.2 ('torch' wheels not compatible with 'cuda-toolkit' (CUDA 12 only), build-planning#255)
- in RAPIDS CI we don't test intermediate CTK versions like 12.5 or 12.8

We can't test CUDA 13.1 because:

- `torch` CUDA wheels are `==`-pinned to a `{major}.{minor}.{patch}` CTK release
- there are not any published `torch` wheels for CUDA 13.1
I just pushed a3ca93d
Hopefully that clarifies it.
force-pushed from 18f60eb to 9efce26
🧹 Nitpick comments (1)

dependencies.yaml (1)

Lines 421-457: Consider consistent YAML anchor naming.

Minor style nit: the anchor names use slightly different patterns:

- Line 433: `&torch_cu129_index`
- Line 446: `&torch_index_cu13`

Consider aligning them for consistency (e.g., `&torch_cu129_index` / `&torch_cu130_index` or `&torch_index_cu129` / `&torch_index_cu130`).

♻️ Suggested naming alignment

```diff
   - matrix:
       cuda: "13.0"
       dependencies: "oldest"
       require_gpu: "true"
     packages:
-      - &torch_index_cu13 --extra-index-url=https://download.pytorch.org/whl/cu130
+      - &torch_cu130_index --extra-index-url=https://download.pytorch.org/whl/cu130
       - torch==2.9.0+cu130
   - matrix:
       cuda: "13.0"
       require_gpu: "true"
     packages:
-      - *torch_index_cu13
+      - *torch_cu130_index
       - torch==2.10.0+cu130
```
📒 Files selected for processing (3)

- ci/download-torch-wheels.sh
- ci/test_python_integrations.sh
- dependencies.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- ci/test_python_integrations.sh
- ci/download-torch-wheels.sh
Thanks so much for looking @bdice. I'm gonna merge this, please
/merge
…an optional dependency (#425)

Contributes to rapidsai/build-planning#256 and #410

## Ensures that `torch` CUDA wheels are always installed in CI

"Unintentionally installed a CPU-only `torch` from PyPI" is a failure mode I've seen a few times for CI in this project, and each time it's happened it has taken a bit of time to figure out that that was the root cause. This PR tries to fix that by:

* using a stricter install pattern to guarantee a CUDA `torch` wheel is installed if a compatible one exists
* using local versions like `+cu130` to help prevent pulling in packages from PyPI
* adding `dependencies.yaml` items for "oldest" dependencies so CI covers a range of supported versions

I tested similar patterns in rapidsai/rmm#2279 and saw them work well there.

## Makes `torch` truly optional

We want these packages to be installable and importable without `torch`, for use in RAPIDS DLFW builds (where we don't install `torch` alongside RAPIDS because it's built in other processes). I'd started relying on the assumption that they worked that way in this PR, but quickly realized that that isn't true... `torch` is used unconditionally in many ways in these libraries. This PR fixes that.

It makes `torch` optional and adds testing to ensure it stays that way:

* copying the `import_optional` machinery from `cugraph-pyg` into `pylibwholegraph`
* using `import_optional()` for all `torch` imports in the libraries
* using `pytest.importorskip("torch")` in tests
* deferring some `torch` imports from import time to run time
* adding the `flake8-tidy-imports:banned-api` check from `ruff` to enforce that `import torch` isn't used anywhere in library code or test code
* explicitly testing that the libraries are still installable and at least 1 unit test can run successfully after a `pip uninstall torch`
* adding a check in `ci/validate_wheel.sh` confirming that `torch` doesn't make it into any wheel metadata (which could happen due to mistakes in `dependencies.yaml` / `pyproject.toml`)

## Notes for Reviewers

Pulling these changes out of #413, so CI in this repo can immediately benefit from them and so #419 can be reverted. When this is merged, #413 will have a smaller diff and just be focused on testing against a range of CTKs.

Authors:

- James Lamb (https://github.com/jameslamb)
- Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:

- Alex Barghi (https://github.com/alexbarghi-nv)
- Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #425
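The `import_optional` pattern described above can be sketched roughly as follows. This is a simplified stand-in written for illustration, not the actual cugraph-pyg/pylibwholegraph implementation: the import always succeeds, and failure is deferred until the missing module is actually used.

```python
# simplified sketch of an "optional torch" import helper
import importlib


class MissingModule:
    """Placeholder that raises only when the missing module is actually used."""

    def __init__(self, name):
        self._name = name

    def __getattr__(self, attr):
        raise ModuleNotFoundError(
            f"'{self._name}' is required for this feature; install it to use it"
        )


def import_optional(name):
    """Return the real module if importable, else a MissingModule placeholder."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return MissingModule(name)


# library code can then do this unconditionally at import time:
torch = import_optional("torch")
```

With this in place, `import pylibwholegraph`-style imports succeed without `torch` installed, and only torch-dependent code paths raise.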
Description
Contributes to rapidsai/build-planning#256
Broken out from #2270
Proposes a stricter pattern for installing `torch` wheels, to prevent bugs of the form "accidentally used a CPU-only `torch` from pypi.org". This should help us to catch compatibility issues, improving release confidence.

Other small changes:

- adds a `require_gpu_pytorch` matrix filter so conda jobs can explicitly request `pytorch-gpu` (to similarly ensure solvers don't fall back to the CPU-only variant)
- writes `rapids-generate-pip-constraint` output to the file `PIP_CONSTRAINT` points to

Checklist
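As an illustration of the "never a CPU-only `torch`" goal described above: CUDA `torch` wheels carry a local-version suffix like `+cu130`, so its absence is a cheap signal that a CPU-only wheel slipped in. The helper below (`assert_cuda_torch`) is hypothetical, in the spirit of the `ci/validate_wheel.sh` check, not the repo's actual script; the version string would come from something like `python -c 'import torch; print(torch.__version__)'`.

```shell
# fail fast if an installed torch version lacks a CUDA local-version suffix
assert_cuda_torch() {
  local TORCH_VERSION="$1"
  case "${TORCH_VERSION}" in
    *+cu*) echo "ok: ${TORCH_VERSION} is a CUDA build" ;;
    *)
      echo "error: ${TORCH_VERSION} looks like a CPU-only build" >&2
      return 1
      ;;
  esac
}

assert_cuda_torch "2.10.0+cu130"   # prints "ok: 2.10.0+cu130 is a CUDA build"
# assert_cuda_torch "2.10.0"       # would fail: no +cuNNN suffix
```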