WIP: wheels CI: stricter torch index selection, test oldest versions of dependencies #413

jameslamb wants to merge 17 commits into rapidsai:release/26.04
Conversation
```yaml
# (useful in CI scripts where we want to tightly control which indices 'pip' uses).
- matrix:
    include_torch_extra_index: "false"
  packages:
```
rapids-dependency-file-generator uses the first matching matrix (see https://github.com/rapidsai/dependency-file-generator?tab=readme-ov-file#how-dependency-lists-are-merged).

This will only affect cases where `include_torch_extra_index=false` is passed (as in CI here). Other cases (like RAPIDS devcontainers) will fall through to other groups that pull in `--extra-index-url` lines.

So this should not break any other uses of this file.
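For illustration, a minimal sketch of how a CI script might opt in to this matrix entry — the file key and matrix values here are placeholders, not the project's actual ones:

```shell
# Generate a requirements file with the torch extra-index lines excluded.
# "py_test_pylibwholegraph" and the matrix values are illustrative.
rapids-dependency-file-generator \
  --output requirements \
  --file-key py_test_pylibwholegraph \
  --matrix "cuda=12.9;arch=$(arch);py=3.13;include_torch_extra_index=false" \
  | tee ./test-requirements.txt

# The generated file should contain no '--extra-index-url' lines, so 'pip'
# resolves everything from the single index the CI script points it at.
grep -c -- '--extra-index-url' ./test-requirements.txt || true
```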
```yaml
build_type: pull-request
script: ci/test_wheel_pylibwholegraph.sh
matrix_filter: map(select(.ARCH == "amd64"))
matrix_type: 'nightly'
```
```yaml
matrix_type: 'nightly'
```
TODO: revert this. Just added here for testing, to confirm this will fix the issues we've been seeing in nightlies.
```yaml
build_type: pull-request
script: ci/test_wheel_cugraph-pyg.sh
matrix_filter: map(select(.ARCH == "amd64"))
matrix_type: 'nightly'
```
```yaml
matrix_type: 'nightly'
```
Revert before merging.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
```yaml
- output_types: [conda]
  packages:
    - torchdata
    - pydantic
```
Just moving this here so `depends_on_pytorch` only ever contains `torch` / `pytorch`.

This `test_python_common` group is used everywhere that `depends_on_pytorch` is.
dependencies.yaml
```yaml
# 2.6.0 is the oldest version on https://download.pytorch.org/whl/cu126 with CUDA wheels
- torch==2.6.0
```
@alexbarghi-nv see this note.
There aren't CUDA 12 wheels available for PyTorch older than 2.6.0.
```shell
pip download \
  --isolated \
  --no-deps \
  --index-url=https://download.pytorch.org/whl/cu126 \
  'torch==2.3.0'
# ERROR: Could not find a version that satisfies the requirement torch==2.3.0
# (from versions: 2.6.0+cu126, 2.7.0+cu126, 2.7.1+cu126, 2.8.0+cu126, 2.9.0+cu126, 2.9.1+cu126, 2.10.0+cu126)
```

Do you want to bump the floor in dependency metadata here to `>=2.6.0`? Or leave it at `>=2.3` so that these libraries are still installable alongside older PyTorch releases (for example, if people build PyTorch 2.4 from source)?
Your call.
I would vote for bumping the floor to `>=2.6.0`. It's a little over a year old at this point. https://github.com/pytorch/pytorch/releases/tag/v2.6.0
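As a quick double-check of that floor, one option is pip's `index versions` subcommand (still marked experimental by pip, so the output format may vary):

```shell
# List the torch versions the cu126 index actually serves; everything reported
# should be >= 2.6.0, matching the comment in dependencies.yaml above.
pip index versions torch \
  --index-url=https://download.pytorch.org/whl/cu126
```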
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Nightly CI here has been failing for a couple of weeks, and the root cause is "some jobs are installing incorrect `torch` wheels". That's tracked in #410 and being worked on in #413. That work unfortunately uncovered some other significant compatibility issues that will require RAPIDS-wide fixes:

* rapidsai/build-planning#256
* rapidsai/build-planning#257
* rapidsai/cugraph#5443

As a short-term patch, this proposes allowing `cugraph-gnn` nightlies to fail for a few more weeks, so regular PR CI can be unblocked while we focus on the more permanent fix. Targeting the more permanent fix (and reverting this back to 7 days) for the 26.04 release (so over the next few weeks).

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Alex Barghi (https://github.com/alexbarghi-nv)
- Gil Forsyth (https://github.com/gforsyth)

URL: #419
Contributes to #5443
Related to rapidsai/build-planning#143

`libcugraph.so` dynamically links to several CUDA Toolkit libraries

```console
$ ldd /pyenv/versions/3.11.14/lib/python3.11/site-packages/libcugraph/lib64/libcugraph.so
...
	libcusolver.so.12 => /usr/local/cuda/lib64/libcusolver.so.12 (0x00007c616aba7000)
	libcublas.so.13 => /usr/local/cuda/lib64/libcublas.so.13 (0x00007c61675d5000)
	libcublasLt.so.13 => /usr/local/cuda/lib64/libcublasLt.so.13 (0x00007c6143c83000)
	libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007c613987c000)
	libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007c6131161000)
...
	libnvJitLink.so.13 => /usr/local/cuda/lib64/libnvJitLink.so.13 (0x00007c612af5f000)
...
```

This proposes getting them from `cuda-toolkit` wheels, instead of system installations.

## Notes for Reviewers

### Benefits of this change

* reduces the risk of multiple copies of the same library being loaded
* allows the use of Python package versioning to manage compatibility
* consistency with other RAPIDS libraries (see rapidsai/build-planning#35)
* reduces the risk of runtime issues with other libraries that use CTK wheels, like `torch` (rapidsai/cugraph-gnn#413 (comment))

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #5444
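As a sanity check (not part of the PR itself, just a sketch assuming the wheel layout shown in the `ldd` output above), one can re-run `ldd` after the change and confirm the CUDA libraries resolve from `site-packages` instead of `/usr/local/cuda`:

```shell
# Locate libcugraph.so inside the installed wheel and inspect its dynamic
# dependencies; after this change, the CUDA math libraries should resolve
# under site-packages rather than /usr/local/cuda.
LIBCUGRAPH_SO="$(python -c '
import pathlib
import libcugraph
print(pathlib.Path(libcugraph.__file__).parent / "lib64" / "libcugraph.so")
')"

ldd "${LIBCUGRAPH_SO}" | grep -E 'cusolver|cublas|cusparse|curand|nvJitLink'
```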
This PR uncovered a bunch of other issues that have spiraled off into other efforts: #410 (comment)

Most of the changes here are now on #425, and the rest will be addressed soon in rapidsai/build-planning#256.

Moving this to draft for now, but it might end up getting closed depending on how #425 goes.
Since this effort started, this work has spun out into a bunch of other PRs (#410 (comment)). I'm going to close this. It'll be replaced by:
…an optional dependency (#425)

Contributes to rapidsai/build-planning#256 and #410

## Ensures that `torch` CUDA wheels are always installed in CI

*"unintentionally installed a CPU-only `torch` from PyPI"* is a failure mode I've seen a few times for CI in this project, and each time it's happened it has taken a bit of time to figure out that that was the root cause.

This PR tries to fix that by:

* using a stricter install pattern to guarantee a CUDA `torch` wheel is installed if a compatible one exists
* using local versions like `+cu130` to help prevent pulling in packages from PyPI
* adding `dependencies.yaml` items for "oldest" dependencies so CI covers a range of supported versions

I tested similar patterns in rapidsai/rmm#2279 and saw them work well there.

## Makes `torch` truly optional

We want these packages to be installable and importable without `torch`, for use in RAPIDS DLFW builds (where we don't install `torch` alongside RAPIDS because it's built in other processes). I'd started relying on the assumption that they worked that way in this PR, but quickly realized that that isn't true... `torch` is used unconditionally in many ways in these libraries.

This PR fixes that. It makes `torch` optional and adds testing to ensure it stays that way:

* copying the `import_optional` machinery from `cugraph-pyg` into `pylibwholegraph`
* using `import_optional()` for all `torch` imports in the libraries
* using `pytest.importorskip("torch")` in tests
* deferring some `torch` imports from import time to run time
* adding the `flake8-tidy-imports:banned-api` check from `ruff` to enforce that `import torch` isn't used anywhere in library code or test code
* explicitly testing that the libraries are still installable and at least 1 unit test can run successfully after a `pip uninstall torch`
* adding a check in `ci/validate_wheel.sh` confirming that `torch` doesn't make it into any wheel metadata *(which could happen due to mistakes in `dependencies.yaml` / `pyproject.toml`)*

## Notes for Reviewers

Pulling these changes out of #413, so CI in this repo can immediately benefit from them and so #419 can be reverted. When this is merged, #413 will have a smaller diff and just be focused on testing against a range of CTKs.

Authors:
- James Lamb (https://github.com/jameslamb)
- Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
- Alex Barghi (https://github.com/alexbarghi-nv)
- Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #425
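A condensed sketch of the two mechanisms described above — the version and index are illustrative (borrowed from the cu126 discussion earlier in this thread), not the exact pins used in CI:

```shell
# 1) Stricter install: requiring a local version like +cu126 means pip can
#    only satisfy the requirement from the PyTorch index, so a CPU-only
#    wheel from pypi.org can never be silently substituted.
python -m pip install \
  --index-url=https://download.pytorch.org/whl/cu126 \
  'torch==2.6.0+cu126'

# 2) "torch is truly optional" smoke test: uninstall torch, then confirm
#    the library still imports without it.
python -m pip uninstall --yes torch
python -c 'import pylibwholegraph; print(pylibwholegraph.__version__)'
```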
Fixes #410

There, some nightly wheels tests were failing because CUDA 13 packages were being installed but testing against CUDA 12 `pylibwholegraph` packages. This fixes that, along with some other improvements to wheel testing:

* CUDA wheels for `torch` are always installed (no fallback to pypi.org CPU-only packages)
* the correct `torch` index (based on CUDA major version) is used
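A minimal sketch of what "the correct `torch` index based on CUDA major version" could look like in a CI script — the variable names and index URLs are assumptions based on the cu126/cu130 indices mentioned above, not the script's actual code:

```shell
# Pick the PyTorch index matching the CUDA major version under test.
RAPIDS_CUDA_VERSION="${RAPIDS_CUDA_VERSION:-12.9.1}"
case "${RAPIDS_CUDA_VERSION%%.*}" in
  12) TORCH_INDEX_URL="https://download.pytorch.org/whl/cu126" ;;
  13) TORCH_INDEX_URL="https://download.pytorch.org/whl/cu130" ;;
  *)  echo "unsupported CUDA version: ${RAPIDS_CUDA_VERSION}" >&2; exit 1 ;;
esac

# --index-url (not --extra-index-url), so pip cannot fall back to pypi.org
# and pull in a CPU-only torch build.
python -m pip install --index-url="${TORCH_INDEX_URL}" torch
```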