ci: Use stable Torch Release for cu130 #2174
Conversation
**CodeRabbit Walkthrough**

These changes update two CUDA 13.0-specific Dockerfiles to use the stable `torch` release instead of nightly builds.
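The switch boils down to pointing `pip` at PyTorch's stable wheel index rather than the nightly one. A minimal sketch of the distinction, assuming the publicly documented PyTorch index-URL layout (the helper function name is hypothetical, not part of this PR):

```shell
#!/bin/sh
# Hypothetical helper: build the PyTorch wheel-index URL for a CUDA tag.
# Stable wheels:  https://download.pytorch.org/whl/<cuda_tag>
# Nightly wheels: https://download.pytorch.org/whl/nightly/<cuda_tag>
torch_index_url() {
    cuda_tag="$1"   # e.g. cu130
    channel="$2"    # "stable" or "nightly"
    if [ "$channel" = "nightly" ]; then
        echo "https://download.pytorch.org/whl/nightly/${cuda_tag}"
    else
        echo "https://download.pytorch.org/whl/${cuda_tag}"
    fi
}

# Resolves to the stable cu130 index used in this PR.
torch_index_url cu130 stable
```

Installing from the stable index yields the latest stable release (2.9.1 at the time of this PR) rather than a `2.10.0.devYYYYMMDD` nightly.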
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
**Gemini Code Assist — Summary of Changes**

Hello @bkryu! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request serves as a temporary measure to re-trigger the CI container build process. Its main purpose is to debug an observed anomaly in a prior build, where an incorrect development version of PyTorch was installed. The changes introduce a diagnostic `--dry-run` installation step.
Code Review
This pull request appears to be for debugging a CI issue related to torch installation. The changes in `docker/install/install_python_packages.sh` add a `--dry-run` command and remove `--force-reinstall`. While the `--dry-run` is useful for debugging, it makes the installation process inefficient by running dependency resolution twice. More importantly, removing `--force-reinstall` is risky as it may fail to replace a pre-existing, incorrect torch version. I've suggested restoring the single-line installation with `--force-reinstall` for robustness. As a general note, the underlying issue might also be related to `torch` being present in `requirements.txt`, which could cause version conflicts during the subsequent installation step.
```sh
pip3 install torch --index-url https://download.pytorch.org/whl/${CUDA_VERSION} --dry-run
pip3 install torch --index-url https://download.pytorch.org/whl/${CUDA_VERSION}
```
The use of `--dry-run` is helpful for debugging, but this change introduces an inefficiency and a potential correctness issue. The pip command is executed twice, leading to redundant dependency resolution. More critically, removing the `--force-reinstall` flag may prevent the correct CUDA-specific version of `torch` from being installed if another version is already present in the environment. To ensure the installation is both correct and efficient, I recommend using a single command that includes `--force-reinstall`.
```diff
-pip3 install torch --index-url https://download.pytorch.org/whl/${CUDA_VERSION} --dry-run
-pip3 install torch --index-url https://download.pytorch.org/whl/${CUDA_VERSION}
+pip3 install --force-reinstall torch --index-url https://download.pytorch.org/whl/${CUDA_VERSION}
```
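Beyond the install command itself, a CI step could guard against a nightly slipping back in by asserting that the installed version string is a stable release. A hedged sketch (not part of this PR) that relies only on the version-string formats seen in this thread, e.g. `2.9.1+cu130` versus `2.10.0.dev20251203+cu130`:

```python
import re


def is_stable_release(version: str) -> bool:
    """Return True for stable versions like '2.9.1+cu130',
    False for dev/nightly versions like '2.10.0.dev20251203+cu130'."""
    # Strip the local version segment (e.g. '+cu130') before checking.
    public = version.split("+", 1)[0]
    # A stable release is plain dotted numbers with no dev/rc/a/b suffix.
    return re.fullmatch(r"\d+(\.\d+)*", public) is not None


# In CI one might run: assert is_stable_release(torch.__version__)
assert is_stable_release("2.9.1+cu130")
assert not is_stable_release("2.10.0.dev20251203+cu130")
```

A check like this would have flagged the `2.10.0.dev20251203+cu130` wheel before any unit tests ran.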
## 📌 Description

Previous PR flashinfer-ai#2167's container build is installing torch 2.10.dev12032025, because the cu130 Dockerfile installs unstable nightlies. We are seeing unit-test failures on torch 2.10 dev where `torch.einsum` causes a cuBLAS failure:

```
(py312) root@cc6b2de90050:/flashinfer# pip list | grep torch
pytorch-triton    3.5.1+gitbfeb0668
torch             2.10.0.dev20251203+cu130
(py312) root@cc6b2de90050:/flashinfer# python3
>>> import torch
>>> query = torch.randn(1, 4, 128, device="cuda", dtype=torch.float16)
>>> key = torch.randn(1, 4, 128, device="cuda", dtype=torch.float16)
>>> scores = torch.einsum("qhd,khd->qkh", query.float(), key.float())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/py312/lib/python3.12/site-packages/torch/functional.py", line 373, in einsum
    return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
```

As PyTorch has started releasing stable 2.9.1 for cu130, there is no longer a need to use nightlies.

Local testing with 2.9.1 resolves the `einsum` failures:

```
(py312) root@fd986dc62859:/flashinfer# pip3 install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu130 --no-deps
Looking in indexes: https://download.pytorch.org/whl/cu130
Collecting torch
  Downloading https://download.pytorch.org/whl/cu130/torch-2.9.1%2Bcu130-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (30 kB)
Downloading https://download.pytorch.org/whl/cu130/torch-2.9.1%2Bcu130-cp312-cp312-manylinux_2_28_x86_64.whl (612.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 612.6/612.6 MB 215.8 MB/s 0:00:01
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.10.0.dev20251203+cu130
    Uninstalling torch-2.10.0.dev20251203+cu130:
      Successfully uninstalled torch-2.10.0.dev20251203+cu130
Successfully installed torch-2.9.1+cu130
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
(py312) root@fd986dc62859:/flashinfer# python3
Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> query = torch.randn(1, 4, 128, device="cuda", dtype=torch.float16)
>>> key = torch.randn(1, 4, 128, device="cuda", dtype=torch.float16)
>>> scores = torch.einsum("qhd,khd->qkh", query.float(), key.float())
```

See [this line](https://github.com/flashinfer-ai/flashinfer/actions/runs/19940898985/job/57178116127?pr=2174#step:6:352) for the cu130 release container build job where the stable release 2.9.1 is fetched and installed.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Chores**
  * Improved the package installation process for containerized deployments to enhance reliability and consistency.
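For intuition on what the failing call computes: the contraction `"qhd,khd->qkh"` produces, for each head `h`, the dot product of query row `q` with key row `k`. A CPU-side sanity check of the contraction itself, using NumPy as a stand-in (this exercises only the einsum semantics, not the CUDA/cuBLAS path that actually fails):

```python
import numpy as np

# Same shapes as the repro above: (q, h, d) and (k, h, d).
rng = np.random.default_rng(0)
query = rng.standard_normal((1, 4, 128)).astype(np.float32)
key = rng.standard_normal((1, 4, 128)).astype(np.float32)

# "qhd,khd->qkh": contract over d, keep q, k, and head h.
scores = np.einsum("qhd,khd->qkh", query, key)  # shape (1, 1, 4)

# Equivalent explicit computation for verification.
manual = np.empty_like(scores)
for q in range(query.shape[0]):
    for k in range(key.shape[0]):
        for h in range(query.shape[1]):
            manual[q, k, h] = query[q, h] @ key[k, h]

assert np.allclose(scores, manual)
```

Under the hood this contraction is dispatched to a batched strided GEMM, which is why the nightly's bug surfaces as a `cublasSgemmStridedBatched` error rather than in einsum itself.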