Revert "[release 2.11] Update to torch 2.11" (#34644) #39300
vllm-agent wants to merge 1 commit into vllm-project:main
This reverts commit 2111997.
Documentation preview: https://vllm--39300.org.readthedocs.build/en/39300/
Code Review
This pull request downgrades the supported versions of PyTorch (from 2.11.0 to 2.10.0), CUDA (from 13.0.0 to 12.9.1), and associated dependencies such as torchvision, torchaudio, and torchao across the build system, Dockerfiles, and CI configurations. Key changes include unskipping tests previously incompatible with PyTorch 2.11 and simplifying the MoE layer name resolution. Review feedback identifies a critical shell syntax error in the Dockerfile caused by comments within a multi-line command and an inconsistency in RISC-V platform support within the CPU build requirements.
libcublas-${CUDA_VERSION_DASH} \
# Fixes nccl_allocator requiring nccl.h at runtime
# https://github.com/vllm-project/vllm/blob/1336a1ea244fa8bfd7e72751cabbdb5b68a0c11a/vllm/distributed/device_communicators/pynccl_allocator.py#L22
libnccl-dev && \
The comments inside the multi-line apt-get install command break shell execution. The backslash on line 556 continues the command onto line 557, which is a comment, so the apt-get install argument list ends there. Line 559 is then interpreted as a new (and invalid) command, and the build fails. Please move these comments above the command, or place them after the && operator on a separate line.
libcublas-${CUDA_VERSION_DASH} \
libnccl-dev && \
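The pattern flagged in this comment can be checked mechanically. Below is a minimal sketch of a hypothetical lint helper (`find_comments_in_continuations` is not part of vLLM or Docker) that reports comment lines sitting inside a backslash-continued command. The `RUN apt-get install -y` first line and the trailing `rm -rf` line are assumptions added to make the snippet complete, since the diff does not show them. The helper only detects the textual pattern; it does not decide whether a particular shell or Dockerfile parser tolerates it.

```python
def find_comments_in_continuations(text: str) -> list[int]:
    """Return 1-based line numbers of comment lines that appear
    inside a backslash-continued command."""
    flagged = []
    in_continuation = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if in_continuation and stripped.startswith("#"):
            # Comment line in the middle of a continued command.
            flagged.append(lineno)
            # Stay in "continuation" state so consecutive comments are caught.
            continue
        # A trailing backslash means the next line continues this command.
        in_continuation = stripped.endswith("\\")
    return flagged


# First and last lines are assumed for illustration; the middle four
# lines are taken from the diff under review.
SNIPPET = r"""RUN apt-get install -y \
    libcublas-${CUDA_VERSION_DASH} \
    # Fixes nccl_allocator requiring nccl.h at runtime
    # https://github.com/vllm-project/vllm/blob/1336a1ea244fa8bfd7e72751cabbdb5b68a0c11a/vllm/distributed/device_communicators/pynccl_allocator.py#L22
    libnccl-dev && \
    rm -rf /var/lib/apt/lists/*
"""

flagged = find_comments_in_continuations(SNIPPET)
print(flagged)  # [3, 4]
```

Running the helper on the suggested version (with the two comment lines removed) returns an empty list.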
torch==2.10.0+cpu; platform_machine == "x86_64" or platform_machine == "s390x"
torch==2.10.0; platform_machine == "aarch64" or platform_system == "Darwin" or platform_machine == "ppc64le"
The riscv64 platform is missing from the torch requirements in cpu-build.txt, creating an inconsistency with requirements/cpu.txt where it is present. This will cause build failures on RISC-V systems as the torch dependency will not be satisfied during the build phase.
torch==2.10.0+cpu; platform_machine == "x86_64" or platform_machine == "s390x"
torch==2.10.0; platform_machine == "aarch64" or platform_system == "Darwin" or platform_machine == "ppc64le" or platform_machine == "riscv64"
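The gap described here can be surfaced by comparing the `platform_machine` values named in the two sets of lines. The sketch below is a hypothetical consistency check, not an existing vLLM script: it extracts machine names with a regular expression from the current cpu-build.txt lines and from the suggested lines quoted above, then reports what the current lines lack. It inspects only `platform_machine == "..."` clauses and ignores other markers such as `platform_system`.

```python
import re

# Matches clauses like: platform_machine == "x86_64"
MACHINE_RE = re.compile(r'platform_machine\s*==\s*"([^"]+)"')


def machines(requirement_lines):
    """Collect every platform_machine value named in the given lines."""
    found = set()
    for line in requirement_lines:
        found.update(MACHINE_RE.findall(line))
    return found


# Lines as they appear in the diff under review.
current = [
    'torch==2.10.0+cpu; platform_machine == "x86_64" or platform_machine == "s390x"',
    'torch==2.10.0; platform_machine == "aarch64" or platform_system == "Darwin" or platform_machine == "ppc64le"',
]

# Lines as proposed in the review suggestion.
suggested = [
    'torch==2.10.0+cpu; platform_machine == "x86_64" or platform_machine == "s390x"',
    'torch==2.10.0; platform_machine == "aarch64" or platform_system == "Darwin" or platform_machine == "ppc64le" or platform_machine == "riscv64"',
]

missing = machines(suggested) - machines(current)
print(sorted(missing))  # ['riscv64']
```

The same function could be pointed at the contents of requirements/cpu.txt and cpu-build.txt to catch this kind of drift in CI, assuming both files keep their markers in this simple `==` form.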
Revert of #34644
This reverts the merge commit 2111997 from PR #34644.
Reason
5 new CI test failures detected in nightly build #60295:
Changes reverted
PR #34644 updated torch from 2.10 to 2.11, torchao from 0.14.1 to 0.17.0, and CUDA from 12.9 to 13.0, and modified moe_runner_base.py, the quantization CI config, and test dependencies. This revert rolls back all of these changes.
Auto-generated by CI failure analyzer | Build #60295