[Build] Switch CUDA 13.0 wheel builds to PyTorch manylinux_2_28 base #41416
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request introduces support for manylinux-based builds to ensure compatibility with a glibc 2.28 floor, matching PyTorch's published wheels. Key changes include the addition of a BUILD_OS build argument in the Dockerfile and Buildkite pipeline, conditional logic for package management (switching between apt and dnf), and specialized Python bootstrapping for manylinux environments using uv. I have no feedback to provide.
pytorch/manylinux2_28_aarch64-builder only ships CPU tags. The CUDA aarch64 variant lives at pytorch/manylinuxaarch64-builder (despite the asymmetric naming, AUDITWHEEL_POLICY=manylinux_2_28 is set inside the image). Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
Validation building the docker image locally TL;DR
I believe CI failures are unrelated and the release build passed, so I think we are good.
    # Upgrade to GCC 10 to avoid https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92519
    # as it was causing spam when compiling the CUTLASS kernels
    gcc-10 \
    g++-10 \
    && update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 110 --slave /usr/bin/g++ g++ /usr/bin/g++-10 \
I think we can remove this now! This is actually a downgrade, since we're no longer building on 20.04.
I can remove this in a separate PR since it is unrelated; this change just keeps the previous state.
…llm-project#41416) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Signed-off-by: Joachim Studnia <joachim@mistral.ai>
      queue: cpu_queue_release
      commands:
    -   - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg torch_cuda_arch_list=\"${CUDA_ARCH_X86}\" --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
    +   - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=13.0.2 --build-arg torch_cuda_arch_list=\"${CUDA_ARCH_X86}\" --build-arg BUILD_OS=manylinux --build-arg BUILD_BASE_IMAGE=pytorch/manylinux2_28-builder:cuda13.0 --tag vllm-ci:build-image --target build --progress plain -f docker/Dockerfile ."
Images like pytorch/manylinux2_28-builder:cuda13.0 are regularly updated on the torch side, which may cause regressions.
Should we use a stable image version like pytorch/manylinux2_28-builder:cuda13.0-v2.11.0-rc6 instead?
…llm-project#41416) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
…llm-project#41416) Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
Purpose
Lowers the glibc floor of the published CUDA 13.0 wheels from 2.34 (Ubuntu 22.04) to 2.28 (AlmaLinux 8 via `pytorch/manylinux2_28-builder:cuda13.0`), matching what PyTorch ships and restoring compatibility with older OSes (RHEL 8, Ubuntu 20.04, etc.).

Context: see slack discussion and #26118.
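To make the glibc floor concrete, here is a minimal sketch of checking whether a host meets the 2.28 requirement. The helper name `glibc_at_least` is hypothetical and not part of this PR; `getconf GNU_LIBC_VERSION` is the usual way to query glibc and returns nothing useful on non-glibc systems, which the sketch treats as "unknown".

```shell
# Hypothetical helper (not from this PR): succeeds if version $1 >= version $2.
# Only the MAJOR.MINOR components are compared.
glibc_at_least() {
    gal_have_major=${1%%.*}
    gal_have_rest=${1#*.}
    gal_have_minor=${gal_have_rest%%.*}
    gal_need_major=${2%%.*}
    gal_need_rest=${2#*.}
    gal_need_minor=${gal_need_rest%%.*}
    if [ "$gal_have_major" -ne "$gal_need_major" ]; then
        [ "$gal_have_major" -gt "$gal_need_major" ]
    else
        [ "$gal_have_minor" -ge "$gal_need_minor" ]
    fi
}

# Query the host glibc; on glibc systems this prints something like "glibc 2.31".
host_glibc=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')

if [ -n "$host_glibc" ] && glibc_at_least "$host_glibc" 2.28; then
    echo "host glibc $host_glibc satisfies the manylinux_2_28 floor"
else
    echo "host glibc ${host_glibc:-unknown} is below 2.28 or is not glibc"
fi
```

Under this comparison, a host with glibc 2.28 passes the new floor but would not pass the previous 2.34 floor.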
Changes:
- `docker/Dockerfile`: new `BUILD_OS` arg (`ubuntu` default, `manylinux` opt-in) branches the system-deps install (apt → dnf), Python bootstrap (uv-fetched → `/opt/python/cpXY-cpXY`), and the `dev` stage's libnuma install. Both paths converge on `/opt/venv` so downstream stages (`csrc-build`, `extensions-build`, `build`) are unchanged.
- `.buildkite/release-pipeline.yaml`: cu130 wheel builds (x86 + aarch64) now pass `BUILD_OS=manylinux` + the pytorch base image. `FINAL_BASE_IMAGE`, cu129 builds, and the release Docker image build are unchanged; the runtime container stays on Ubuntu.
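The `BUILD_OS` branching described above might look roughly like the following shell logic inside a `RUN` step. This is only a sketch under stated assumptions: the function names `install_cmd_for` and `manylinux_python_dir` are hypothetical, the package list is a placeholder, and the actual Dockerfile in this PR may structure the branch differently. The `/opt/python/cpXY-cpXY` layout is the one manylinux images conventionally use for their bundled interpreters.

```shell
# Hypothetical sketch of the BUILD_OS branch; not copied from docker/Dockerfile.

# Pick the package-manager invocation for the given build OS.
install_cmd_for() {
    case "$1" in
        ubuntu)    echo "apt-get install -y" ;;
        manylinux) echo "dnf install -y" ;;
        *)         echo "unsupported BUILD_OS: $1" >&2; return 1 ;;
    esac
}

# Map a Python version such as "3.12" to the interpreter directory
# that manylinux images ship, e.g. /opt/python/cp312-cp312.
manylinux_python_dir() {
    mpd_tag="cp$(printf '%s' "$1" | tr -d '.')"
    echo "/opt/python/${mpd_tag}-${mpd_tag}"
}

# ubuntu stays the default, so existing builds that do not pass the arg are unaffected.
BUILD_OS="${BUILD_OS:-ubuntu}"
echo "BUILD_OS=$BUILD_OS -> $(install_cmd_for "$BUILD_OS") <packages>"
```

Failing loudly on an unknown `BUILD_OS` value is a design choice in this sketch, so a typo in the pipeline's `--build-arg` would surface at build time instead of silently taking the Ubuntu path.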
Test Plan
- `--target build` succeeds for x86_64 cu130
- `auditwheel show` on the produced wheel reports `manylinux_2_28_x86_64`

Validated in cu130 release build with:
- arm: https://buildkite.com/vllm/release-v2/builds/1150/canvas?sid=019de0f0-babe-4073-b87a-ee91de535bd5&tab=output
- x86: https://buildkite.com/vllm/release-v2/builds/1150/canvas?sid=019de0f0-bac5-4dd5-93d7-3c7ef554f632&tab=output
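As a lightweight complement to the `auditwheel show` check, the declared platform tag can also be read from the wheel filename, since the wheel naming convention puts it in the last dash-separated field. The sketch below assumes a hypothetical helper `wheel_platform_tag` and a made-up wheel filename. It only inspects the name, whereas `auditwheel show` inspects the symbols the binaries actually require, so it does not replace that check.

```shell
# Hypothetical helper (not from this PR): print the platform tag of a wheel,
# i.e. the last dash-separated field of its filename.
wheel_platform_tag() {
    wpt_name=$(basename "$1" .whl)
    echo "${wpt_name##*-}"
}

# Example with an illustrative filename; real wheel names will differ.
wheel_platform_tag "dist/example_pkg-1.0.0-cp312-cp312-manylinux_2_28_x86_64.whl"

# The authoritative check from the test plan would be run separately, e.g.:
#   auditwheel show dist/<wheel file>
```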
Test Result
Pending validation.
AI assistance (Claude) was used to draft the Dockerfile branching.