[Build] Switch default CUDA to 13.0, update CUDA architecture lists, clean up stale build-args #39878
Make CUDA 13.0 the default build variant and CUDA 12.9 the optional variant. This affects:
- VLLM_MAIN_CUDA_VERSION: 12.9 -> 13.0 (envs.py)
- Default wheel variant alias: cu129 -> cu130 (nightly index)
- Release pipeline: CUDA 13.0 images/wheels are now the default (no suffix), CUDA 12.9 becomes the cu129 variant
- Release annotations: updated wheel filenames and Docker tags
- Ray compatibility: PyTorch index URL updated to cu130
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
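The default/variant naming described in this commit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helpers variant_alias and wheel_local_suffix are hypothetical and do not exist in vLLM; only the 13.0 default and the cu129/cu130 naming come from the commit message.

```python
# Illustrative only: these helpers are hypothetical and not part of vLLM.
MAIN_CUDA_VERSION = "13.0"  # value of VLLM_MAIN_CUDA_VERSION after this change


def variant_alias(cuda_version: str) -> str:
    """Map a CUDA version such as '12.9' to a variant alias such as 'cu129'."""
    major, minor = cuda_version.split(".")
    return f"cu{major}{minor}"


def wheel_local_suffix(cuda_version: str, main: str = MAIN_CUDA_VERSION) -> str:
    """Return '' for the default build and '+cuXYZ' for any other variant."""
    return "" if cuda_version == main else "+" + variant_alias(cuda_version)


print(repr(wheel_local_suffix("13.0")))  # ''
print(repr(wheel_local_suffix("12.9")))  # '+cu129'
```

Passing main="12.9" reproduces the previous behavior, where the CUDA 12.9 build was the unsuffixed one.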
Update CUDA architecture lists following PyTorch RELEASE.md (https://github.com/pytorch/pytorch/blob/main/RELEASE.md).
Default (CUDA 13.0):
- x86_64: 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX
- aarch64: 8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX
CUDA 12.9 variant: same arches without +PTX forward compat (and without 11.0 on aarch64).
Notable arch inclusions beyond PyTorch defaults:
- SM86 for broader Ampere coverage (e.g. RTX 3060/3070)
- SM89 for marlin fp8 support (QMMA SASS instruction)
Define CUDA_ARCH_{X86,AARCH64}{,_CU129} env vars in the release pipeline to reduce repetition across wheel and Docker image builds. Also update Dockerfile default and CI build scripts to match.
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
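To make the +PTX suffix concrete, here is a minimal sketch of how an arch list string in this format could be expanded into nvcc -gencode flags. The function gencode_flags is a hypothetical helper written for this illustration, not the code PyTorch or vLLM actually run; it assumes the usual convention that every entry emits SASS for that architecture and that a +PTX entry additionally embeds PTX for forward compatibility.

```python
def gencode_flags(arch_list: str) -> list[str]:
    """Expand a space-separated arch list into nvcc -gencode flags (sketch)."""
    flags = []
    for entry in arch_list.split():
        wants_ptx = entry.endswith("+PTX")
        num = entry.removesuffix("+PTX").replace(".", "")  # "12.0" -> "120"
        # SASS (cubin) for exactly this architecture.
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if wants_ptx:
            # PTX that the driver can JIT-compile for newer GPUs.
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags


for flag in gencode_flags("7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"):
    print(flag)
```

Under this reading, the CUDA 12.9 lists simply skip the final code=compute_120 flag.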
The FLASHINFER_AOT_COMPILE ARG was removed from the Dockerfile in PR vllm-project#32627, but stale references remained in the release pipeline and docker-bake.hcl. Clean them up.
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
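A stale build-arg of this kind can be detected mechanically. The sketch below is a hypothetical check, not something this PR adds: it compares the names passed via --build-arg NAME=value against the ARG declarations in a Dockerfile. The sample Dockerfile and command strings are made up for the example, and the check ignores Docker's predefined args and the value-less --build-arg NAME form.

```python
import re

NAME = r"([A-Za-z_][A-Za-z0-9_]*)"


def declared_args(dockerfile_text: str) -> set[str]:
    """Collect names declared with ARG in a Dockerfile."""
    return set(re.findall(rf"^\s*ARG\s+{NAME}", dockerfile_text, re.MULTILINE))


def stale_build_args(dockerfile_text: str, pipeline_text: str) -> set[str]:
    """Return --build-arg names that the Dockerfile does not declare."""
    used = set(re.findall(rf"--build-arg\s+{NAME}=", pipeline_text))
    return used - declared_args(dockerfile_text)


# Made-up sample inputs for illustration.
dockerfile = "ARG CUDA_VERSION=13.0\nARG torch_cuda_arch_list='7.5 8.0'\n"
pipeline = (
    "docker build --build-arg CUDA_VERSION=13.0 "
    "--build-arg FLASHINFER_AOT_COMPILE=true ."
)
print(stale_build_args(dockerfile, pipeline))  # {'FLASHINFER_AOT_COMPILE'}
```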
@claude review
Code Review
This pull request transitions the build and release pipelines to CUDA 13.0 as the primary version, updating environment variables, Docker build arguments, and script defaults accordingly. Feedback indicates that the updated CUDA architecture lists omit support for SM70 (Volta/V100) and SM121 (DGX Spark), which may cause regressions for users on that hardware. Furthermore, a missing manylinux version argument was noted in the nightly wheel upload script for CUDA 12.9 builds.
Pull request overview
Updates vLLM’s build/release configuration to make CUDA 13.0 the default (CUDA 12.9 becomes the cu129 variant), refreshes CUDA arch lists, and removes stale build-arg usage related to FlashInfer.
Changes:
- Switch default CUDA version from 12.9 → 13.0 across env/config and Buildkite scripts.
- Update
TORCH_CUDA_ARCH_LIST/torch_cuda_arch_listdefaults and pipeline arch lists (including+PTXfor the default variant). - Remove stale
FLASHINFER_AOT_COMPILEbuild-arg references from Docker bake targets and Buildkite build commands.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| vllm/envs.py | Bumps VLLM_MAIN_CUDA_VERSION default to 13.0. |
| docker/versions.json | Updates default TORCH_CUDA_ARCH_LIST to the new arch set and +PTX. |
| docker/docker-bake.hcl | Removes FLASHINFER_AOT_COMPILE build arg from targets. |
| docker/Dockerfile | Updates torch_cuda_arch_list default used to set TORCH_CUDA_ARCH_LIST in build stages. |
| .github/workflows/scripts/build.sh | Refreshes wheel build TORCH_CUDA_ARCH_LIST used in GHA builds. |
| .buildkite/scripts/generate-and-upload-nightly-index.sh | Changes default nightly index alias from cu129 → cu130. |
| .buildkite/scripts/check-ray-compatibility.sh | Updates PyTorch CUDA index URL default to cu130. |
| .buildkite/scripts/annotate-release.sh | Adjusts annotated wheel/image commands to reflect new default/variant CUDA mapping. |
| .buildkite/release-pipeline.yaml | Reworks release pipeline to default to CUDA 13.0; introduces top-level arch env vars; updates tags/manifests/variant naming. |
| .buildkite/image_build/image_build_torch_nightly.sh | Updates nightly build torch_cuda_arch_list used for the torch-nightly test image. |
The upload-nightly-wheels.sh script defaults to manylinux_2_31, so the aarch64 CUDA 12.9 wheel was already getting the right tag. But x86_64 CUDA 12.9 had the arg explicitly while aarch64 did not; make it explicit for consistency.
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Are we completely abandoning Volta, even for CUDA 12?
@ehfd PyTorch has dropped Volta support from its 12.8 build: pytorch/pytorch#172351
CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
# aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
CUDA_ARCH_AARCH64: "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"
CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
CUDA_ARCH_AARCH64_CU129: "8.0 8.7 8.9 9.0 10.0 12.0"
Suggested change:
- CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
- # aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
- CUDA_ARCH_AARCH64: "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"
- CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
- CUDA_ARCH_AARCH64_CU129: "8.0 8.7 8.9 9.0 10.0 12.0"
+ CUDA_ARCH_X86: "7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"
+ # aarch64 only architectures: 8.7 for Orin, 11.0 for Thor (since CUDA 13)
+ CUDA_ARCH_AARCH64: "8.0 8.6 8.7 8.9 9.0 10.0 10.3 11.0 12.0+PTX"
+ CUDA_ARCH_X86_CU129: "7.5 8.0 8.6 8.9 9.0 10.0 12.0"
+ CUDA_ARCH_AARCH64_CU129: "8.0 8.6 8.7 8.9 9.0 10.0 12.0"
Both 8.6 and 10.3 are applicable to x86 and aarch64;
see https://developer.nvidia.com/cuda/gpus
8.6 - covers a long list of PCIe GPUs
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).
Thanks!
- We have never compiled for sm_86 on aarch64, and IIRC it can use the cubin from sm_80.
- For CUDA 13+, we use the family specifier 10.0f, so 10.3 is included in kernels that require tcgen05, and for other kernels 10.0 is enough to be compatible, I think.
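The compatibility argument in this reply can be checked with a small sketch. It encodes the general CUDA rule of thumb that a cubin runs on devices with the same major compute capability and an equal or higher minor version, while embedded PTX can be JIT-compiled for any equal or higher compute capability. The helpers parse_arch_list and is_covered are hypothetical, and the sketch deliberately ignores architecture-specific (a) and family-specific (f) targets such as the 10.0f mentioned above.

```python
def parse_arch_list(arch_list: str) -> list[tuple[int, int, bool]]:
    """Parse '8.0 12.0+PTX' into (major, minor, has_ptx) tuples."""
    parsed = []
    for entry in arch_list.split():
        has_ptx = entry.endswith("+PTX")
        major, minor = entry.removesuffix("+PTX").split(".")
        parsed.append((int(major), int(minor), has_ptx))
    return parsed


def is_covered(device_cc: tuple[int, int], arch_list: str) -> bool:
    """Check whether a device can run code built for arch_list (sketch)."""
    dev_major, dev_minor = device_cc
    for major, minor, has_ptx in parse_arch_list(arch_list):
        if major == dev_major and minor <= dev_minor:
            return True  # binary-compatible cubin within the same major version
        if has_ptx and (major, minor) <= (dev_major, dev_minor):
            return True  # forward-compatible PTX
    return False


AARCH64 = "8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX"
X86_CU129 = "7.5 8.0 8.6 8.9 9.0 10.0 12.0"

print(is_covered((8, 6), AARCH64))    # True: the sm_80 cubin runs on 8.6
print(is_covered((10, 3), AARCH64))   # True: the sm_100 cubin runs on 10.3
print(is_covered((7, 0), X86_CU129))  # False: no 7.0 entry and no PTX
```

Coverage in this sense only says the code can load; kernels that need newer instructions (for example tcgen05) still depend on how each kernel is compiled.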
  export MAX_JOBS=1
  # Make sure release wheels are built for the following architectures
- export TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
+ export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
Suggested change:
- export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
+ export TORCH_CUDA_ARCH_LIST="7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"
see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).
This list is no longer used since we do not run GitHub Actions. It is kept for reference only.
  # install kv_connectors if requested
- ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
Suggested change:
- ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX'
see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).
Same as above -- 10.3 is absorbed by 10.0f family on CUDA 13 in supported kernels, and generally can reuse the 10.0 cubin.
  # See https://github.com/pytorch/pytorch/pull/123243
  # From versions.json: .torch.cuda_arch_list
- ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
Suggested change:
- ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX'
+ ARG torch_cuda_arch_list='7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX'
see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).
Same as above -- 10.3 is absorbed by 10.0f family on CUDA 13 in supported kernels, and generally can reuse the 10.0 cubin.
  },
  "TORCH_CUDA_ARCH_LIST": {
-   "default": "7.0 7.5 8.0 8.9 9.0 10.0 12.0"
+   "default": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
Suggested change:
-   "default": "7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX"
+   "default": "7.5 8.0 8.6 8.9 9.0 10.0 10.3 12.0+PTX"
see https://developer.nvidia.com/cuda/gpus
10.3 - B300 (works with x86) and GB300 (ARM-based Grace CPU).
Same as above -- 10.3 is absorbed by 10.0f family on CUDA 13 in supported kernels, and generally can reuse the 10.0 cubin.
  set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")
  elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
  CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
- set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0;12.1")
+ set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
Suggested change:
- set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")
- elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
- CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
- set(CUDA_SUPPORTED_ARCHS "7.0;7.2;7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;12.0;12.1")
- set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
+ set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1")
+ elseif(DEFINED CMAKE_CUDA_COMPILER_VERSION AND
+ CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8)
+ set(CUDA_SUPPORTED_ARCHS "7.5;8.0;8.6;8.7;8.9;9.0;10.0;11.0;12.0")
I think the lists need to be flipped.
CUDA 12.8 doesn't support 10.3, 11.0, or 12.1.
see https://docs.nvidia.com/cuda/archive/12.8.2/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
We are only using CUDA 12.9, so it should be safe to use those architectures. Please also see the comments above, which explain why we no longer specify 10.3 and 12.1 as of CUDA 13.0.
Ack. It could mislead users that the condition checks for CUDA 12.8 when in reality the list is meant for CUDA 12.9.
BTW 12.9 doesn't support CC 11.0 - see https://docs.nvidia.com/cuda/archive/12.9.1/cuda-compiler-driver-nvcc/index.html#virtual-architecture-feature-list
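The version concern raised in this thread can be illustrated with a small filter that drops architectures a given toolkit cannot target. This is a hypothetical sketch, not the project's CMake logic: the minimum versions in MIN_CUDA_FOR_ARCH are taken only from the claims made in this conversation (10.3 and 12.1 need 12.9, 11.0 needs 13.0), and architectures retired or renamed in newer toolkits are not modeled.

```python
# Minimum CUDA toolkit version per compute capability, as stated in this thread.
MIN_CUDA_FOR_ARCH = {"10.3": (12, 9), "12.1": (12, 9), "11.0": (13, 0)}


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '12.9' into (12, 9) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def filter_archs(archs: str, cuda_version: str) -> list[str]:
    """Drop semicolon-separated archs that the given toolkit cannot target."""
    toolkit = parse_version(cuda_version)
    return [
        arch
        for arch in archs.split(";")
        if toolkit >= MIN_CUDA_FOR_ARCH.get(arch, (0, 0))
    ]


REQUESTED = "7.5;8.0;8.6;8.7;8.9;9.0;10.0;10.1;10.3;12.0;12.1"
print(filter_archs(REQUESTED, "12.8"))  # 10.3 and 12.1 are dropped
print(filter_archs(REQUESTED, "12.9"))  # the full list is kept
```

A guard like this would make a branch gated on 12.8 safe even though the list it sets is written with 12.9 in mind.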
dmitry-tokarev-nv left a comment:
Posted comments with suggestions.
…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Adrian <info@zzit.ch>
…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
…clean up stale build-args (vllm-project#39878) Signed-off-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Make CUDA 13.0 the default. CUDA 12.9 becomes the cu129 variant; CUDA 13.0 wheels and Docker images are now the unsuffixed default (pip install vllm). This aligns with the torch 2.11 release.
- Remove stale FLASHINFER_AOT_COMPILE build-arg references left over after "[CI]: remove unused FLASHINFER_AOT_COMPILE build argument" (#32627) removed the ARG from the Dockerfile.
We need to announce this to users after it is merged.
Details
Default CUDA version swap (12.9 → 13.0)
| Item | Before | After |
|---|---|---|
| VLLM_MAIN_CUDA_VERSION | 12.9 | 13.0 |
| Variant wheel suffix | +cu130 | +cu129 |
| Default Docker image | vllm-openai:latest = CUDA 12.9 | vllm-openai:latest = CUDA 13.0 |
| Variant Docker image | vllm-openai:latest-cu130 | vllm-openai:latest-cu129 |
| Default nightly index alias | cu129 | cu130 |
Files: vllm/envs.py, release-pipeline.yaml, annotate-release.sh, generate-and-upload-nightly-index.sh, check-ray-compatibility.sh
CUDA architecture lists
New arch lists, following PyTorch RELEASE.md:
| Platform | CUDA 13.0 (default) | CUDA 12.9 |
|---|---|---|
| x86_64 | 7.5 8.0 8.6 8.9 9.0 10.0 12.0+PTX | 7.5 8.0 8.6 8.9 9.0 10.0 12.0 |
| aarch64 | 8.0 8.7 8.9 9.0 10.0 11.0 12.0+PTX | 8.0 8.7 8.9 9.0 10.0 12.0 |
Volta (SM70 / SM72) architectures are removed since they are no longer supported by PyTorch 2.11 with newer CUDA.
Notable inclusions beyond upstream PyTorch defaults:
- SM86 for broader Ampere coverage (e.g. RTX 3060/3070)
- SM89 for marlin fp8 support (QMMA.16832.F32.E4M3.E4M3 SASS instruction, only supported on SM89 and SM120+)
CUDA 12.9 builds omit +PTX since they are not the forward-compatible default.
Arch lists are defined as top-level env: variables in release-pipeline.yaml (CUDA_ARCH_X86, CUDA_ARCH_AARCH64, CUDA_ARCH_X86_CU129, CUDA_ARCH_AARCH64_CU129) and referenced via ${...} in all build commands, eliminating 12 hardcoded duplicates.
Files: release-pipeline.yaml, docker/Dockerfile, docker/versions.json, image_build_torch_nightly.sh, .github/workflows/scripts/build.sh
FLASHINFER_AOT_COMPILE cleanup
The FLASHINFER_AOT_COMPILE Dockerfile ARG was proposed for removal in #32627. This PR includes the matching cleanup.
Test Plan
Will be tested by CI and release pipeline.
Test Result
CI pipeline: https://buildkite.com/vllm/ci/builds/61430/. This PR does not introduce new regressions.
Release pipeline: https://buildkite.com/vllm/release-v2/builds/665/. All passed.