feat: Test cuda 12.9 #3064
Signed-off-by: Dillon Cullinan <[email protected]>
Walkthrough
CUDA versions were incremented across container build assets. Dockerfile ARGs now reference CUDA 12.9/25.01 images instead of 12.8. In vLLM’s runtime stage, the apt package cuda-command-line-tools was updated from 12-8 to 12-9. No other logic, structure, or build steps changed.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 2
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- container/Dockerfile (1 hunk)
- container/Dockerfile.sglang (1 hunk)
- container/Dockerfile.vllm (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - sglang
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
container/Dockerfile.vllm (2)
203-204: Good: update to cuda-command-line-tools-12-9 for DeepGEMM/cuobjdump
This matches the CUDA 12.9 runtime and unblocks cuobjdump/nvdisasm use in runtime. LGTM.
10-15: Validate image tags and Torch backend compatibility
- Image existence couldn't be confirmed in the sandbox (docker not installed). Verify these registry tags exist and update if missing: nvcr.io/nvidia/cuda-dl-base:25.09-cuda12.9-devel-ubuntu24.04 and nvcr.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04.
- TORCH_BACKEND is pinned to cu128 and install_vllm.sh installs torch==2.7.1+cu128 (see container/deps/vllm/install_vllm.sh:33,137 and container/Dockerfile.vllm:18,145). Confirm cu128 wheels are compatible with a CUDA 12.9 toolchain; if your environment requires cu129, update TORCH_BACKEND and the pinned wheel versions.
- Recommendation: add ARG CUDA_VERSION=12.9 and ARG CUDA_PKG_VER=12-9 and reference them in image tags/package names to reduce drift.
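The recommendation above can be sketched as a small shell helper that derives the dependent strings from one version number. This is a minimal sketch, not the repository's build script: the function names `cuda_pkg_ver` and `torch_backend` are hypothetical, and whether a `cu129` wheel exists for the pinned torch version is not confirmed anywhere in this thread.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: 12.9 -> 12-9, the suffix used by apt packages
# such as cuda-command-line-tools-12-9.
cuda_pkg_ver() {
  local version="$1"
  echo "${version//./-}"
}

# Hypothetical helper: 12.9 -> cu129, following the naming pattern of
# TORCH_BACKEND values such as cu128. This only builds the string; it does
# not check that a matching wheel is published.
torch_backend() {
  local version="$1"
  echo "cu${version//./}"
}

# Assumed value, taken from the PR title.
version="12.9"
echo "CUDA_VERSION=${version}"
echo "CUDA_PKG_VER=$(cuda_pkg_ver "$version")"
echo "TORCH_BACKEND=$(torch_backend "$version")"
echo "APT_PACKAGE=cuda-command-line-tools-$(cuda_pkg_ver "$version")"
```

If the Dockerfiles declared the suggested `CUDA_VERSION` and `CUDA_PKG_VER` ARGs, these values could be passed with `docker build --build-arg`, so a version bump would touch one place instead of several.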
# for details and reproducer to manually test if the image
# can be updated to later versions.
-ARG BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"
+ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
🧩 Analysis chain
Tag likely invalid: cuda-dl-base 25.01 + CUDA 12.9 combo probably doesn’t exist
25.01 predates CUDA 12.9; nvcr.io/nvidia/cuda-dl-base is unlikely to publish 25.01-cuda12.9-*. This will 404 at build time. Align the HPC track to the first release that actually ships CUDA 12.9 (e.g., 25.09+), or drop the 25.01 constraint if you’re unblocking tests.
Run to confirm the exact valid tag (may require NGC auth):
Optional example fix (adjust to the tag you find):
-ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
+ARG BASE_IMAGE_TAG="25.09-cuda12.9-devel-ubuntu24.04"
🏁 Script executed:
#!/bin/bash
set -euo pipefail
REPO="nvcr.io/nvidia/cuda-dl-base"
for track in 25.01 25.05 25.07 25.08 25.09 25.10 25.11 25.12; do
tag="${track}-cuda12.9-devel-ubuntu24.04"
echo "Checking $REPO:$tag ..."
if docker manifest inspect "$REPO:$tag" >/dev/null 2>&1; then
echo "FOUND: $REPO:$tag"
fi
done
Length of output: 591
Invalid base image tag — fix before merge
container/Dockerfile:15 — ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
The supplied manifest check shows none of the tested nvcr.io/nvidia/cuda-dl-base:*-cuda12.9-devel-ubuntu24.04 tags exist (docker manifest inspect returned no matches). This will 404 at build time.
Action: replace the ARG with a published nvcr.io/nvidia/cuda-dl-base tag that actually includes CUDA 12.9 (or remove the 25.01 track constraint) and verify with docker manifest inspect or the NGC registry before merging.
🤖 Prompt for AI Agents
In container/Dockerfile around line 15, ARG
BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04" references a non-existent
nvcr.io/nvidia/cuda-dl-base tag and will 404 at build time; replace that ARG
with a published nvcr.io/nvidia/cuda-dl-base tag that includes CUDA 12.9 (or
remove the "25.01" track constraint) and then verify the chosen tag exists by
running docker manifest inspect nvcr.io/nvidia/cuda-dl-base:<TAG> (or check the
NGC registry) before merging.
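One caveat when re-running the manifest check: with output redirected to /dev/null, a missing docker CLI and a missing tag look identical, and another comment in this review notes that docker was not installed in the sandbox. The sketch below separates the three outcomes. The function name `check_tag` is hypothetical; `docker manifest inspect` is the same command used in the executed script, and it may need registry credentials for nvcr.io.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: report FOUND, MISSING, or UNKNOWN for one image
# reference, so an absent docker CLI is not mistaken for an absent tag.
check_tag() {
  local ref="$1"
  if ! command -v docker >/dev/null 2>&1; then
    echo "UNKNOWN (docker CLI not available): $ref"
    return 0
  fi
  if docker manifest inspect "$ref" >/dev/null 2>&1; then
    echo "FOUND: $ref"
  else
    # Also reached on authentication or network failures.
    echo "MISSING or inaccessible: $ref"
  fi
}

# Check every image reference passed on the command line, if any.
if [ "$#" -gt 0 ]; then
  for ref in "$@"; do
    check_tag "$ref"
  done
fi
```

Saved under a name of your choice, the script takes full image references as arguments, for example `nvcr.io/nvidia/cuda-dl-base:25.09-cuda12.9-devel-ubuntu24.04`. An UNKNOWN result means the check did not run and says nothing about whether the tag exists.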
ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
ARG RELEASE_BUILD
ARG RUNTIME_IMAGE="nvcr.io/nvidia/cuda"
-ARG RUNTIME_IMAGE_TAG="12.8.1-runtime-ubuntu24.04"
+ARG RUNTIME_IMAGE_TAG="12.9.1-runtime-ubuntu24.04"
🧩 Analysis chain
Same base/runtime tag check as vLLM; 25.01 + CUDA 12.9 likely invalid
Ensure ${BASE_IMAGE}:${BASE_IMAGE_TAG} actually exists; 25.01 probably doesn’t publish CUDA 12.9. Runtime tag 12.9.1-runtime-ubuntu24.04 is plausible but verify.
If you confirm a newer HPC track (e.g., 25.09) is required for 12.9, update both this file and vLLM to the same track to keep builds consistent.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
docker manifest inspect nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 >/dev/null \
&& echo "FOUND base tag" || echo "MISSING base tag"
docker manifest inspect nvcr.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04 >/dev/null \
&& echo "FOUND runtime tag" || echo "MISSING runtime tag"
Length of output: 126
Base/runtime tag mismatch — 25.01 doesn't ship CUDA 12.9; update both to a matching 25.xx track
Local manifest check couldn't run here (docker not installed); NVIDIA release notes show 25.01 images are built with CUDA 12.8 while CUDA 12.9.1 is provided in later CUDA‑DL releases (25.06). (docs.nvidia.com)
nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 is therefore likely invalid — change BASE_IMAGE_TAG to nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04 (or revert CUDA to 12.8 for 25.01) and ensure the runtime tag is on the same 25.xx track; 12.9.1 CUDA images are available on Docker Hub / NGC. (docker.aityp.com)
File: container/Dockerfile.sglang (lines 9–12).
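To keep the vLLM and SGLang Dockerfiles on the same track, as suggested above, the ARG defaults can be compared mechanically. This is a rough sketch under stated assumptions: `arg_default` and `same_arg` are hypothetical helpers, and the parsing assumes each ARG sits on its own line in the form `ARG NAME="value"`, as in the snippets quoted in this review.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: print the default value of the first
# `ARG NAME=...` line in a Dockerfile, with double quotes stripped.
arg_default() {
  local name="$1" file="$2"
  sed -n "/^ARG ${name}=/{s/^ARG ${name}=//;s/\"//g;p;q;}" "$file"
}

# Hypothetical helper: succeed only if every listed file declares the
# same default for the given ARG.
same_arg() {
  local name="$1"; shift
  local first="" value file seen=0
  for file in "$@"; do
    value="$(arg_default "$name" "$file")"
    if [ "$seen" -eq 0 ]; then
      first="$value"
      seen=1
    elif [ "$value" != "$first" ]; then
      echo "MISMATCH ${name}: ${file} has '${value}', expected '${first}'"
      return 1
    fi
  done
  echo "OK ${name}=${first}"
}

# Compare BASE_IMAGE_TAG across the Dockerfiles given as arguments, if any.
if [ "$#" -gt 0 ]; then
  same_arg BASE_IMAGE_TAG "$@"
fi
```

Passing `container/Dockerfile.vllm` and `container/Dockerfile.sglang` as arguments would flag a bump that lands in one file but not the other. It only compares strings and does not confirm that the tag is published.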
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Overview:
Testing CUDA 12.9