feat: Test cuda 12.9 #3064

dillon-cullinan · 2025-09-16T19:20:41Z

Overview:

Testing Cuda 12.9

Summary by CodeRabbit

Chores
- Upgraded container base and runtime images to CUDA 12.9.x across environments.
- Aligned CUDA command-line tools with the new runtime version.
- Improves compatibility with newer NVIDIA drivers and includes stability/security updates.
- No functional changes to the application; builds and deployments remain unchanged.
- Applies to all relevant container variants; no user configuration changes required.

Signed-off-by: Dillon Cullinan <[email protected]>

coderabbitai · 2025-09-16T19:27:59Z

Walkthrough

CUDA versions were incremented across container build assets. Dockerfile ARGs now reference CUDA 12.9/25.01 images instead of 12.8. In vLLM’s runtime stage, the apt package cuda-command-line-tools was updated from 12-8 to 12-9. No other logic, structure, or build steps changed.

Changes

| Cohort / File(s) | Summary of edits |
| --- | --- |
| CUDA base/runtime tag bumps: `container/Dockerfile`, `container/Dockerfile.sglang`, `container/Dockerfile.vllm` | Updated BASE_IMAGE_TAG from 25.01-cuda12.8-devel-ubuntu24.04 to 25.01-cuda12.9-devel-ubuntu24.04; updated RUNTIME_IMAGE_TAG in sglang and vllm from 12.8.1-runtime-ubuntu24.04 to 12.9.1-runtime-ubuntu24.04. |
| vLLM runtime tooling package: `container/Dockerfile.vllm` | Replaced apt package cuda-command-line-tools-12-8 with cuda-command-line-tools-12-9 in runtime stage. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

refactor: replace vllm with vllm_v1 container #1953 — Also modifies container/Dockerfile.vllm, refactoring vLLM install and related Dockerfile args/env/entrypoint, overlapping with image tag handling.

Poem

I hopped through layers, crisp and fine,
From 12.8 to 12.9 I twine—
New tags aligned, no fuss, no fray,
Tools updated for the CUDA way.
In shiny jars my images shine—
Build, run, and nibble—everything’s divine! 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request description is largely incomplete compared to the required template. The Overview section is present with "Testing Cuda 12.9," but the three other required sections are entirely missing: Details (describing the changes), Where should the reviewer start (calling out specific files), and Related Issues (with action keywords). The current description provides minimal context and fails to follow the structured format outlined in the repository template. | The author should expand the pull request description to include the missing sections. Add a Details section explaining that CUDA base images and runtime images are being updated from 12.8 to 12.9 across the three Dockerfiles, specify which files should be reviewed first, and link any related GitHub issues or discussions using action keywords like "Closes," "Fixes," or "Relates to." |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
| Title Check | ✅ Passed | The pull request title "feat: Test cuda 12.9" directly captures the main objective of the changeset. The PR updates CUDA versions from 12.8 to 12.9 across three Dockerfiles (container/Dockerfile, container/Dockerfile.sglang, and container/Dockerfile.vllm), and the title accurately reflects this primary change. The title is concise, clear, and specific enough that a teammate reviewing the history would understand the essential modification being made. |


coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e6e05 and 83ff5fa.

📒 Files selected for processing (3)

container/Dockerfile (1 hunks)
container/Dockerfile.sglang (1 hunks)
container/Dockerfile.vllm (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Build and Test - vllm
GitHub Check: Build and Test - sglang
GitHub Check: Build and Test - dynamo

🔇 Additional comments (2)

container/Dockerfile.vllm (2)

203-204: Good: update to cuda-command-line-tools-12-9 for DeepGEMM/cuobjdump

This matches the CUDA 12.9 runtime and unblocks cuobjdump/nvdisasm use in runtime. LGTM.
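One way to confirm the tools are actually present after the package bump is a small PATH check run inside the built runtime image. The sketch below is only an illustration: the helper name `missing_tools` is hypothetical and not part of this repository, and it assumes the package puts `cuobjdump` and `nvdisasm` on PATH.

```shell
#!/bin/sh
# Hypothetical helper (not from this repo): print every named tool
# that cannot be found on PATH, one per line. No output means all found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Inside the runtime image one might run:
#   missing_tools cuobjdump nvdisasm
# and treat any output as a failed check.
```

An empty result only shows that the binaries are discoverable; it does not show that they match the CUDA 12.9 runtime.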

10-15: Validate image tags and Torch backend compatibility

Image existence couldn't be confirmed in the sandbox (docker not installed). Verify that the registry tags used in this PR exist and update them if missing: nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 and nvcr.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04.

TORCH_BACKEND is pinned to cu128 and install_vllm.sh installs torch==2.7.1+cu128 (see container/deps/vllm/install_vllm.sh:33,137 and container/Dockerfile.vllm:18,145). Confirm cu128 wheels are compatible with a CUDA 12.9 toolchain; if your environment requires cu129, update TORCH_BACKEND and the pinned wheel versions.
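A small consistency check can make this kind of drift visible. The sketch below assumes the wheel suffix follows the `cu<major><minor>` pattern seen in `cu128`; the function names are hypothetical and do not exist in `install_vllm.sh`. A mismatch is a prompt to check compatibility, not proof that the build is broken.

```shell
#!/bin/sh
# Hypothetical helper: map a CUDA "major.minor" version to the
# wheel-suffix convention, e.g. 12.9 -> cu129.
cuda_to_backend() {
    printf 'cu%s\n' "$(printf '%s' "$1" | tr -d '.')"
}

# Succeeds (exit 0) only when the pinned backend ($1) matches the
# suffix derived from the toolkit version ($2).
backend_matches_cuda() {
    [ "$1" = "$(cuda_to_backend "$2")" ]
}

# With the values in this PR:
#   backend_matches_cuda cu128 12.9   -> non-zero exit (mismatch)
```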

Recommendation: add ARG CUDA_VERSION=12.9 and ARG CUDA_PKG_VER=12-9 and reference them in image tags/package names to reduce drift.
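The same idea can be sketched on the build-script side: derive every CUDA-dependent name from one version value so a bump touches a single line. The variable names `CUDA_PATCH`, `BASE_TRACK`, and `CLI_TOOLS_PKG` are assumptions made for illustration; only `BASE_IMAGE_TAG` and `RUNTIME_IMAGE_TAG` appear in the Dockerfiles under review.

```shell
#!/bin/sh
# Hypothetical single source of truth; values mirror this PR.
CUDA_VERSION="12.9"   # major.minor
CUDA_PATCH="1"        # patch level used by the runtime image tag
BASE_TRACK="25.01"    # cuda-dl-base release track used in this PR

# 12.9 -> 12-9, the form used in apt package names.
CUDA_PKG_VER=$(printf '%s' "$CUDA_VERSION" | tr '.' '-')

BASE_IMAGE_TAG="${BASE_TRACK}-cuda${CUDA_VERSION}-devel-ubuntu24.04"
RUNTIME_IMAGE_TAG="${CUDA_VERSION}.${CUDA_PATCH}-runtime-ubuntu24.04"
CLI_TOOLS_PKG="cuda-command-line-tools-${CUDA_PKG_VER}"

printf '%s\n' "$BASE_IMAGE_TAG" "$RUNTIME_IMAGE_TAG" "$CLI_TOOLS_PKG"
```

The derived tags could then be passed with `docker build --build-arg BASE_IMAGE_TAG=... --build-arg RUNTIME_IMAGE_TAG=...`. The package name would need a new ARG in `container/Dockerfile.vllm`, which this PR does not define.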

coderabbitai · 2025-09-16T19:28:02Z

container/Dockerfile

 # for details and reproducer to manually test if the image
 # can be updated to later versions.
-ARG BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"
+ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"


⚠️ Potential issue

🧩 Analysis chain

Tag likely invalid: cuda-dl-base 25.01 + CUDA 12.9 combo probably doesn’t exist

25.01 predates CUDA 12.9; nvcr.io/nvidia/cuda-dl-base is unlikely to publish 25.01-cuda12.9-*. This will 404 at build time. Align the HPC track to the first release that actually ships CUDA 12.9 (e.g., 25.09+), or drop the 25.01 constraint if you’re unblocking tests.

Run to confirm the exact valid tag (may require NGC auth):

Optional example fix (adjust to the tag you find):

-ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
+ARG BASE_IMAGE_TAG="25.09-cuda12.9-devel-ubuntu24.04"

🏁 Script executed:

#!/bin/bash
set -euo pipefail
REPO="nvcr.io/nvidia/cuda-dl-base"
for track in 25.01 25.05 25.07 25.08 25.09 25.10 25.11 25.12; do
  tag="${track}-cuda12.9-devel-ubuntu24.04"
  echo "Checking $REPO:$tag ..."
  if docker manifest inspect "$REPO:$tag" >/dev/null 2>&1; then
    echo "FOUND: $REPO:$tag"
  fi
done

Length of output: 591

Invalid base image tag — fix before merge

container/Dockerfile:15 — ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"

The supplied manifest check shows none of the tested nvcr.io/nvidia/cuda-dl-base:*-cuda12.9-devel-ubuntu24.04 tags exist (docker manifest inspect returned no matches). This will 404 at build time.

Action: replace the ARG with a published nvcr.io/nvidia/cuda-dl-base tag that actually includes CUDA 12.9 (or remove the 25.01 track constraint) and verify with docker manifest inspect or the NGC registry before merging.

🤖 Prompt for AI Agents

In container/Dockerfile around line 15, ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04" references a non-existent nvcr.io/nvidia/cuda-dl-base tag and will 404 at build time; replace that ARG with a published nvcr.io/nvidia/cuda-dl-base tag that includes CUDA 12.9 (or remove the "25.01" track constraint) and then verify the chosen tag exists by running docker manifest inspect nvcr.io/nvidia/cuda-dl-base:<TAG> (or check the NGC registry) before merging.

coderabbitai · 2025-09-16T19:28:02Z

container/Dockerfile.sglang

+ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
 ARG RELEASE_BUILD
 ARG RUNTIME_IMAGE="nvcr.io/nvidia/cuda"
-ARG RUNTIME_IMAGE_TAG="12.8.1-runtime-ubuntu24.04"
+ARG RUNTIME_IMAGE_TAG="12.9.1-runtime-ubuntu24.04"


⚠️ Potential issue

🧩 Analysis chain

Same base/runtime tag check as vLLM; 25.01 + CUDA 12.9 likely invalid

Ensure ${BASE_IMAGE}:${BASE_IMAGE_TAG} actually exists; 25.01 probably doesn’t publish CUDA 12.9. Runtime tag 12.9.1-runtime-ubuntu24.04 is plausible but verify.

If you confirm a newer HPC track (e.g., 25.09) is required for 12.9, update both this file and vLLM to the same track to keep builds consistent.

🏁 Script executed:

#!/bin/bash
set -euo pipefail
docker manifest inspect nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 >/dev/null \
  && echo "FOUND base tag" || echo "MISSING base tag"
docker manifest inspect nvcr.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04 >/dev/null \
  && echo "FOUND runtime tag" || echo "MISSING runtime tag"

Length of output: 126

Base/runtime tag mismatch — 25.01 doesn't ship CUDA 12.9; update both to a matching 25.xx track

Local manifest check couldn't run here (docker not installed); NVIDIA release notes show 25.01 images are built with CUDA 12.8 while CUDA 12.9.1 is provided in later CUDA‑DL releases (25.06). (docs.nvidia.com)

nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 is therefore likely invalid. Change BASE_IMAGE_TAG to 25.06-cuda12.9-devel-ubuntu24.04 (or revert CUDA to 12.8 for 25.01) and keep the runtime image's CUDA version aligned with the base image; 12.9.1 CUDA images are available on Docker Hub / NGC. (docker.aityp.com)

File: container/Dockerfile.sglang (lines 9–12).
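Because the manifest checks above ran where docker was unavailable, a missing tool and a missing tag produce the same output. The sketch below shows one way to keep those cases apart. `check_image` and the `INSPECT_CMD` override are hypothetical names introduced for illustration; the default command is the real `docker manifest inspect`.

```shell
#!/bin/sh
# Hypothetical helper: report FOUND, MISSING, or UNKNOWN for an image ref.
# INSPECT_CMD is an assumed override hook so the logic can be exercised
# without a container runtime.
check_image() {
    image="$1"
    inspect="${INSPECT_CMD:-docker manifest inspect}"
    tool="${inspect%% *}"   # first word of the command
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "UNKNOWN $image (no '$tool' on PATH)"
        return 2
    fi
    # Intentionally unquoted so the command splits into words.
    if $inspect "$image" >/dev/null 2>&1; then
        echo "FOUND $image"
    else
        # May also mean an auth or network failure, not only a missing tag.
        echo "MISSING $image"
        return 1
    fi
}

# Example:
#   check_image nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04
```

An UNKNOWN result means the check could not run at all, which is different from the tag being absent.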

github-actions · 2025-10-17T09:34:24Z

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

Test cuda 12.9

83ff5fa

Signed-off-by: Dillon Cullinan <[email protected]>

dillon-cullinan requested review from a team, alec-flowers, ishandhanani, nnshah1, ptarasiewiczNV, richardhuo-nv, rmccorm4 and tanmayv25 as code owners September 16, 2025 19:20

pull-request-size bot added the size/S label Sep 16, 2025

dillon-cullinan changed the title ~~Test cuda 12.9~~ feat: Test cuda 12.9 Sep 16, 2025

github-actions bot added the feat label Sep 16, 2025

dillon-cullinan marked this pull request as draft September 16, 2025 19:21

coderabbitai bot reviewed Sep 16, 2025

github-actions bot added Stale and removed Stale labels Oct 17, 2025
