@dillon-cullinan
Contributor

@dillon-cullinan dillon-cullinan commented Sep 16, 2025

Overview:

Testing Cuda 12.9

Summary by CodeRabbit

  • Chores
    • Upgraded container base and runtime images to CUDA 12.9.x across environments.
    • Aligned CUDA command-line tools with the new runtime version.
    • Improves compatibility with newer NVIDIA drivers and includes stability/security updates.
    • No functional changes to the application; builds and deployments remain unchanged.
    • Applies to all relevant container variants; no user configuration changes required.

Signed-off-by: Dillon Cullinan <[email protected]>
@dillon-cullinan dillon-cullinan changed the title Test cuda 12.9 feat: Test cuda 12.9 Sep 16, 2025
@github-actions github-actions bot added the feat label Sep 16, 2025
@dillon-cullinan dillon-cullinan marked this pull request as draft September 16, 2025 19:21
@coderabbitai
Contributor

coderabbitai bot commented Sep 16, 2025

Walkthrough

CUDA versions were incremented across container build assets. Dockerfile ARGs now reference CUDA 12.9/25.01 images instead of 12.8. In vLLM’s runtime stage, the apt package cuda-command-line-tools was updated from 12-8 to 12-9. No other logic, structure, or build steps changed.

Changes

| Cohort / File(s) | Summary of edits |
| --- | --- |
| CUDA base/runtime tag bumps: container/Dockerfile, container/Dockerfile.sglang, container/Dockerfile.vllm | Updated BASE_IMAGE_TAG from 25.01-cuda12.8-devel-ubuntu24.04 to 25.01-cuda12.9-devel-ubuntu24.04; updated RUNTIME_IMAGE_TAG in sglang and vllm from 12.8.1-runtime-ubuntu24.04 to 12.9.1-runtime-ubuntu24.04. |
| vLLM runtime tooling package: container/Dockerfile.vllm | Replaced apt package cuda-command-line-tools-12-8 with cuda-command-line-tools-12-9 in runtime stage. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Poem

I hopped through layers, crisp and fine,
From 12.8 to 12.9 I twine—
New tags aligned, no fuss, no fray,
Tools updated for the CUDA way.
In shiny jars my images shine—
Build, run, and nibble—everything’s divine! 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ⚠️ Warning | The pull request description is largely incomplete compared to the required template. While the Overview section is present with "Testing Cuda 12.9," the three other required sections (Details, describing the changes; Where should the reviewer start, calling out specific files; and Related Issues, with action keywords) are entirely missing. The current description provides minimal context and fails to follow the structured format outlined in the repository template. | The author should expand the pull request description to include the missing sections. Add a Details section explaining that CUDA base images and runtime images are being updated from 12.8 to 12.9 across the three Dockerfiles, specify which files should be reviewed first, and link any related GitHub issues or discussions using action keywords like "Closes," "Fixes," or "Relates to." |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
| Title Check | ✅ Passed | The pull request title "feat: Test cuda 12.9" directly captures the main objective of the changeset. The PR updates CUDA versions from 12.8 to 12.9 across three Dockerfiles (container/Dockerfile, container/Dockerfile.sglang, and container/Dockerfile.vllm), and the title accurately reflects this primary change. The title is concise, clear, and specific enough that a teammate reviewing the history would understand the essential modification being made. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87e6e05 and 83ff5fa.

📒 Files selected for processing (3)
  • container/Dockerfile (1 hunks)
  • container/Dockerfile.sglang (1 hunks)
  • container/Dockerfile.vllm (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build and Test - vllm
  • GitHub Check: Build and Test - sglang
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
container/Dockerfile.vllm (2)

203-204: Good: update to cuda-command-line-tools-12-9 for DeepGEMM/cuobjdump

This matches the CUDA 12.9 runtime and unblocks cuobjdump/nvdisasm use in runtime. LGTM.


10-15: Validate image tags and Torch backend compatibility

  • Image existence couldn't be confirmed in the sandbox (docker not installed). Verify these registry tags exist and update if missing: nvcr.io/nvidia/cuda-dl-base:25.09-cuda12.9-devel-ubuntu24.04 and nvcr.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04.
  • TORCH_BACKEND is pinned to cu128 and install_vllm.sh installs torch==2.7.1+cu128 (see container/deps/vllm/install_vllm.sh:33,137 and container/Dockerfile.vllm:18,145). Confirm cu128 wheels are compatible with a CUDA 12.9 toolchain; if your environment requires cu129, update TORCH_BACKEND and the pinned wheel versions.
  • Recommendation: add ARG CUDA_VERSION=12.9 and ARG CUDA_PKG_VER=12-9 and reference them in image tags/package names to reduce drift.
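
The recommendation in the last bullet can be sketched in shell as follows. This is a minimal illustration, not code from the PR: it assumes a single CUDA_VERSION value is the source of truth and derives the apt package suffix and the torch backend name that would match it. The variable names TORCH_BACKEND_EXPECTED and APT_PACKAGE are hypothetical; CUDA_VERSION, CUDA_PKG_VER, and TORCH_BACKEND come from the comment above.

```shell
#!/bin/bash
# Sketch only: derive CUDA-dependent names from one version string so the
# image tags, apt package names, and torch backend cannot drift apart.

CUDA_VERSION="12.9"     # single source of truth (ARG name suggested in the review)
TORCH_BACKEND="cu128"   # value currently pinned, per the review comment

# 12.9 -> 12-9, the suffix used by apt packages such as cuda-command-line-tools-12-9
CUDA_PKG_VER="$(printf '%s' "$CUDA_VERSION" | tr '.' '-')"

# 12.9 -> cu129, the backend name that would match this toolkit version
# (hypothetical variable, used here only for the comparison below)
TORCH_BACKEND_EXPECTED="cu$(printf '%s' "$CUDA_VERSION" | tr -d '.')"

APT_PACKAGE="cuda-command-line-tools-${CUDA_PKG_VER}"

echo "apt package:   ${APT_PACKAGE}"
echo "torch backend: pinned=${TORCH_BACKEND} matching=${TORCH_BACKEND_EXPECTED}"

# A mismatch is not necessarily an error; it only flags that compatibility
# between the pinned wheels and the toolkit should be confirmed.
if [ "$TORCH_BACKEND" != "$TORCH_BACKEND_EXPECTED" ]; then
  echo "NOTE: torch backend differs from the CUDA toolkit version; confirm compatibility."
fi
```

With the values shown, the script reports the 12-9 package name and notes that cu128 differs from cu129. Whether that difference matters is the open question raised in the second bullet.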

# for details and reproducer to manually test if the image
# can be updated to later versions.
-ARG BASE_IMAGE_TAG="25.01-cuda12.8-devel-ubuntu24.04"
+ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
⚠️ Potential issue

🧩 Analysis chain

Tag likely invalid: cuda-dl-base 25.01 + CUDA 12.9 combo probably doesn’t exist

25.01 predates CUDA 12.9; nvcr.io/nvidia/cuda-dl-base is unlikely to publish 25.01-cuda12.9-*. This will 404 at build time. Align the HPC track to the first release that actually ships CUDA 12.9 (e.g., 25.09+), or drop the 25.01 constraint if you’re unblocking tests.

Run the script shown under "Script executed" below to confirm the exact valid tag (may require NGC auth):

Optional example fix (adjust to the tag you find):

-ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
+ARG BASE_IMAGE_TAG="25.09-cuda12.9-devel-ubuntu24.04"

🏁 Script executed:

#!/bin/bash
set -euo pipefail
REPO="nvcr.io/nvidia/cuda-dl-base"
for track in 25.01 25.05 25.07 25.08 25.09 25.10 25.11 25.12; do
  tag="${track}-cuda12.9-devel-ubuntu24.04"
  echo "Checking $REPO:$tag ..."
  if docker manifest inspect "$REPO:$tag" >/dev/null 2>&1; then
    echo "FOUND: $REPO:$tag"
  fi
done

Length of output: 591


Invalid base image tag — fix before merge

container/Dockerfile:15 — ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"

The supplied manifest check shows none of the tested nvcr.io/nvidia/cuda-dl-base:*-cuda12.9-devel-ubuntu24.04 tags exist (docker manifest inspect returned no matches). This will 404 at build time.

Action: replace the ARG with a published nvcr.io/nvidia/cuda-dl-base tag that actually includes CUDA 12.9 (or remove the 25.01 track constraint) and verify with docker manifest inspect or the NGC registry before merging.

🤖 Prompt for AI Agents
In container/Dockerfile around line 15, ARG
BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04" references a non-existent
nvcr.io/nvidia/cuda-dl-base tag and will 404 at build time; replace that ARG
with a published nvcr.io/nvidia/cuda-dl-base tag that includes CUDA 12.9 (or
remove the "25.01" track constraint) and then verify the chosen tag exists by
running docker manifest inspect nvcr.io/nvidia/cuda-dl-base:<TAG> (or check the
NGC registry) before merging.

Comment on lines +9 to +12
ARG BASE_IMAGE_TAG="25.01-cuda12.9-devel-ubuntu24.04"
ARG RELEASE_BUILD
ARG RUNTIME_IMAGE="nvcr.io/nvidia/cuda"
-ARG RUNTIME_IMAGE_TAG="12.8.1-runtime-ubuntu24.04"
+ARG RUNTIME_IMAGE_TAG="12.9.1-runtime-ubuntu24.04"
⚠️ Potential issue

🧩 Analysis chain

Same base/runtime tag check as vLLM; 25.01 + CUDA 12.9 likely invalid

Ensure ${BASE_IMAGE}:${BASE_IMAGE_TAG} actually exists; 25.01 probably doesn’t publish CUDA 12.9. Runtime tag 12.9.1-runtime-ubuntu24.04 is plausible but verify.

If you confirm a newer HPC track (e.g., 25.09) is required for 12.9, update both this file and vLLM to the same track to keep builds consistent.


🏁 Script executed:

#!/bin/bash
set -euo pipefail
docker manifest inspect nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 >/dev/null \
  && echo "FOUND base tag" || echo "MISSING base tag"
docker manifest inspect nvcr.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04 >/dev/null \
  && echo "FOUND runtime tag" || echo "MISSING runtime tag"

Length of output: 126


Base/runtime tag mismatch — 25.01 doesn't ship CUDA 12.9; update both to a matching 25.xx track

Local manifest check couldn't run here (docker not installed); NVIDIA release notes show 25.01 images are built with CUDA 12.8 while CUDA 12.9.1 is provided in later CUDA‑DL releases (25.06). (docs.nvidia.com)

nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 is therefore likely invalid — change BASE_IMAGE_TAG to nvcr.io/nvidia/cuda-dl-base:25.06-cuda12.9-devel-ubuntu24.04 (or revert CUDA to 12.8 for 25.01) and ensure the runtime tag is on the same 25.xx track; 12.9.1 CUDA images are available on Docker Hub / NGC. (docker.aityp.com)

File: container/Dockerfile.sglang (lines 9–12).
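
Because the check above could not run without docker, a "MISSING" result from it is ambiguous. The sketch below shows one possible way to separate the three outcomes: tag found, tag missing or inaccessible, and docker unavailable. The function name check_tag is hypothetical; the only CLI call it relies on is docker manifest inspect, the same one used in the scripts in this review. A private registry may also need authentication, which this sketch reports the same way as a missing tag.

```shell
#!/bin/bash
# Sketch only: report whether an image tag exists, without confusing
# "docker is not installed" with "the tag does not exist".

# check_tag IMAGE
#   returns 0 and prints FOUND if the manifest can be inspected,
#   returns 1 if the inspect call fails (tag missing or not accessible),
#   returns 2 if no docker command is available at all.
check_tag() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "UNKNOWN (docker not available): $1"
    return 2
  fi
  if docker manifest inspect "$1" >/dev/null 2>&1; then
    echo "FOUND: $1"
    return 0
  fi
  echo "MISSING or inaccessible: $1"
  return 1
}

# Check every image reference passed on the command line; keep going on failure.
for image in "$@"; do
  check_tag "$image" || true
done
```

For example, saving this as a script and passing nvcr.io/nvidia/cuda-dl-base:25.01-cuda12.9-devel-ubuntu24.04 and nvcr.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04 as arguments would print one status line per image.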

@github-actions

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added Stale and removed Stale labels Oct 17, 2025