feat: Docker Blackwell / Grace Blackwell container support #2134
sahgerlad wants to merge 5 commits into NVIDIA-NeMo:main from
Conversation
- Set TRITON_PTXAS_PATH to CUDA toolkit ptxas
- Extend TORCH_CUDA_ARCH_LIST to 9.0 10.0 10.3 (Hopper + Blackwell)
- Symlink ptxas-blackwell stubs in ray/nemo venvs to system ptxas

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
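Taken together, the three changes could look roughly like the following Dockerfile fragment. This is a sketch assembled from the paths discussed in the review (`/opt/ray_venvs`, `/opt/nemo_rl_venv`, `/usr/local/cuda/bin/ptxas`), not the PR's exact diff:

```dockerfile
# Point Triton at the CUDA toolkit assembler, which supports sm_103
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Build CUDA extensions for Hopper (9.0) and Blackwell (10.0, 10.3)
ENV TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"

# Replace bundled ptxas-blackwell stubs in the prefetched venvs
# with symlinks to the system ptxas
RUN find /opt/ray_venvs /opt/nemo_rl_venv -name "ptxas-blackwell" \
    -exec ln -sf /usr/local/cuda/bin/ptxas {} \;
```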
Force-pushed from 049965b to 3cd7284
📝 Walkthrough

Updated the Docker build configuration to add support for NVIDIA GPU architecture 10.3 and configure the Triton compiler path. Added an environment variable for TRITON_PTXAS_PATH and created symlinks for ptxas-blackwell binaries in Python virtual environment directories.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (1)
docker/Dockerfile (1)
201-201: Harden the `find` + symlink step against missing directories.

Line 201 can fail the build if either `/opt/ray_venvs` or `/opt/nemo_rl_venv` is absent. Guarding per-directory makes this step resilient.

♻️ Suggested hardening

```diff
-RUN find /opt/ray_venvs /opt/nemo_rl_venv -name "ptxas-blackwell" -exec ln -sf /usr/local/cuda/bin/ptxas {} \;
+RUN for d in /opt/ray_venvs /opt/nemo_rl_venv; do \
+      [ -d "$d" ] || continue; \
+      find "$d" -type f -name "ptxas-blackwell" -exec ln -sf /usr/local/cuda/bin/ptxas {} +; \
+    done
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/Dockerfile` at line 201, The RUN that looks for "ptxas-blackwell" and symlinks /usr/local/cuda/bin/ptxas should be made resilient to missing directories: instead of running a single find over both /opt/ray_venvs and /opt/nemo_rl_venv, guard each directory with a test and only run find when the directory exists (e.g., check [ -d "/opt/ray_venvs" ] && find ... and likewise for /opt/nemo_rl_venv), so the ptxas-blackwell symlink step (the RUN invoking find for "ptxas-blackwell" and creating a symlink to /usr/local/cuda/bin/ptxas) will not fail the build if either directory is absent.
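The guarded per-directory pattern can be exercised outside Docker on a throwaway tree. In this sketch the `/tmp` paths stand in for `/opt/ray_venvs`, `/opt/nemo_rl_venv`, and the CUDA `ptxas` binary, and one directory is deliberately missing to show the guard working:

```shell
#!/bin/sh
set -e

root=$(mktemp -d)
mkdir -p "$root/ray_venvs/py312/bin"
touch "$root/ray_venvs/py312/bin/ptxas-blackwell"   # fake bundled stub
touch "$root/ptxas"                                 # stands in for /usr/local/cuda/bin/ptxas
# Note: "$root/nemo_rl_venv" is deliberately NOT created.

# Guarded version: skip directories that do not exist instead of failing.
# Per-file -exec (\;) avoids ln treating the last of multiple operands as a directory.
for d in "$root/ray_venvs" "$root/nemo_rl_venv"; do
    [ -d "$d" ] || continue
    find "$d" -type f -name "ptxas-blackwell" -exec ln -sf "$root/ptxas" {} \;
done

# The stub is now a symlink to the stand-in system assembler.
readlink "$root/ray_venvs/py312/bin/ptxas-blackwell"
```

With `set -e`, the missing `nemo_rl_venv` directory no longer aborts the run; the loop simply skips it.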
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docker/Dockerfile`:
- Around line 113-114: Update the Dockerfile's TORCH_CUDA_ARCH_LIST environment
variable by removing the unsupported "10.3" entry: locate the ENV line setting
TORCH_CUDA_ARCH_LIST and change its value from "9.0 10.0 10.3" to "9.0 10.0" so
CUDA extension builds use only supported architectures for the bundled
PyTorch/CUDA toolchain.
---
Nitpick comments:
In `@docker/Dockerfile`:
- Line 201: The RUN that looks for "ptxas-blackwell" and symlinks
/usr/local/cuda/bin/ptxas should be made resilient to missing directories:
instead of running a single find over both /opt/ray_venvs and /opt/nemo_rl_venv,
guard each directory with a test and only run find when the directory exists
(e.g., check [ -d "/opt/ray_venvs" ] && find ... and likewise for
/opt/nemo_rl_venv), so the ptxas-blackwell symlink step (the RUN invoking find
for "ptxas-blackwell" and creating a symlink to /usr/local/cuda/bin/ptxas) will
not fail the build if either directory is absent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 839a1d33-99ef-4d16-9b37-d6ec2cc4e9cf
📒 Files selected for processing (1)
docker/Dockerfile
```dockerfile
# Ensure DeepEP is built for hopper and (grace)blackwell (also mcore inference unified memory API now invokes a torch API that requires these to be set)
ENV TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"
```
🧩 Analysis chain
🌐 Web query:
For the exact PyTorch version used by NVIDIA-NeMo/RL (from uv.lock) and CUDA 12.9 base images, is TORCH_CUDA_ARCH_LIST="10.3" (sm_103) officially supported for CUDA extension compilation? Please provide the official PyTorch/CUDA documentation or release-note link.
💡 Result:
No, TORCH_CUDA_ARCH_LIST="10.3" (sm_103) is not officially supported for CUDA extension compilation in the PyTorch version used by NVIDIA-NeMo/RL (likely PyTorch 2.8.0 from uv.lock patterns and CUDA 12.9 Docker images like pytorch/pytorch:2.8.0-cuda12.9-cudnn9-devel). PyTorch 2.8 CUDA 12.9 builds support up to sm_121 (Blackwell), but sm_103 (GB300/B300, compute capability 10.3) requires CUDA 12.9 compiler support which PyTorch 2.8 may not include in its default TORCH_CUDA_ARCH_LIST for extensions. Official PyTorch docs do not explicitly list sm_103; use torch.cuda.get_arch_list to verify built arches, but extension compilation defaults exclude newest arches until explicitly added in future releases. CUDA 12.9 release notes confirm sm_103 compiler target support, but PyTorch binaries lag (e.g., PyTorch 2.8 RFCs discuss cu12.9 as prototype).
Citations:
- 1: https://docs.nvidia.com/cuda/archive/12.9.0/cuda-toolkit-release-notes/index.html
- 2: https://developer.nvidia.com/cuda/gpus
- 3: https://hub.docker.com/layers/pytorch/pytorch/2.8.0-cuda12.9-cudnn9-devel/images/sha256-cf5aa3f7045a68c10d80f546746591c5ccae6a33729e5e32625ff76bd2c036fe
- 4: [RFC] Cuda support matrix for Release 2.9 pytorch/pytorch#159980
- 5: [RFC] Cuda support matrix for Release 2.8 pytorch/pytorch#154257
- 6: https://docs.pytorch.org/docs/2.8/backends.html
- 7: https://pytorch.org/docs/stable/generated/torch.cuda.get_arch_list.html
- 8: https://pytorch.org/blog/pytorch-2-8/
- 9: https://pytorch.org/docs/stable/cuda_environment_variables.html
🏁 Script executed:
# First, find the Dockerfile and understand the context
find . -name "Dockerfile" -type f | head -5

Repository: NVIDIA-NeMo/RL
Length of output: 77
🏁 Script executed:
# Look for uv.lock or other dependency files to determine PyTorch version
find . -name "uv.lock" -o -name "requirements*.txt" -o -name "pyproject.toml" | head -10

Repository: NVIDIA-NeMo/RL
Length of output: 227
🏁 Script executed:
# Read the Dockerfile to see the full context and PyTorch base image
cat -n docker/Dockerfile | head -150

Repository: NVIDIA-NeMo/RL
Length of output: 7509
🏁 Script executed:
# Find PyTorch version in uv.lock
grep -A 5 "name = \"torch\"" uv.lock | head -20

Repository: NVIDIA-NeMo/RL
Length of output: 2893
🌐 Web query:
PyTorch 2.9.0 CUDA 12.9 TORCH_CUDA_ARCH_LIST sm_103 (compute capability 10.3) support - is it officially supported for CUDA extension compilation?
💡 Result:
No, sm_103 (compute capability 10.3) is not officially supported for CUDA extension compilation in PyTorch 2.9.0 with CUDA 12.9 by default. PyTorch 2.9.0 provides official wheels for CUDA 12.6, 12.8, and 13.0. Compute capability 10.3 corresponds to NVIDIA Blackwell GPUs like GB300. PyTorch binaries are compiled for a standard set of architectures up to sm_90 (Hopper H100), with newer architectures like sm_103 requiring explicit inclusion via TORCH_CUDA_ARCH_LIST="10.3" when building custom CUDA extensions from source. Official binaries do not include sm_103 PTX or cubins, so extensions must set this environment variable during compilation to generate compatible code. CUDA 12.9 is mentioned in release discussions but deprecated in favor of 12.8/13.0; compatibility depends on the installed toolkit version.
Citations:
- 1: https://github.com/pytorch/pytorch/releases/tag/v2.9.0
- 2: [RFC] Cuda support matrix for Release 2.9 pytorch/pytorch#159980
- 3: https://pytorch.org/blog/pytorch-2-9/
- 4: https://pytorch.org/get-started/previous-versions/
- 5: https://docs.pytorch.org/docs/2.9/generated/torch.cuda.get_arch_list.html
- 6: Allow sm90a in TORCH_CUDA_ARCH_LIST pytorch/pytorch#125413
- 7: Unknown CUDA Architecture Name 9.0a in CUDA_SELECT_NVCC_ARCH_FLAGS (compiling from source) pytorch/pytorch#113948
- 8: CUDA Capability sm_90 for H100 GPUs pytorch/pytorch#90761
- 9: https://docs.nvidia.com/cuda/cuda-programming-guide/05-appendices/compute-capabilities.html
Remove 10.3 from TORCH_CUDA_ARCH_LIST in the Dockerfile.
PyTorch 2.9.0 with CUDA 12.9 does not officially support sm_103 (compute capability 10.3). Official PyTorch binaries support only up to sm_90 (Hopper) by default, and sm_103 (Blackwell GB300) is not included in the standard toolchain. Setting TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3" will cause CUDA extension builds to fail when the PyTorch 2.9.0 toolchain cannot generate compatible code for this unsupported architecture. Use ENV TORCH_CUDA_ARCH_LIST="9.0 10.0" instead for compatibility with the bundled PyTorch/CUDA version.
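As the web-query result notes, `torch.cuda.get_arch_list()` reports which architectures the installed wheel was actually built for. The comparison against a `TORCH_CUDA_ARCH_LIST`-style string can be sketched without torch; the built-arch set below is illustrative, not taken from any specific wheel:

```python
def filter_arch_list(requested: str, built_arches: list[str]) -> str:
    """Keep only requested compute capabilities the wheel was built for.

    `requested` uses TORCH_CUDA_ARCH_LIST syntax, e.g. "9.0 10.0 10.3";
    `built_arches` mimics torch.cuda.get_arch_list(), e.g. ["sm_90", "sm_100"].
    """
    built = {a.removeprefix("sm_") for a in built_arches}
    kept = [cc for cc in requested.split() if cc.replace(".", "") in built]
    return " ".join(kept)

# Illustrative built-arch set; a real check would call torch.cuda.get_arch_list().
print(filter_arch_list("9.0 10.0 10.3", ["sm_80", "sm_90", "sm_100"]))  # → 9.0 10.0
```

Running this against the real `get_arch_list()` output on the target image is a quick way to confirm whether `10.3` is usable before a long extension build.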
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/Dockerfile` around lines 113 - 114, Update the Dockerfile's
TORCH_CUDA_ARCH_LIST environment variable by removing the unsupported "10.3"
entry: locate the ENV line setting TORCH_CUDA_ARCH_LIST and change its value
from "9.0 10.0 10.3" to "9.0 10.0" so CUDA extension builds use only supported
architectures for the bundled PyTorch/CUDA toolchain.
Keeping 10.3 for Grace Blackwell / GB coverage alongside 10.0; the image uses CUDA 12.9 and we verified the build and runtime on B300 with this list. Happy to narrow if CI shows an unsupported-arch error from PyTorch/nvcc.
@sahgerlad, that context makes sense — thank you for clarifying. CUDA 12.9 first introduced sm_103 (B300/GB300) arch support, where sm_103 binaries are compatible with sm_100, and the PR already includes the necessary mitigations for the known Triton/ptxas limitation: the standard Triton PTXAS bundled with PyTorch is not compiled for sm_103, but this can be bypassed by pointing to an external ptxas binary that supports the target GPU architecture — which is exactly what the TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas env var and the ptxas-blackwell symlinks in this PR accomplish. Happy to withdraw the concern given the hardware-verified build.
✏️ Learnings added
Learnt from: sahgerlad
Repo: NVIDIA-NeMo/RL PR: 2134
File: docker/Dockerfile:113-114
Timestamp: 2026-03-21T03:52:47.988Z
Learning: In `docker/Dockerfile` for NVIDIA-NeMo/RL, `TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"` intentionally includes `10.3` for NVIDIA B300/GB300 (Grace Blackwell) hardware support with CUDA 12.9. The PR pairs this with `TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` and `ptxas-blackwell` symlinks inside the ray/nemo venvs to work around the known issue where Triton's bundled PTXAS does not support sm_103. The maintainer (sahgerlad) has verified this configuration builds and runs correctly on B300 hardware. Do not flag `10.3` in this list as unsupported.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-02-03T02:09:51.636Z
Learning: Applies to Dockerfile.ngc_pytorch : Exception to uv guidelines: `Dockerfile.ngc_pytorch` is exempt from the requirement to use `uv run`
Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
Force-pushed from fd3c0f7 to 2c32624
@terrykong could you help review this?
terrykong left a comment:
Thanks @sahgerlad. is there a workload that can test this PR? Is this for GB300? I thought that needed cuda 13, but we haven't moved to cuda 13 yet
Hi @terrykong, this change targets Blackwell B300 (

We had issues when trying to compile
terrykong left a comment:
@sahgerlad can you also update these error messages?
RL/nemo_rl/models/policy/lm_policy.py, lines 100 to 122 in 9feb4b0
@sahgerlad that's good to know that gb300 works even on our current setup. Can you share a run to show that it works (put it in the PR description)? Also, I created an issue to track the sglang bump for gb300: #2144
Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@terrykong
What does this PR do?
Improve the default docker/Dockerfile so NeMo RL images build and run correctly on Hopper and Grace Blackwell (including B300-class hardware) by pinning Triton to the CUDA toolkit ptxas, extending TORCH_CUDA_ARCH_LIST with Blackwell SMs, and symlinking ptxas-blackwell stubs in prefetched venvs to the system assembler.
Issues
List issues that this PR closes (syntax):
Usage
Rebuild the image from this branch (same as upstream docs; no new Python API):
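The exact command is in the upstream docs; a typical invocation would look something like the following, where the image tag and build context are placeholders, not values from this PR:

```
docker build -f docker/Dockerfile -t nemo-rl:blackwell .
```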
Before your PR is "Ready for review"
Pre checks:
Additional Information
ray-driver.log
Summary by CodeRabbit