feat: Docker Blackwell / Grace Blackwell container support#2134

Open
sahgerlad wants to merge 5 commits into NVIDIA-NeMo:main from sahgerlad:feat/docker-blackwell-support

Conversation

@sahgerlad
Contributor

@sahgerlad sahgerlad commented Mar 21, 2026

What does this PR do ?

Improve the default docker/Dockerfile so NeMo RL images build and run correctly on Hopper and Grace Blackwell hardware (including B300-class GPUs): pin Triton to the CUDA toolkit ptxas, extend TORCH_CUDA_ARCH_LIST with the Blackwell SMs, and symlink the ptxas-blackwell stubs in the prefetched venvs to the system assembler.

  • Set TRITON_PTXAS_PATH to CUDA toolkit ptxas
  • Extend TORCH_CUDA_ARCH_LIST to 9.0 10.0 10.3 (Hopper + Blackwell)
  • Symlink ptxas-blackwell stubs in ray/nemo venvs to system ptxas
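Concretely, the three bullets map to Dockerfile changes along these lines (a sketch reconstructed from the discussion in this thread; the exact layer placement in docker/Dockerfile may differ):

```dockerfile
# Route Triton's JIT compilation to the CUDA toolkit assembler, which supports sm_103
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Build CUDA extensions for Hopper (sm_90) and Blackwell (sm_100, sm_103)
ENV TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"

# Point any prefetched ptxas-blackwell stubs at the system ptxas
RUN find /opt/ray_venvs /opt/nemo_rl_venv -name "ptxas-blackwell" \
      -exec ln -sf /usr/local/cuda/bin/ptxas {} \;
```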

Issues

List issues that this PR closes (syntax):

Usage

Rebuild the image from this branch (same as upstream docs; no new Python API):

# Example: build with local NeMo RL checkout
docker buildx build --build-context nemo-rl=. -f docker/Dockerfile \
  -t nemo-rl:blackwell-test --load .

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

ray-driver.log

Summary by CodeRabbit

  • Chores
    • Updated container configuration to support additional GPU architectures for improved hardware compatibility.
    • Enhanced GPU tooling integration for optimized performance across virtual environments.

@sahgerlad sahgerlad requested a review from a team as a code owner March 21, 2026 03:37
@copy-pr-bot

copy-pr-bot bot commented Mar 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Set TRITON_PTXAS_PATH to CUDA toolkit ptxas
- Extend TORCH_CUDA_ARCH_LIST to 9.0 10.0 10.3 (Hopper + Blackwell)
- Symlink ptxas-blackwell stubs in ray/nemo venvs to system ptxas

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad force-pushed the feat/docker-blackwell-support branch from 049965b to 3cd7284 Compare March 21, 2026 03:38
@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

📝 Walkthrough

Walkthrough

Updated the Docker build configuration to add support for NVIDIA GPU architecture 10.3 and configure the Triton compiler path. Added environment variable for TRITON_PTXAS_PATH and created symlinks for ptxas-blackwell binaries in Python virtual environment directories.

Changes

Cohort / File(s) Summary
CUDA and Triton Configuration
docker/Dockerfile
Added TRITON_PTXAS_PATH environment variable pointing to CUDA ptxas binary; expanded TORCH_CUDA_ARCH_LIST to include GPU architecture 10.3; added post-build step to create symlinks for ptxas-blackwell in ray and nemo-rl virtual environment directories.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
  • Test Results For Major Changes (⚠️ Warning): the PR adds support for new hardware architectures (Blackwell/Grace Blackwell) but lacks test results, build validation, or evidence of successful builds. Resolution: include Docker build validation results, evidence the image builds successfully, resolution of the PyTorch 2.9.0/sm_103 compatibility issue, and hardware-specific test results before merging.

✅ Passed checks (3 passed)
  • Title check (✅ Passed): the title 'feat: Docker Blackwell / Grace Blackwell container support' directly and accurately reflects the main changes in the PR, which add Blackwell and Grace Blackwell GPU architecture support to the Dockerfile.
  • Docstring Coverage (✅ Passed): no functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
docker/Dockerfile (1)

201-201: Harden the find + symlink step against missing directories.

Line 201 can fail the build if either /opt/ray_venvs or /opt/nemo_rl_venv is absent. Guarding per-directory makes this step resilient.

♻️ Suggested hardening
-RUN find /opt/ray_venvs /opt/nemo_rl_venv -name "ptxas-blackwell" -exec ln -sf /usr/local/cuda/bin/ptxas {} \;
+RUN for d in /opt/ray_venvs /opt/nemo_rl_venv; do \
+      [ -d "$d" ] || continue; \
+      find "$d" -type f -name "ptxas-blackwell" -exec ln -sf /usr/local/cuda/bin/ptxas {} \; ; \
+    done
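The guarded loop can be exercised outside Docker. The sketch below uses throwaway /tmp paths as stand-ins for the real venv directories and the system ptxas; one of the two directories is deliberately missing to show that the guard skips it instead of failing:

```shell
rm -rf /tmp/ptxas-demo
mkdir -p /tmp/ptxas-demo/ray_venvs/venv1/bin
touch /tmp/ptxas-demo/ray_venvs/venv1/bin/ptxas-blackwell   # stub to be replaced
touch /tmp/ptxas-demo/ptxas                                 # stand-in for /usr/local/cuda/bin/ptxas

# /tmp/ptxas-demo/nemo_rl_venv does not exist: the [ -d ... ] guard skips it
for d in /tmp/ptxas-demo/ray_venvs /tmp/ptxas-demo/nemo_rl_venv; do
  [ -d "$d" ] || continue
  find "$d" -type f -name "ptxas-blackwell" -exec ln -sf /tmp/ptxas-demo/ptxas {} \;
done

readlink /tmp/ptxas-demo/ray_venvs/venv1/bin/ptxas-blackwell   # -> /tmp/ptxas-demo/ptxas
```

Note that `-exec … {} \;` runs `ln` once per match; the `{} +` form would pass several link names to a single `ln` invocation, which fails unless the last operand is a directory.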

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 839a1d33-99ef-4d16-9b37-d6ec2cc4e9cf

📥 Commits

Reviewing files that changed from the base of the PR and between 2c1e5e0 and 3cd7284.

📒 Files selected for processing (1)
  • docker/Dockerfile

Comment on lines +113 to +114
# Ensure DeepEP is built for hopper and (grace)blackwell (also mcore inference unified memory API now invokes a torch API that requires these to be set)
ENV TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"
Contributor

@coderabbitai coderabbitai bot Mar 21, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For the exact PyTorch version used by NVIDIA-NeMo/RL (from uv.lock) and CUDA 12.9 base images, is TORCH_CUDA_ARCH_LIST="10.3" (sm_103) officially supported for CUDA extension compilation? Please provide the official PyTorch/CUDA documentation or release-note link.

💡 Result:

No, TORCH_CUDA_ARCH_LIST="10.3" (sm_103) is not officially supported for CUDA extension compilation in the PyTorch version used by NVIDIA-NeMo/RL (likely PyTorch 2.8.0 from uv.lock patterns and CUDA 12.9 Docker images like pytorch/pytorch:2.8.0-cuda12.9-cudnn9-devel). PyTorch 2.8 CUDA 12.9 builds support up to sm_121 (Blackwell), but sm_103 (GB300/B300, compute capability 10.3) requires CUDA 12.9 compiler support which PyTorch 2.8 may not include in its default TORCH_CUDA_ARCH_LIST for extensions. Official PyTorch docs do not explicitly list sm_103; use torch.cuda.get_arch_list to verify built arches, but extension compilation defaults exclude newest arches until explicitly added in future releases. CUDA 12.9 release notes confirm sm_103 compiler target support, but PyTorch binaries lag (e.g., PyTorch 2.8 RFCs discuss cu12.9 as prototype).


🏁 Script executed:

# First, find the Dockerfile and understand the context
find . -name "Dockerfile" -type f | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 77


🏁 Script executed:

# Look for uv.lock or other dependency files to determine PyTorch version
find . -name "uv.lock" -o -name "requirements*.txt" -o -name "pyproject.toml" | head -10

Repository: NVIDIA-NeMo/RL

Length of output: 227


🏁 Script executed:

# Read the Dockerfile to see the full context and PyTorch base image
cat -n docker/Dockerfile | head -150

Repository: NVIDIA-NeMo/RL

Length of output: 7509


🏁 Script executed:

# Find PyTorch version in uv.lock
grep -A 5 "name = \"torch\"" uv.lock | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 2893


🌐 Web query:

PyTorch 2.9.0 CUDA 12.9 TORCH_CUDA_ARCH_LIST sm_103 (compute capability 10.3) support - is it officially supported for CUDA extension compilation?

💡 Result:

No, sm_103 (compute capability 10.3) is not officially supported for CUDA extension compilation in PyTorch 2.9.0 with CUDA 12.9 by default. PyTorch 2.9.0 provides official wheels for CUDA 12.6, 12.8, and 13.0. Compute capability 10.3 corresponds to NVIDIA Blackwell GPUs like GB300. PyTorch binaries are compiled for a standard set of architectures up to sm_90 (Hopper H100), with newer architectures like sm_103 requiring explicit inclusion via TORCH_CUDA_ARCH_LIST="10.3" when building custom CUDA extensions from source. Official binaries do not include sm_103 PTX or cubins, so extensions must set this environment variable during compilation to generate compatible code. CUDA 12.9 is mentioned in release discussions but deprecated in favor of 12.8/13.0; compatibility depends on the installed toolkit version.


Remove 10.3 from TORCH_CUDA_ARCH_LIST in the Dockerfile.

PyTorch 2.9.0 with CUDA 12.9 does not officially support sm_103 (compute capability 10.3). Official PyTorch binaries support only up to sm_90 (Hopper) by default, and sm_103 (Blackwell GB300) is not included in the standard toolchain. Setting TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3" will cause CUDA extension builds to fail when the PyTorch 2.9.0 toolchain cannot generate compatible code for this unsupported architecture. Use ENV TORCH_CUDA_ARCH_LIST="9.0 10.0" instead for compatibility with the bundled PyTorch/CUDA version.


Contributor Author

Keeping 10.3 for Grace Blackwell / GB coverage alongside 10.0; the image uses CUDA 12.9 and we verified the build and runtime on B300 with this list. Happy to narrow if CI shows an unsupported-arch error from PyTorch/nvcc.

Contributor

@sahgerlad, that context makes sense — thank you for clarifying. CUDA 12.9 first introduced sm_103 (B300/GB300) arch support, where sm_103 binaries are compatible with sm_100, and the PR already includes the necessary mitigations for the known Triton/ptxas limitation: the standard Triton PTXAS bundled with PyTorch is not compiled for sm_103, but this can be bypassed by pointing to an external ptxas binary that supports the target GPU architecture — which is exactly what the TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas env var and the ptxas-blackwell symlinks in this PR accomplish. Happy to withdraw the concern given the hardware-verified build.
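The bypass described here is easy to picture with a small sketch of the lookup order: an explicit TRITON_PTXAS_PATH override wins, otherwise the assembler is found on PATH. This is a simplified illustration, not Triton's actual resolution code:

```python
import os
import shutil


def resolve_ptxas() -> "str | None":
    """Pick the PTX assembler the way the workaround relies on (simplified)."""
    # An explicit override wins: this is what TRITON_PTXAS_PATH provides,
    # steering JIT compilation to an assembler that knows sm_103.
    override = os.environ.get("TRITON_PTXAS_PATH")
    if override and os.path.isfile(override):
        return override
    # Otherwise fall back to whatever ptxas is on PATH (standing in here for
    # the bundled assembler, which lacks sm_103 support).
    return shutil.which("ptxas")
```

The ptxas-blackwell symlinks in the venvs serve the same purpose for code paths that hardcode the stub name instead of honoring the environment variable.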


✏️ Learnings added
Learnt from: sahgerlad
Repo: NVIDIA-NeMo/RL PR: 2134
File: docker/Dockerfile:113-114
Timestamp: 2026-03-21T03:52:47.988Z
Learning: In `docker/Dockerfile` for NVIDIA-NeMo/RL, `TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"` intentionally includes `10.3` for NVIDIA B300/GB300 (Grace Blackwell) hardware support with CUDA 12.9. The PR pairs this with `TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` and `ptxas-blackwell` symlinks inside the ray/nemo venvs to work around the known issue where Triton's bundled PTXAS does not support sm_103. The maintainer (sahgerlad) has verified this configuration builds and runs correctly on B300 hardware. Do not flag `10.3` in this list as unsupported.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-02-03T02:09:51.636Z
Learning: Applies to Dockerfile.ngc_pytorch : Exception to uv guidelines: `Dockerfile.ngc_pytorch` is exempt from the requirement to use `uv run`

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad force-pushed the feat/docker-blackwell-support branch from fd3c0f7 to 2c32624 Compare March 21, 2026 03:52
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Mar 23, 2026
@yuki-97
Contributor

yuki-97 commented Mar 23, 2026

@terrykong can you help to take a review?

@yuki-97 yuki-97 requested a review from terrykong March 23, 2026 15:54
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Mar 23, 2026
Collaborator

@terrykong terrykong left a comment

Thanks @sahgerlad. Is there a workload that can test this PR? Is this for GB300? I thought that needed CUDA 13, but we haven't moved to CUDA 13 yet.

@sahgerlad
Contributor Author

Thanks @sahgerlad. Is there a workload that can test this PR? Is this for GB300? I thought that needed CUDA 13, but we haven't moved to CUDA 13 yet.

Hi @terrykong , this change targets Blackwell B300 (sm_103) on the existing CUDA 12.9 image, not CUDA 13. The image needs both 10.3 in TORCH_CUDA_ARCH_LIST for extension builds and routing Triton/venv JIT to the CUDA toolkit ptxas so sm_103 is handled; without that, many jobs that compile or JIT for the GPU do not succeed on B300.
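As a small illustration of what the arch list encodes, each compute capability maps to an sm_ compiler target, so "10.3" is what lets extension builds emit sm_103 code for B300. The helper below is hypothetical (not NeMo RL code) and ignores suffixes like "+PTX" for simplicity:

```python
def arch_list_to_sm_targets(arch_list: str) -> list:
    """Map TORCH_CUDA_ARCH_LIST entries like "10.3" to targets like "sm_103"."""
    targets = []
    for cap in arch_list.split():
        major, minor = cap.split(".")
        targets.append(f"sm_{major}{minor}")
    return targets


print(arch_list_to_sm_targets("9.0 10.0 10.3"))  # ['sm_90', 'sm_100', 'sm_103']
```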

@sahgerlad
Contributor Author

sahgerlad commented Mar 23, 2026

We had issues when trying to compile Qwen3-30B-A3B on B300. This PR resolved the issue and the image works on B300. I have a bundled image in my GHCR with this change and the other open PRs that enable successful compilation and performance enhancements:
CUDA graph: #1736
Log cleanup: #1664
Flash Attention: #1628

Collaborator

@terrykong terrykong left a comment

@sahgerlad can you also update these error messages?

if "TORCH_CUDA_ARCH_LIST" not in os.environ:
    raise RuntimeError(
        "TORCH_CUDA_ARCH_LIST is not set. This is required in Megatron backend. This variable is set in our container, but "
        "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'"
    )
else:
    if not dtensor_enable:
        raise ValueError(
            "Please either set policy.megatron_cfg.enabled=true to use Megatron training backend "
            "or set policy.dtensor_cfg.enabled=true to use DTensor training backend."
        )
    # Check if _v2 is enabled in dtensor_cfg (defaults to False for backward compatibility)
    use_v2 = config.get("dtensor_cfg", {}).get("_v2", False)
    if use_v2:
        worker_builder_cls_fqn = "nemo_rl.models.policy.workers.dtensor_policy_worker_v2.DTensorPolicyWorkerV2"
        if "TORCH_CUDA_ARCH_LIST" not in os.environ:
            warnings.warn(
                "TORCH_CUDA_ARCH_LIST is not set. This is needed if using DeepEP in DTensorPolicyWorker V2. This variable is set in our container, but "
                "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'"
            )
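One way the requested message update could look is sketched below. The wrapper function and its name are hypothetical; only the example string changes, extending it to mention the Blackwell capability:

```python
import os

# Updated example string per the review request: now also mentions sm_103.
ARCH_EXAMPLE = "export TORCH_CUDA_ARCH_LIST='9.0 10.0 10.3'"


def require_torch_cuda_arch_list() -> None:
    # Hypothetical wrapper around the check quoted above; raises with the
    # extended example when the variable is missing.
    if "TORCH_CUDA_ARCH_LIST" not in os.environ:
        raise RuntimeError(
            "TORCH_CUDA_ARCH_LIST is not set. This is required in Megatron backend. "
            "This variable is set in our container, but if you are running a custom "
            f"container or baremetal, you may need to set it manually. Example: {ARCH_EXAMPLE}"
        )
```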

Collaborator

terrykong commented Mar 23, 2026

@sahgerlad that's good to know that GB300 works even on our current image; can you share a run showing that it works (put it in the PR description)? Also, I created an issue to track the sglang bump for GB300: #2144

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad requested review from a team as code owners March 23, 2026 21:37
@sahgerlad
Contributor Author

@terrykong
ray-driver.log
Attaching a run log on B300. Note this run includes other changes as well, including the flash attention PR mentioned previously.
