feat: Docker Blackwell / Grace Blackwell container support#2134

Open
sahgerlad wants to merge 5 commits into NVIDIA-NeMo:main from sahgerlad:feat/docker-blackwell-support

Conversation

@sahgerlad
Contributor

@sahgerlad sahgerlad commented Mar 21, 2026

What does this PR do ?

Improve the default docker/Dockerfile so NeMo RL images build and run correctly on Hopper and Grace Blackwell hardware (including B300-class GPUs): pin Triton to the CUDA toolkit ptxas, extend TORCH_CUDA_ARCH_LIST with the Blackwell SMs, and symlink the ptxas-blackwell stubs in the prefetched venvs to the system assembler.

  • Set TRITON_PTXAS_PATH to CUDA toolkit ptxas
  • Extend TORCH_CUDA_ARCH_LIST to 9.0 10.0 10.3 (Hopper + Blackwell)
  • Symlink ptxas-blackwell stubs in ray/nemo venvs to system ptxas
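Concretely, the three bullets map to Dockerfile changes along these lines (a sketch reconstructed from the discussion in this thread; the exact layer placement in docker/Dockerfile may differ):

```dockerfile
# Route Triton's JIT compilation to the CUDA toolkit assembler, which supports sm_103
ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# Build CUDA extensions for Hopper (sm_90) and Blackwell (sm_100, sm_103)
ENV TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"

# Point any prefetched ptxas-blackwell stubs at the system ptxas
RUN find /opt/ray_venvs /opt/nemo_rl_venv -name "ptxas-blackwell" \
      -exec ln -sf /usr/local/cuda/bin/ptxas {} \;
```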

Issues

List issues that this PR closes (syntax):

Usage

Rebuild the image from this branch (same as upstream docs; no new Python API):

# Example: build with local NeMo RL checkout
docker buildx build --build-context nemo-rl=. -f docker/Dockerfile \
  -t nemo-rl:blackwell-test --load .

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

ray-driver.log

Summary by CodeRabbit

  • Chores
    • Updated container configuration to support additional GPU architectures for improved hardware compatibility.
    • Enhanced GPU tooling integration for optimized performance across virtual environments.

@sahgerlad sahgerlad requested a review from a team as a code owner March 21, 2026 03:37
@copy-pr-bot

copy-pr-bot bot commented Mar 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Set TRITON_PTXAS_PATH to CUDA toolkit ptxas
- Extend TORCH_CUDA_ARCH_LIST to 9.0 10.0 10.3 (Hopper + Blackwell)
- Symlink ptxas-blackwell stubs in ray/nemo venvs to system ptxas

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad force-pushed the feat/docker-blackwell-support branch from 049965b to 3cd7284 Compare March 21, 2026 03:38
@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

📝 Walkthrough

Walkthrough

Updated the Docker build configuration to add support for NVIDIA GPU architecture 10.3 and configure the Triton compiler path. Added environment variable for TRITON_PTXAS_PATH and created symlinks for ptxas-blackwell binaries in Python virtual environment directories.

Changes

Cohort / File(s) Summary
CUDA and Triton Configuration
docker/Dockerfile
Added TRITON_PTXAS_PATH environment variable pointing to CUDA ptxas binary; expanded TORCH_CUDA_ARCH_LIST to include GPU architecture 10.3; added post-build step to create symlinks for ptxas-blackwell in ray and nemo-rl virtual environment directories.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)
  • Test Results For Major Changes (⚠️ Warning): the PR adds support for new hardware architectures (Blackwell/Grace Blackwell) but lacks test results, build validation, or evidence of successful builds. Resolution: include Docker build validation results, evidence the image builds successfully, resolution of the PyTorch 2.9.0/sm_103 compatibility issue, and hardware-specific test results before merging.

✅ Passed checks (3 passed)
  • Title check (✅ Passed): the title 'feat: Docker Blackwell / Grace Blackwell container support' directly and accurately reflects the main changes in the PR, which add Blackwell and Grace Blackwell GPU architecture support to the Dockerfile.
  • Docstring Coverage (✅ Passed): no functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
docker/Dockerfile (1)

201-201: Harden the find + symlink step against missing directories.

Line 201 can fail the build if either /opt/ray_venvs or /opt/nemo_rl_venv is absent. Guarding per-directory makes this step resilient.

♻️ Suggested hardening
-RUN find /opt/ray_venvs /opt/nemo_rl_venv -name "ptxas-blackwell" -exec ln -sf /usr/local/cuda/bin/ptxas {} \;
+RUN for d in /opt/ray_venvs /opt/nemo_rl_venv; do \
+      [ -d "$d" ] || continue; \
+      find "$d" -type f -name "ptxas-blackwell" -exec ln -sf /usr/local/cuda/bin/ptxas {} \; ; \
+    done
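The guarded loop can be exercised outside Docker. The sketch below uses throwaway /tmp paths as stand-ins for the real venv directories and the system ptxas; one of the two directories is deliberately missing to show that the guard skips it instead of failing:

```shell
rm -rf /tmp/ptxas-demo
mkdir -p /tmp/ptxas-demo/ray_venvs/venv1/bin
touch /tmp/ptxas-demo/ray_venvs/venv1/bin/ptxas-blackwell   # stub to be replaced
touch /tmp/ptxas-demo/ptxas                                 # stand-in for /usr/local/cuda/bin/ptxas

# /tmp/ptxas-demo/nemo_rl_venv does not exist: the [ -d ... ] guard skips it
for d in /tmp/ptxas-demo/ray_venvs /tmp/ptxas-demo/nemo_rl_venv; do
  [ -d "$d" ] || continue
  find "$d" -type f -name "ptxas-blackwell" -exec ln -sf /tmp/ptxas-demo/ptxas {} \;
done

readlink /tmp/ptxas-demo/ray_venvs/venv1/bin/ptxas-blackwell   # -> /tmp/ptxas-demo/ptxas
```

Note that `-exec … {} \;` runs `ln` once per match; the `{} +` form would pass several link names to a single `ln` invocation, which fails unless the last operand is a directory.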

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 839a1d33-99ef-4d16-9b37-d6ec2cc4e9cf

📥 Commits

Reviewing files that changed from the base of the PR and between 2c1e5e0 and 3cd7284.

📒 Files selected for processing (1)
  • docker/Dockerfile

Comment on lines +113 to +114
# Ensure DeepEP is built for hopper and (grace)blackwell (also mcore inference unified memory API now invokes a torch API that requires these to be set)
ENV TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"
Contributor

@coderabbitai coderabbitai bot Mar 21, 2026

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For the exact PyTorch version used by NVIDIA-NeMo/RL (from uv.lock) and CUDA 12.9 base images, is TORCH_CUDA_ARCH_LIST="10.3" (sm_103) officially supported for CUDA extension compilation? Please provide the official PyTorch/CUDA documentation or release-note link.

💡 Result:

No, TORCH_CUDA_ARCH_LIST="10.3" (sm_103) is not officially supported for CUDA extension compilation in the PyTorch version used by NVIDIA-NeMo/RL (likely PyTorch 2.8.0 from uv.lock patterns and CUDA 12.9 Docker images like pytorch/pytorch:2.8.0-cuda12.9-cudnn9-devel). PyTorch 2.8 CUDA 12.9 builds support up to sm_121 (Blackwell), but sm_103 (GB300/B300, compute capability 10.3) requires CUDA 12.9 compiler support which PyTorch 2.8 may not include in its default TORCH_CUDA_ARCH_LIST for extensions. Official PyTorch docs do not explicitly list sm_103; use torch.cuda.get_arch_list to verify built arches, but extension compilation defaults exclude newest arches until explicitly added in future releases. CUDA 12.9 release notes confirm sm_103 compiler target support, but PyTorch binaries lag (e.g., PyTorch 2.8 RFCs discuss cu12.9 as prototype).


🏁 Script executed:

# First, find the Dockerfile and understand the context
find . -name "Dockerfile" -type f | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 77


🏁 Script executed:

# Look for uv.lock or other dependency files to determine PyTorch version
find . -name "uv.lock" -o -name "requirements*.txt" -o -name "pyproject.toml" | head -10

Repository: NVIDIA-NeMo/RL

Length of output: 227


🏁 Script executed:

# Read the Dockerfile to see the full context and PyTorch base image
cat -n docker/Dockerfile | head -150

Repository: NVIDIA-NeMo/RL

Length of output: 7509


🏁 Script executed:

# Find PyTorch version in uv.lock
grep -A 5 "name = \"torch\"" uv.lock | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 2893


🌐 Web query:

PyTorch 2.9.0 CUDA 12.9 TORCH_CUDA_ARCH_LIST sm_103 (compute capability 10.3) support - is it officially supported for CUDA extension compilation?

💡 Result:

No, sm_103 (compute capability 10.3) is not officially supported for CUDA extension compilation in PyTorch 2.9.0 with CUDA 12.9 by default. PyTorch 2.9.0 provides official wheels for CUDA 12.6, 12.8, and 13.0. Compute capability 10.3 corresponds to NVIDIA Blackwell GPUs like GB300. PyTorch binaries are compiled for a standard set of architectures up to sm_90 (Hopper H100), with newer architectures like sm_103 requiring explicit inclusion via TORCH_CUDA_ARCH_LIST="10.3" when building custom CUDA extensions from source. Official binaries do not include sm_103 PTX or cubins, so extensions must set this environment variable during compilation to generate compatible code. CUDA 12.9 is mentioned in release discussions but deprecated in favor of 12.8/13.0; compatibility depends on the installed toolkit version.


Remove 10.3 from TORCH_CUDA_ARCH_LIST in the Dockerfile.

PyTorch 2.9.0 with CUDA 12.9 does not officially support sm_103 (compute capability 10.3). Official PyTorch binaries support only up to sm_90 (Hopper) by default, and sm_103 (Blackwell GB300) is not included in the standard toolchain. Setting TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3" will cause CUDA extension builds to fail when the PyTorch 2.9.0 toolchain cannot generate compatible code for this unsupported architecture. Use ENV TORCH_CUDA_ARCH_LIST="9.0 10.0" instead for compatibility with the bundled PyTorch/CUDA version.


Contributor Author

Keeping 10.3 for Grace Blackwell / GB coverage alongside 10.0; the image uses CUDA 12.9 and we verified the build and runtime on B300 with this list. Happy to narrow if CI shows an unsupported-arch error from PyTorch/nvcc.

Contributor

@sahgerlad, that context makes sense — thank you for clarifying. CUDA 12.9 first introduced sm_103 (B300/GB300) arch support, where sm_103 binaries are compatible with sm_100, and the PR already includes the necessary mitigations for the known Triton/ptxas limitation: the standard Triton PTXAS bundled with PyTorch is not compiled for sm_103, but this can be bypassed by pointing to an external ptxas binary that supports the target GPU architecture — which is exactly what the TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas env var and the ptxas-blackwell symlinks in this PR accomplish. Happy to withdraw the concern given the hardware-verified build.
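The bypass described here is easy to picture with a small sketch of the lookup order: an explicit TRITON_PTXAS_PATH override wins, otherwise the assembler is found on PATH. This is a simplified illustration, not Triton's actual resolution code:

```python
import os
import shutil


def resolve_ptxas() -> "str | None":
    """Pick the PTX assembler the way the workaround relies on (simplified)."""
    # An explicit override wins: this is what TRITON_PTXAS_PATH provides,
    # steering JIT compilation to an assembler that knows sm_103.
    override = os.environ.get("TRITON_PTXAS_PATH")
    if override and os.path.isfile(override):
        return override
    # Otherwise fall back to whatever ptxas is on PATH (standing in here for
    # the bundled assembler, which lacks sm_103 support).
    return shutil.which("ptxas")
```

The ptxas-blackwell symlinks in the venvs serve the same purpose for code paths that hardcode the stub name instead of honoring the environment variable.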


✏️ Learnings added
Learnt from: sahgerlad
Repo: NVIDIA-NeMo/RL PR: 2134
File: docker/Dockerfile:113-114
Timestamp: 2026-03-21T03:52:47.988Z
Learning: In `docker/Dockerfile` for NVIDIA-NeMo/RL, `TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"` intentionally includes `10.3` for NVIDIA B300/GB300 (Grace Blackwell) hardware support with CUDA 12.9. The PR pairs this with `TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` and `ptxas-blackwell` symlinks inside the ray/nemo venvs to work around the known issue where Triton's bundled PTXAS does not support sm_103. The maintainer (sahgerlad) has verified this configuration builds and runs correctly on B300 hardware. Do not flag `10.3` in this list as unsupported.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-02-03T02:09:51.636Z
Learning: Applies to Dockerfile.ngc_pytorch : Exception to uv guidelines: `Dockerfile.ngc_pytorch` is exempt from the requirement to use `uv run`

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad force-pushed the feat/docker-blackwell-support branch from fd3c0f7 to 2c32624 Compare March 21, 2026 03:52
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Mar 23, 2026
@yuki-97
Contributor

yuki-97 commented Mar 23, 2026

@terrykong can you help to take a review?

@yuki-97 yuki-97 requested a review from terrykong March 23, 2026 15:54
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Mar 23, 2026
Collaborator

@terrykong terrykong left a comment

Thanks @sahgerlad. Is there a workload that can test this PR? Is this for GB300? I thought that needed CUDA 13, but we haven't moved to CUDA 13 yet.

@sahgerlad
Contributor Author

Thanks @sahgerlad. Is there a workload that can test this PR? Is this for GB300? I thought that needed CUDA 13, but we haven't moved to CUDA 13 yet.

Hi @terrykong , this change targets Blackwell B300 (sm_103) on the existing CUDA 12.9 image, not CUDA 13. The image needs both 10.3 in TORCH_CUDA_ARCH_LIST for extension builds and routing Triton/venv JIT to the CUDA toolkit ptxas so sm_103 is handled; without that, many jobs that compile or JIT for the GPU do not succeed on B300.
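As a small illustration of what the arch list encodes, each compute capability maps to an sm_ compiler target, so "10.3" is what lets extension builds emit sm_103 code for B300. The helper below is hypothetical (not NeMo RL code) and ignores suffixes like "+PTX" for simplicity:

```python
def arch_list_to_sm_targets(arch_list: str) -> list:
    """Map TORCH_CUDA_ARCH_LIST entries like "10.3" to targets like "sm_103"."""
    targets = []
    for cap in arch_list.split():
        major, minor = cap.split(".")
        targets.append(f"sm_{major}{minor}")
    return targets


print(arch_list_to_sm_targets("9.0 10.0 10.3"))  # ['sm_90', 'sm_100', 'sm_103']
```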

@sahgerlad
Contributor Author

sahgerlad commented Mar 23, 2026

We had issues when trying to compile Qwen3-30B-A3B on B300. This PR resolved the issue and the image works on B300. I have a bundled image in my GHCR with this change and the other open PRs that enable successful compilation and performance enhancements:
CUDA graph: #1736
Log cleanup: #1664
Flash Attention: #1628

Collaborator

@terrykong terrykong left a comment

@sahgerlad can you also update these error messages?

if "TORCH_CUDA_ARCH_LIST" not in os.environ:
    raise RuntimeError(
        "TORCH_CUDA_ARCH_LIST is not set. This is required in Megatron backend. This variable is set in our container, but "
        "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'"
    )
else:
    if not dtensor_enable:
        raise ValueError(
            "Please either set policy.megatron_cfg.enabled=true to use Megatron training backend "
            "or set policy.dtensor_cfg.enabled=true to use DTensor training backend."
        )
    # Check if _v2 is enabled in dtensor_cfg (defaults to False for backward compatibility)
    use_v2 = config.get("dtensor_cfg", {}).get("_v2", False)
    if use_v2:
        worker_builder_cls_fqn = "nemo_rl.models.policy.workers.dtensor_policy_worker_v2.DTensorPolicyWorkerV2"
        if "TORCH_CUDA_ARCH_LIST" not in os.environ:
            warnings.warn(
                "TORCH_CUDA_ARCH_LIST is not set. This is needed if using DeepEP in DTensorPolicyWorker V2. This variable is set in our container, but "
                "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'"
            )
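One way the requested message update could look is sketched below. The wrapper function and its name are hypothetical; only the example string changes, extending it to mention the Blackwell capability:

```python
import os

# Updated example string per the review request: now also mentions sm_103.
ARCH_EXAMPLE = "export TORCH_CUDA_ARCH_LIST='9.0 10.0 10.3'"


def require_torch_cuda_arch_list() -> None:
    # Hypothetical wrapper around the check quoted above; raises with the
    # extended example when the variable is missing.
    if "TORCH_CUDA_ARCH_LIST" not in os.environ:
        raise RuntimeError(
            "TORCH_CUDA_ARCH_LIST is not set. This is required in Megatron backend. "
            "This variable is set in our container, but if you are running a custom "
            f"container or baremetal, you may need to set it manually. Example: {ARCH_EXAMPLE}"
        )
```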

Collaborator

terrykong commented Mar 23, 2026

@sahgerlad that's good to know that GB300 works even on our current image; can you share a run showing that it works (put it in the PR description)? Also, I created an issue to track the sglang bump for GB300: #2144

Signed-off-by: Sahger Lad <lad.sahger@gmail.com>
@sahgerlad sahgerlad requested review from a team as code owners March 23, 2026 21:37
@sahgerlad
Contributor Author

@terrykong
ray-driver.log
Attaching a run log on B300. Note this run includes other changes as well, including the flash attention PR mentioned previously.
