feat: Docker Blackwell / Grace Blackwell container support #2134
Open
sahgerlad wants to merge 5 commits into NVIDIA-NeMo:main from sahgerlad:feat/docker-blackwell-support
5 commits:
- 3cd7284 feat(docker): Blackwell / Grace Blackwell container support (sahgerlad)
- 2c32624 Harden the find and guarding per-directory (sahgerlad)
- 11afc66 Merge branch 'main' into feat/docker-blackwell-support (sahgerlad)
- ffde090 Update CUDA_ARCH_LIST error message (sahgerlad)
- 160da9f Merge branch 'main' into feat/docker-blackwell-support (sahgerlad)
🧩 Analysis chain
🌐 Web query:
For the exact PyTorch version used by NVIDIA-NeMo/RL (from uv.lock) and CUDA 12.9 base images, is TORCH_CUDA_ARCH_LIST="10.3" (sm_103) officially supported for CUDA extension compilation? Please provide the official PyTorch/CUDA documentation or release-note link.
💡 Result:
No, TORCH_CUDA_ARCH_LIST="10.3" (sm_103) is not officially supported for CUDA extension compilation in the PyTorch version used by NVIDIA-NeMo/RL (likely PyTorch 2.8.0 from uv.lock patterns and CUDA 12.9 Docker images like pytorch/pytorch:2.8.0-cuda12.9-cudnn9-devel). PyTorch 2.8 CUDA 12.9 builds support up to sm_121 (Blackwell), but sm_103 (GB300/B300, compute capability 10.3) requires CUDA 12.9 compiler support which PyTorch 2.8 may not include in its default TORCH_CUDA_ARCH_LIST for extensions. Official PyTorch docs do not explicitly list sm_103; use torch.cuda.get_arch_list to verify built arches, but extension compilation defaults exclude newest arches until explicitly added in future releases. CUDA 12.9 release notes confirm sm_103 compiler target support, but PyTorch binaries lag (e.g., PyTorch 2.8 RFCs discuss cu12.9 as prototype).
🌐 Web query:
PyTorch 2.9.0 CUDA 12.9 TORCH_CUDA_ARCH_LIST sm_103 (compute capability 10.3) support — is it officially supported for CUDA extension compilation?
💡 Result:
No, sm_103 (compute capability 10.3) is not officially supported for CUDA extension compilation in PyTorch 2.9.0 with CUDA 12.9 by default. PyTorch 2.9.0 provides official wheels for CUDA 12.6, 12.8, and 13.0. Compute capability 10.3 corresponds to NVIDIA Blackwell GPUs like GB300. PyTorch binaries are compiled for a standard set of architectures up to sm_90 (Hopper H100), with newer architectures like sm_103 requiring explicit inclusion via TORCH_CUDA_ARCH_LIST="10.3" when building custom CUDA extensions from source. Official binaries do not include sm_103 PTX or cubins, so extensions must set this environment variable during compilation to generate compatible code. CUDA 12.9 is mentioned in release discussions but deprecated in favor of 12.8/13.0; compatibility depends on the installed toolkit version.
Remove 10.3 from TORCH_CUDA_ARCH_LIST in the Dockerfile. PyTorch 2.9.0 with CUDA 12.9 does not officially support sm_103 (compute capability 10.3). Official PyTorch binaries support only up to sm_90 (Hopper) by default, and sm_103 (Blackwell GB300) is not included in the standard toolchain. Setting TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3" will cause CUDA extension builds to fail when the PyTorch 2.9.0 toolchain cannot generate compatible code for this unsupported architecture. Use ENV TORCH_CUDA_ARCH_LIST="9.0 10.0" instead for compatibility with the bundled PyTorch/CUDA version.
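The naming convention behind this disagreement is mechanical: each TORCH_CUDA_ARCH_LIST entry like "10.3" maps to an nvcc target tag like sm_103. The sketch below illustrates that mapping and how one might flag entries a given toolchain cannot compile for; arch_list_to_sm and unsupported are hypothetical helpers for illustration, not PyTorch APIs.

```python
def arch_list_to_sm(arch_list: str) -> list[str]:
    """Convert a TORCH_CUDA_ARCH_LIST string such as "9.0 10.0 10.3"
    into nvcc-style tags ("sm_90", "sm_100", "sm_103").
    Entries may carry a "+PTX" suffix, which is dropped here."""
    tags = []
    for entry in arch_list.split():
        cc = entry.removesuffix("+PTX")   # e.g. "10.3"
        major, minor = cc.split(".")
        tags.append(f"sm_{major}{minor}")
    return tags

def unsupported(arch_list: str, toolchain_archs: set[str]) -> list[str]:
    """Return the entries the given toolchain cannot compile for."""
    return [t for t in arch_list_to_sm(arch_list) if t not in toolchain_archs]

# Example against a hypothetical toolchain that stops at sm_100:
print(unsupported("9.0 10.0 10.3", {"sm_90", "sm_100"}))  # -> ['sm_103']
```

In a real build one would compare against what the installed compiler actually accepts (for instance via torch.cuda.get_arch_list() on the installed wheel) rather than a hard-coded set.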
Keeping 10.3 for Grace Blackwell / GB coverage alongside 10.0; the image uses CUDA 12.9 and we verified the build and runtime on B300 with this list. Happy to narrow if CI shows an unsupported-arch error from PyTorch/nvcc.
@sahgerlad, that context makes sense — thank you for clarifying. CUDA 12.9 first introduced sm_103 (B300/GB300) arch support, where sm_103 binaries are compatible with sm_100, and the PR already includes the necessary mitigations for the known Triton/ptxas limitation: the standard Triton ptxas bundled with PyTorch is not compiled for sm_103, but this can be bypassed by pointing to an external ptxas binary that supports the target GPU architecture — which is exactly what the TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas env var and the ptxas-blackwell symlinks in this PR accomplish. Happy to withdraw the concern given the hardware-verified build.
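The ptxas redirection described above can be sketched as follows. This is a minimal illustration assuming the default CUDA install location; the configure_triton_ptxas helper is hypothetical and not code from the PR, which sets the variable via ENV in the Dockerfile.

```python
import os
from pathlib import Path

# Assumed location of the CUDA toolkit's ptxas, which (on CUDA >= 12.9)
# knows the sm_103 target, unlike the ptxas bundled with the Triton wheel.
CUDA_PTXAS = Path("/usr/local/cuda/bin/ptxas")

def configure_triton_ptxas(env: dict, cuda_ptxas: Path = CUDA_PTXAS) -> dict:
    """Return a copy of `env` that points Triton at an external ptxas.

    Triton honors the TRITON_PTXAS_PATH environment variable when
    assembling PTX, so setting it redirects compilation to a ptxas
    that supports the target architecture. The input dict is not mutated.
    """
    env = dict(env)
    env["TRITON_PTXAS_PATH"] = str(cuda_ptxas)
    return env

# Usage: build a child-process environment with the override applied.
child_env = configure_triton_ptxas(os.environ.copy())
```

In the PR itself the same effect is achieved declaratively with a Dockerfile ENV line rather than at runtime.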