9 changes: 7 additions & 2 deletions docker/Dockerfile
@@ -72,6 +72,7 @@ RUN GITHUB_ARTIFACTORY=github.com \
ARG UV_VERSION=0.9.7
ARG PYTHON_VERSION=3.12
ENV PATH="/root/.local/bin:$PATH"
+ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas
RUN curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh && \
uv python install ${PYTHON_VERSION}

@@ -109,8 +110,8 @@ ARG SKIP_SGLANG_BUILD
ENV UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv
ENV UV_LINK_MODE=copy

-# Ensure DeepEP is built for H100 and B200 (also mcore inference unified memory API now invokes a torch API that requires these to be set)
-ENV TORCH_CUDA_ARCH_LIST="9.0 10.0"
+# Ensure DeepEP is built for hopper and (grace)blackwell (also mcore inference unified memory API now invokes a torch API that requires these to be set)
+ENV TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"
Comment on lines +113 to +114
Contributor
@coderabbitai bot commented on Mar 21, 2026
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For the exact PyTorch version used by NVIDIA-NeMo/RL (from uv.lock) and CUDA 12.9 base images, is TORCH_CUDA_ARCH_LIST="10.3" (sm_103) officially supported for CUDA extension compilation? Please provide the official PyTorch/CUDA documentation or release-note link.

💡 Result:

No, TORCH_CUDA_ARCH_LIST="10.3" (sm_103) is not officially supported for CUDA extension compilation in the PyTorch version used by NVIDIA-NeMo/RL (likely PyTorch 2.8.0 from uv.lock patterns and CUDA 12.9 Docker images like pytorch/pytorch:2.8.0-cuda12.9-cudnn9-devel). PyTorch 2.8 CUDA 12.9 builds support up to sm_121 (Blackwell), but sm_103 (GB300/B300, compute capability 10.3) requires CUDA 12.9 compiler support which PyTorch 2.8 may not include in its default TORCH_CUDA_ARCH_LIST for extensions. Official PyTorch docs do not explicitly list sm_103; use torch.cuda.get_arch_list to verify built arches, but extension compilation defaults exclude newest arches until explicitly added in future releases. CUDA 12.9 release notes confirm sm_103 compiler target support, but PyTorch binaries lag (e.g., PyTorch 2.8 RFCs discuss cu12.9 as prototype).


🏁 Script executed:

# First, find the Dockerfile and understand the context
find . -name "Dockerfile" -type f | head -5

Repository: NVIDIA-NeMo/RL

Length of output: 77


🏁 Script executed:

# Look for uv.lock or other dependency files to determine PyTorch version
find . -name "uv.lock" -o -name "requirements*.txt" -o -name "pyproject.toml" | head -10

Repository: NVIDIA-NeMo/RL

Length of output: 227


🏁 Script executed:

# Read the Dockerfile to see the full context and PyTorch base image
cat -n docker/Dockerfile | head -150

Repository: NVIDIA-NeMo/RL

Length of output: 7509


🏁 Script executed:

# Find PyTorch version in uv.lock
grep -A 5 "name = \"torch\"" uv.lock | head -20

Repository: NVIDIA-NeMo/RL

Length of output: 2893


🌐 Web query:

PyTorch 2.9.0 CUDA 12.9 TORCH_CUDA_ARCH_LIST sm_103 (compute capability 10.3) support - is it officially supported for CUDA extension compilation?

💡 Result:

No, sm_103 (compute capability 10.3) is not officially supported for CUDA extension compilation in PyTorch 2.9.0 with CUDA 12.9 by default. PyTorch 2.9.0 provides official wheels for CUDA 12.6, 12.8, and 13.0. Compute capability 10.3 corresponds to NVIDIA Blackwell GPUs like GB300. PyTorch binaries are compiled for a standard set of architectures up to sm_90 (Hopper H100), with newer architectures like sm_103 requiring explicit inclusion via TORCH_CUDA_ARCH_LIST="10.3" when building custom CUDA extensions from source. Official binaries do not include sm_103 PTX or cubins, so extensions must set this environment variable during compilation to generate compatible code. CUDA 12.9 is mentioned in release discussions but deprecated in favor of 12.8/13.0; compatibility depends on the installed toolkit version.


Remove 10.3 from TORCH_CUDA_ARCH_LIST in the Dockerfile.

PyTorch 2.9.0 with CUDA 12.9 does not officially support sm_103 (compute capability 10.3). Official PyTorch binaries support only up to sm_90 (Hopper) by default, and sm_103 (Blackwell GB300) is not included in the standard toolchain. Setting TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3" will cause CUDA extension builds to fail when the PyTorch 2.9.0 toolchain cannot generate compatible code for this unsupported architecture. Use ENV TORCH_CUDA_ARCH_LIST="9.0 10.0" instead for compatibility with the bundled PyTorch/CUDA version.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/Dockerfile` around lines 113-114, update the Dockerfile's
TORCH_CUDA_ARCH_LIST environment variable by removing the unsupported "10.3"
entry: locate the ENV line setting TORCH_CUDA_ARCH_LIST and change its value
from "9.0 10.0 10.3" to "9.0 10.0" so CUDA extension builds use only supported
architectures for the bundled PyTorch/CUDA toolchain.
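The failure mode the bot describes hinges on how `TORCH_CUDA_ARCH_LIST` entries become `nvcc` code-generation flags. The sketch below is illustrative only, not PyTorch's actual `torch.utils.cpp_extension` implementation, but it follows the same convention: each `X.Y` entry maps to `sm_XY`, and a `+PTX` suffix additionally embeds PTX for that arch.

```python
# Hypothetical sketch (not PyTorch's real code) of expanding a
# TORCH_CUDA_ARCH_LIST value into nvcc -gencode flags.
def gencode_flags(arch_list: str) -> list[str]:
    flags = []
    for entry in arch_list.split():
        ptx = entry.endswith("+PTX")
        num = entry.removesuffix("+PTX").replace(".", "")
        # cubin for the named SM architecture
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
        if ptx:
            # also embed forward-compatible PTX
            flags.append(f"-gencode=arch=compute_{num},code=compute_{num}")
    return flags

print(gencode_flags("9.0 10.0 10.3"))
```

Under this convention, `10.3` asks `nvcc` for an `sm_103` cubin, which is why toolchain support for that target (CUDA 12.9+) is the deciding factor.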

Contributor Author
@sahgerlad replied:
Keeping 10.3 for Grace Blackwell / GB coverage alongside 10.0; the image uses CUDA 12.9 and we verified the build and runtime on B300 with this list. Happy to narrow if CI shows an unsupported-arch error from PyTorch/nvcc.

Contributor
@coderabbitai bot replied:
@sahgerlad, that context makes sense — thank you for clarifying. CUDA 12.9 first introduced sm_103 (B300/GB300) arch support, where sm_103 binaries are compatible with sm_100, and the PR already includes the necessary mitigations for the known Triton/ptxas limitation: the standard Triton PTXAS bundled with PyTorch is not compiled for sm_103, but this can be bypassed by pointing to an external ptxas binary that supports the target GPU architecture — which is exactly what the TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas env var and the ptxas-blackwell symlinks in this PR accomplish. Happy to withdraw the concern given the hardware-verified build.
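The override mechanism described here can be sketched as a simple environment lookup. This is a hedged illustration, not Triton's actual resolution logic (which differs in detail), and the bundled path shown is a placeholder:

```python
import os

# Sketch of the TRITON_PTXAS_PATH override: if the variable is set,
# that ptxas binary is used instead of the one bundled with Triton
# (which may not support sm_103). Bundled path is illustrative only.
def resolve_ptxas(bundled: str = "site-packages/triton/backends/nvidia/bin/ptxas") -> str:
    return os.environ.get("TRITON_PTXAS_PATH", bundled)

os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda/bin/ptxas"
print(resolve_ptxas())  # /usr/local/cuda/bin/ptxas
```

With the Dockerfile's `ENV TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas`, the CUDA 12.9 toolkit's ptxas (which supports sm_103) is used in place of the bundled one.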


✏️ Learnings added
Learnt from: sahgerlad
Repo: NVIDIA-NeMo/RL PR: 2134
File: docker/Dockerfile:113-114
Timestamp: 2026-03-21T03:52:47.988Z
Learning: In `docker/Dockerfile` for NVIDIA-NeMo/RL, `TORCH_CUDA_ARCH_LIST="9.0 10.0 10.3"` intentionally includes `10.3` for NVIDIA B300/GB300 (Grace Blackwell) hardware support with CUDA 12.9. The PR pairs this with `TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas` and `ptxas-blackwell` symlinks inside the ray/nemo venvs to work around the known issue where Triton's bundled PTXAS does not support sm_103. The maintainer (sahgerlad) has verified this configuration builds and runs correctly on B300 hardware. Do not flag `10.3` in this list as unsupported.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA-NeMo/RL PR: 0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2026-02-03T02:09:51.636Z
Learning: Applies to Dockerfile.ngc_pytorch : Exception to uv guidelines: `Dockerfile.ngc_pytorch` is exempt from the requirement to use `uv run`


# First copy only the dependency files
COPY --from=nemo-rl pyproject.toml uv.lock ./
@@ -197,6 +198,10 @@ else
UV_LINK_MODE=symlink uv run nemo_rl/utils/prefetch_venvs.py
fi
EOF
+RUN for d in /opt/ray_venvs /opt/nemo_rl_venv; do \
+    [ -d "$d" ] || continue; \
+    find "$d" -name "ptxas-blackwell" -exec ln -sf /usr/local/cuda/bin/ptxas {} +; \
+done

# Generate container fingerprint for frozen environment support
# Store outside /opt/nemo-rl to avoid being overwritten by user mounts
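The effect of the `find`/`ln -sf` loop added above can be dry-run against a throwaway tree. The Python rendering below mimics the loop's behavior (skip missing roots, replace every `ptxas-blackwell` with a symlink to the CUDA toolkit ptxas); the directory layout is invented for illustration, whereas the real loop walks `/opt/ray_venvs` and `/opt/nemo_rl_venv`:

```python
import os
import tempfile

# Build a fake venv tree containing a stub ptxas-blackwell binary.
tmp = tempfile.mkdtemp()
bin_dir = os.path.join(tmp, "venv", "triton", "bin")
os.makedirs(bin_dir)
stub = os.path.join(bin_dir, "ptxas-blackwell")
open(stub, "w").close()

# Equivalent of: for d in ...; do [ -d "$d" ] || continue;
#   find "$d" -name ptxas-blackwell -exec ln -sf /usr/local/cuda/bin/ptxas {} +; done
for d in (os.path.join(tmp, "venv"), os.path.join(tmp, "missing")):
    if not os.path.isdir(d):
        continue  # mirrors the [ -d "$d" ] || continue guard
    for root, _dirs, files in os.walk(d):
        for name in files:
            if name == "ptxas-blackwell":
                path = os.path.join(root, name)
                os.remove(path)  # ln -sf replaces the existing file
                os.symlink("/usr/local/cuda/bin/ptxas", path)

print(os.readlink(stub))  # /usr/local/cuda/bin/ptxas
```

After the loop, any Triton `ptxas-blackwell` shim inside the venvs resolves to the toolkit ptxas that supports sm_103.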
4 changes: 2 additions & 2 deletions nemo_rl/models/policy/lm_policy.py
@@ -100,7 +100,7 @@ def __init__(
if "TORCH_CUDA_ARCH_LIST" not in os.environ:
raise RuntimeError(
"TORCH_CUDA_ARCH_LIST is not set. This is required in Megatron backend. This variable is set in our container, but "
-                "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'"
+                "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0 10.3'"
)

else:
@@ -118,7 +118,7 @@ def __init__(
if "TORCH_CUDA_ARCH_LIST" not in os.environ:
warnings.warn(
"TORCH_CUDA_ARCH_LIST is not set. This is needed if using DeepEP in DTensorPolicyWorker V2. This variable is set in our container, but "
-                "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0'"
+                "if you are running a custom container or baremetal, you may need to set this variable manually. Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0 10.3'"
)
else:
assert (
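The two hunks in `lm_policy.py` share one pattern: the Megatron path treats a missing `TORCH_CUDA_ARCH_LIST` as fatal, while the DTensor/DeepEP path only warns. A standalone sketch of that guard, with the message abbreviated from the diff:

```python
import os
import warnings

# Sketch of the guard pattern in lm_policy.py: fatal for the Megatron
# backend, warning-only for the DTensor/DeepEP path.
def check_arch_list(required: bool) -> None:
    if "TORCH_CUDA_ARCH_LIST" not in os.environ:
        msg = ("TORCH_CUDA_ARCH_LIST is not set. "
               "Example: export TORCH_CUDA_ARCH_LIST='9.0 10.0 10.3'")
        if required:
            raise RuntimeError(msg)
        warnings.warn(msg)

os.environ.pop("TORCH_CUDA_ARCH_LIST", None)
try:
    check_arch_list(required=True)   # Megatron path: hard failure
except RuntimeError as err:
    print("fatal:", err)

os.environ["TORCH_CUDA_ARCH_LIST"] = "9.0 10.0 10.3"
check_arch_list(required=True)       # passes once the variable is set
```

The diff changes only the example value in both messages, from `'9.0 10.0'` to `'9.0 10.0 10.3'`, keeping the guidance consistent with the Dockerfile's new arch list.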