Adding support for arm64 containers by Kipok · Pull Request #856 · NVIDIA-NeMo/Skills

Kipok · 2025-09-28T18:51:43Z

Updated the nemo-skills container base from Debian to Ubuntu to make it easier to install Apptainer on arm64
Upgraded the trtllm / sglang containers
Added instructions for building on arm64
Removed the nemo-aligner Slurm test, as support will soon be fully removed
Rolled back the nemo-rl Dockerfile to a stable version (0.7.0)

Summary by CodeRabbit

New Features
- Added ARM64/aarch64 build guide and multi-platform build instructions.
- Introduced a benchmarks environment for builds.
Chores
- Upgraded default container images: TensorRT‑LLM → 1.0.0, SGLang → v0.5.3rc1‑cu126, NeMo‑Skills → 0.7.1.
- Updated base image and dependency installation for build images.
Documentation
- Updated Docker build README and docs to reflect new images and ARM64 notes.
Tests
- Bumped CI/test image tags, simplified SLURM test path, and increased one test execution timeout.
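
As a rough illustration of the multi-platform build note above, the sketch below maps the host architecture to a Docker platform string. The helper name `docker_platform` and the image tag in the commented command are hypothetical, and the actual build guide in `dockerfiles/README.md` may use a different invocation.

```shell
# Map a machine name (as printed by `uname -m`) to a Docker platform string.
# docker_platform is a hypothetical helper, not part of the repository.
docker_platform() {
    case "$1" in
        aarch64|arm64) echo "linux/arm64" ;;
        x86_64|amd64)  echo "linux/amd64" ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Print the platform for the current host, if it is one of the two handled above.
platform=$(docker_platform "$(uname -m)") && echo "$platform"

# Assumed build command; the tag is a placeholder, adjust it to your registry:
# docker buildx build --platform "$platform" \
#     -f dockerfiles/Dockerfile.nemo-skills -t my-registry/nemo-skills:dev .
```

Building on a native arm64 host avoids emulation; cross-building from x86_64 with `--platform linux/arm64` typically requires QEMU/binfmt support to be set up first.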

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai · 2025-09-28T18:51:51Z

Walkthrough

Updated container image tags across CI, cluster configs, docs, and tests; modified Dockerfiles (base image, installs, apptainer, commit pin) and removed sglang patch application; adjusted sglang loader behavior; removed one nemo-aligner test invocation; added arm64 build guidance and benchmark setup.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI workflows `.github/workflows/gpu_tests.yml`, `.github/workflows/tests.yml` | Bumped `nemo-skills` image tag from `0.7.0`→`0.7.1` in build/pull/cleanup steps; no control-flow changes. |
| Configs & defaults: image tags `cluster_configs/example-local.yaml`, `cluster_configs/example-slurm.yaml`, `nemo_skills/__init__.py`, `tests/gpu-tests/test-local.yaml`, `docs/basics/index.md` | Updated container references: `trtllm` `release:0.21.0`→`release:1.0.0`, `sglang` `igitman/...:0.7.0`→`lmsysorg/sglang:v0.5.3rc1-cu126` (docs add arm64 variants), `nemo-skills` `0.7.0`→`0.7.1`. |
| Dockerfiles: core images & builds `dockerfiles/Dockerfile.nemo-skills`, `dockerfiles/Dockerfile.sglang`, `dockerfiles/Dockerfile.nemo-rl` | `Dockerfile.nemo-skills`: switch base to `ubuntu:22.04`, rewrite apt/pip/apptainer installs, add benchmarks clone/setup. `Dockerfile.sglang`: removed base-image pin and patch-apply steps. `Dockerfile.nemo-rl`: updated `NEMO_RL_COMMIT` hash. |
| Docker build docs `dockerfiles/README.md` | Added arm64/aarch64 build guide and updated referenced image tags (nemo-skills, trtllm, sglang, vllm) including arm64 variants. |
| SGLang patch content `dockerfiles/sglang.patch` | Removed remaining-keys validation; converted a `ShardedStateLoader` method from `@staticmethod` to instance method; added optional `post_load_weights` hook and ensured `model.eval()` is returned. |
| Slurm tests script `tests/slurm-tests/run_all.sh` | Removed nemo-aligner `omr_simple_recipe` invocation and its `sleep 10`; script now runs only the nemo-rl path and final wait. |
| Tests tweak `tests/test_code_execution.py` | Increased `sandbox.execute_code` timeout to `timeout=60` in `test_lean4_mathlib_code_execution`. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as Slurm Test Runner
  participant Job as omr_simple_recipe
  note over Runner,Job: tests/slurm-tests/run_all.sh (after change)
  Runner->>Job: Run with backend=nemo-rl
  Job-->>Runner: Complete
  Runner-->>Runner: Final wait

sequenceDiagram
  autonumber
  participant Client as Caller
  participant Loader as ShardedStateLoader (instance)
  participant Model as Model
  note over Loader: loader is now an instance method and may call post_load_weights
  Client->>Loader: load_sharded_state(...)
  Loader->>Model: load_state_dict(...)
  alt Model has post_load_weights
    Loader->>Model: post_load_weights()
    Note right of Model: "Post loading weights"
  end
  Loader-->>Client: return model.eval()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through tags anew,
From 0.7.0 to .1 we flew.
Triton hums at 1.0 bright,
Loader whispers weights at night.
Slurm trims a path — carrot cheers! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title “Adding support for arm64 containers” succinctly and accurately captures the primary purpose of the pull request, which revolves around updating container base images, build instructions, and CI workflows to enable arm64 support. It is a clear, single-sentence description that focuses on the most significant change without extraneous details. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

dockerfiles/README.md (1)

13-16: Wrap the bare URL to satisfy markdownlint.

markdownlint (MD034) is flagging the bare link added on Line 14. Please wrap it in angle brackets or convert it into Markdown link syntax to keep the docs lint-clean.
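
One way to apply the angle-bracket fix mechanically is sketched below. The function name `wrap_bare_urls` and the sample line are made up for illustration (the actual README line is not shown in this review), and the pattern is deliberately simple: it would also swallow trailing punctuation that directly follows a URL.

```shell
# Wrap bare http(s) URLs in angle brackets so markdownlint MD034 stops flagging them.
# URLs already inside <...> or (...) are skipped because they do not follow
# whitespace or the start of the line.
wrap_bare_urls() {
    sed -E 's#(^|[[:space:]])(https?://[^[:space:]<>()]+)#\1<\2>#g'
}

# Hypothetical README line, used only for illustration:
printf '%s\n' 'See https://apptainer.org/docs/admin/1.1/installation.html for details' | wrap_bare_urls
```

For a single flagged link, editing the line by hand is simpler; a filter like this is mainly useful when many bare URLs need the same treatment.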

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec5fb53 and 81cc258.

📒 Files selected for processing (13)

.github/workflows/gpu_tests.yml (2 hunks)
.github/workflows/tests.yml (1 hunks)
cluster_configs/example-local.yaml (1 hunks)
cluster_configs/example-slurm.yaml (1 hunks)
dockerfiles/Dockerfile.nemo-rl (1 hunks)
dockerfiles/Dockerfile.nemo-skills (2 hunks)
dockerfiles/Dockerfile.sglang (0 hunks)
dockerfiles/README.md (1 hunks)
dockerfiles/sglang.patch (0 hunks)
docs/basics/index.md (1 hunks)
nemo_skills/__init__.py (1 hunks)
tests/gpu-tests/test-local.yaml (1 hunks)
tests/slurm-tests/run_all.sh (0 hunks)

💤 Files with no reviewable changes (3)

tests/slurm-tests/run_all.sh
dockerfiles/Dockerfile.sglang
dockerfiles/sglang.patch

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

dockerfiles/README.md

14-14: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: gpu-tests-qwen
GitHub Check: pre-commit
GitHub Check: unit-tests

🔇 Additional comments (3)

tests/gpu-tests/test-local.yaml (1)

20-20: Confirm upstream sglang image still keeps our Nemo patches.

Switching from the custom igitman/nemo-skills-sglang build to lmsysorg/sglang:v0.5.3rc1-cu126 removes the baked-in Nemo-Skills tweaks (post-load hook / instance method change). Please double-check that whatever mechanism we now rely on still patches the container at runtime so we don’t lose that functionality.

.github/workflows/gpu_tests.yml (1)

55-55: Ensure igitman/nemo-skills:0.7.1 exists on the GPU runners.

These self-hosted jobs never build the image; they only `docker run` it (here in the cleanup step, elsewhere for the tests). If `igitman/nemo-skills:0.7.1` isn’t already pushed to the registry, those jobs will start failing once this merges. Please confirm the tag is published or extend the workflow to build it.
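
A minimal sketch of such a check, assuming the Docker CLI on the runner can reach the registry: `docker manifest inspect` exits non-zero when no manifest is found for the reference. The wrapper name `image_tag_exists` is hypothetical.

```shell
# Return 0 if the registry serves a manifest for the given image reference.
# image_tag_exists is a hypothetical helper; it only wraps `docker manifest inspect`.
image_tag_exists() {
    docker manifest inspect "$1" >/dev/null 2>&1
}

# Example guard for a CI step (tag taken from this review comment):
# image_tag_exists igitman/nemo-skills:0.7.1 || {
#     echo "image tag is not published yet" >&2
#     exit 1
# }
```

Note that a failure can also mean the registry is unreachable or requires authentication, so a non-zero exit is not proof that the tag is missing.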

dockerfiles/Dockerfile.nemo-rl (1)

52-52: Double-check the rollback commit still includes our expected tooling.

By defaulting to 9301d36cbf847212430b84a27cfe6990f773b7cf, we need to be sure that scripts we invoke later (e.g., nemo_rl/utils/prefetch_venvs.py) and the patch we apply still exist and apply cleanly at that revision. Please verify the rollback doesn’t drop those assets.
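
One way to carry out this verification, sketched under the assumption that it runs inside a local clone of the nemo-rl repository: `git cat-file -e` exits zero only if the named object exists, and `<rev>:<path>` names a file as of that revision. The helper name `path_exists_at_rev` is hypothetical.

```shell
# Return 0 if PATH ($2) exists in the tree of revision REV ($1).
# Must be run from inside a clone that contains the revision.
path_exists_at_rev() {
    git cat-file -e "$1:$2" 2>/dev/null
}

# Example using the commit and script named in the comment above:
# path_exists_at_rev 9301d36cbf847212430b84a27cfe6990f773b7cf \
#     nemo_rl/utils/prefetch_venvs.py || echo "script missing at pinned commit" >&2
```

Whether the patch still applies cleanly is a separate question; `git apply --check <patch-file>` on a checkout of that revision is a common way to test it without modifying the tree.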


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

dockerfiles/Dockerfile.nemo-skills (1)
34-36: Remove unnecessary directory change.

Line 36 (RUN cd /opt/benchmarks) has no effect since each RUN command executes in a fresh shell. The working directory change doesn't persist to subsequent layers.

Apply this diff:
 RUN git clone https://github.com/allenai/IFBench.git /opt/benchmarks/IFBench --depth=1
 RUN cd /opt/benchmarks/IFBench && sed -i '/^unicodedata[=<>]*.*$/d' requirements.txt && pip install -r requirements.txt
-RUN cd /opt/benchmarks
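
The behaviour can be reproduced outside Docker as a rough analogy: each `sh -c` call below stands in for one `RUN` instruction, that is, a separate shell process. This is only a sketch of the shell semantics, not of Docker's layer mechanics; in a Dockerfile, `WORKDIR` is the instruction that changes the directory for subsequent steps.

```shell
# Each `sh -c` below stands in for one Dockerfile RUN instruction:
# a separate shell process that starts from the same working directory.
start_dir=$(pwd -P)
sh -c 'cd /tmp'                 # like: RUN cd /opt/benchmarks (no lasting effect)
after_dir=$(sh -c 'pwd -P')     # like: the next RUN instruction
echo "before: $start_dir"
echo "after:  $after_dir"       # same directory as before
```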

🧹 Nitpick comments (1)

dockerfiles/Dockerfile.nemo-skills (1)
45-45: Add apt cache cleanup after package removal.

For consistency with Docker best practices and to reduce image size, add cache cleanup after the apt remove operation.

Apply this diff:
-RUN apt remove -y python3-blinker
+RUN apt-get remove -y python3-blinker && \
+    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81cc258 and 5f689e9.

📒 Files selected for processing (1)

dockerfiles/Dockerfile.nemo-skills (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (3)

dockerfiles/Dockerfile.nemo-skills (3)

1-2: LGTM: Base image change aligns with arm64 support goals.

The switch from Debian to Ubuntu 22.04 LTS is appropriate for simplifying Apptainer installation on arm64 architectures.

4-14: LGTM: Clean installation with proper cache cleanup.

The apt cache cleanup follows Docker best practices as discussed in previous reviews. The single RUN command reduces image layers effectively.

41-43: LGTM: Gorilla repo pinned to specific commit.

Pinning to a specific commit (d217799) ensures reproducible builds and prevents unexpected breaking changes from upstream updates.

coderabbitai · 2025-09-29T23:51:45Z

dockerfiles/Dockerfile.nemo-skills

+# Update package lists and install apptainer for arm64
+# https://apptainer.org/docs/admin/1.1/installation.html
+RUN apt update && \
+    apt install -y software-properties-common && \
+    add-apt-repository -y ppa:apptainer/ppa && \
+    apt update && apt -y install apptainer && \
+    add-apt-repository -y ppa:apptainer/ppa && \
+    apt update && apt install -y apptainer-suid && \
+    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*


⚠️ Potential issue | 🟡 Minor

Remove duplicate PPA addition and consolidate apt operations.

Line 24 duplicates line 22 by adding the same PPA twice. The repository only needs to be added once, and both packages can be installed in a single operation.

Apply this diff to consolidate the operations:

-# Update package lists and install apptainer for arm64
-# https://apptainer.org/docs/admin/1.1/installation.html
-RUN apt update && \
-    apt install -y software-properties-common && \
-    add-apt-repository -y ppa:apptainer/ppa && \
-    apt update && apt -y install apptainer && \
-    add-apt-repository -y ppa:apptainer/ppa && \
-    apt update && apt install -y apptainer-suid && \
-    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*
+# Update package lists and install apptainer for arm64
+# https://apptainer.org/docs/admin/1.1/installation.html
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository -y ppa:apptainer/ppa && \
+    apt-get update && \
+    apt-get install -y apptainer apptainer-suid && \
+    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*


🤖 Prompt for AI Agents

In dockerfiles/Dockerfile.nemo-skills around lines 18 to 26, the Dockerfile adds the same PPA twice and runs redundant apt update/install steps. Remove the duplicate add-apt-repository call, keeping a single one after software-properties-common is installed; then run one apt-get update and install apptainer and apptainer-suid in a single apt-get install -y invocation. Keep the cleanup (rm -rf /var/cache/apt/archives /var/lib/apt/lists/*) at the end of the same RUN so the removed files do not persist in the image layer.


Kipok added 7 commits September 27, 2025 20:45

Update dockerfile to be installable on arm64

17604f5

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Rollback changes to stable commit

16482e7

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix blinker issue

f61b6bd

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Remove nemo-aligner test

83883ec

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update container versions

04476f2

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix for bfcl

73c2975

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into igitman/aarch64

81cc258

Kipok added the run GPU tests label Sep 28, 2025

coderabbitai bot reviewed Sep 28, 2025

View reviewed changes

activatedgeek requested changes Sep 29, 2025

View reviewed changes

dockerfiles/Dockerfile.nemo-skills Outdated Show resolved Hide resolved

Kipok added 2 commits September 29, 2025 16:16

Merge branch 'main' into igitman/aarch64

7a2b4a3

Fixes for main dockerfile

5f689e9

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai bot reviewed Sep 29, 2025

View reviewed changes

Kipok added 2 commits September 29, 2025 17:21

Update timeout

c7d15a7

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into igitman/aarch64

04bf125

Kipok changed the title ~~Adding support for amr64 containers~~ Adding support for arm64 containers Sep 30, 2025

Merge branch 'main' into igitman/aarch64

df2450d

activatedgeek approved these changes Sep 30, 2025

View reviewed changes

Kipok merged commit 74b8ed8 into main Sep 30, 2025
6 checks passed

Kipok deleted the igitman/aarch64 branch September 30, 2025 16:06

wasiahmad pushed a commit that referenced this pull request Oct 1, 2025

Adding support for arm64 containers (#856)

2650a75

Signed-off-by: Igor Gitman <igitman@nvidia.com>

SeanNaren removed the run GPU tests label Oct 2, 2025

SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025

Adding support for arm64 containers (NVIDIA-NeMo#856)

0627d20

Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

SeanNaren pushed a commit that referenced this pull request Oct 9, 2025

Adding support for arm64 containers (#856)

77757ed

Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

coderabbitai bot mentioned this pull request Dec 18, 2025

Unable to run multi-node scripts using Enroot images NVIDIA-NeMo/RL#1657

Closed

coderabbitai bot mentioned this pull request Jan 29, 2026

Upgrade containers #1198

Merged

coderabbitai bot mentioned this pull request Feb 11, 2026

feat: migrate sandbox from uwsgi to gunicorn #1232

Closed

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Adding support for arm64 containers (#856)

27e4117

Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Kipok merged 12 commits into main from igitman/aarch64
