Adding support for arm64 containers #856

Merged: Kipok merged 12 commits into main from igitman/aarch64 on Sep 30, 2025

Kipok (Collaborator) commented Sep 28, 2025

  1. Switched the nemo-skills container base from Debian to Ubuntu to make it easier to install Apptainer on arm64
  2. Upgraded the trtllm and sglang containers
  3. Added instructions for building on arm64
  4. Removed the nemo-aligner Slurm test, since support for it will soon be removed entirely
  5. Rolled back the nemo-rl Dockerfile to a stable version (0.7.0)

Summary by CodeRabbit

  • New Features

    • Added ARM64/aarch64 build guide and multi-platform build instructions.
    • Introduced a benchmarks environment for builds.
  • Chores

    • Upgraded default container images: TensorRT‑LLM → 1.0.0, SGLang → v0.5.3rc1‑cu126, NeMo‑Skills → 0.7.1.
    • Updated base image and dependency installation for build images.
  • Documentation

    • Updated Docker build README and docs to reflect new images and ARM64 notes.
  • Tests

    • Bumped CI/test image tags, simplified SLURM test path, and increased one test execution timeout.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
coderabbitai bot (Contributor) commented Sep 28, 2025

Walkthrough

Updated container image tags across CI, cluster configs, docs, and tests; modified Dockerfiles (base image, installs, apptainer, commit pin) and removed sglang patch application; adjusted sglang loader behavior; removed one nemo-aligner test invocation; added arm64 build guidance and benchmark setup.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CI workflows: .github/workflows/gpu_tests.yml, .github/workflows/tests.yml | Bumped nemo-skills image tag from 0.7.0 → 0.7.1 in build/pull/cleanup steps; no control-flow changes. |
| Configs & defaults (image tags): cluster_configs/example-local.yaml, cluster_configs/example-slurm.yaml, nemo_skills/__init__.py, tests/gpu-tests/test-local.yaml, docs/basics/index.md | Updated container references: trtllm release:0.21.0 → release:1.0.0, sglang igitman/...:0.7.0 → lmsysorg/sglang:v0.5.3rc1-cu126 (docs add arm64 variants), nemo-skills 0.7.0 → 0.7.1. |
| Dockerfiles (core images & builds): dockerfiles/Dockerfile.nemo-skills, dockerfiles/Dockerfile.sglang, dockerfiles/Dockerfile.nemo-rl | Dockerfile.nemo-skills: switch base to ubuntu:22.04, rewrite apt/pip/apptainer installs, add benchmarks clone/setup. Dockerfile.sglang: removed base-image pin and patch-apply steps. Dockerfile.nemo-rl: updated NEMO_RL_COMMIT hash. |
| Docker build docs: dockerfiles/README.md | Added arm64/aarch64 build guide and updated referenced image tags (nemo-skills, trtllm, sglang, vllm) including arm64 variants. |
| SGLang patch content: dockerfiles/sglang.patch | Removed remaining-keys validation; converted a ShardedStateLoader method from @staticmethod to instance method; added optional post_load_weights hook and ensured model.eval() is returned. |
| Slurm tests script: tests/slurm-tests/run_all.sh | Removed nemo-aligner omr_simple_recipe invocation and its sleep 10; script now runs only the nemo-rl path and final wait. |
| Tests tweak: tests/test_code_execution.py | Increased sandbox.execute_code timeout to timeout=60 in test_lean4_mathlib_code_execution. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Runner as Slurm Test Runner
  participant Job as omr_simple_recipe
  note over Runner,Job: tests/slurm-tests/run_all.sh (after change)
  Runner->>Job: Run with backend=nemo-rl
  Job-->>Runner: Complete
  Runner-->>Runner: Final wait
sequenceDiagram
  autonumber
  participant Client as Caller
  participant Loader as ShardedStateLoader (instance)
  participant Model as Model
  note over Loader: loader is now an instance method and may call post_load_weights
  Client->>Loader: load_sharded_state(...)
  Loader->>Model: load_state_dict(...)
  alt Model has post_load_weights
    Loader->>Model: post_load_weights()
    Note right of Model: "Post loading weights"
  end
  Loader-->>Client: return model.eval()
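
The second diagram describes an optional-hook pattern: the loader calls post_load_weights only when the model defines it, then returns the model in eval mode. A minimal sketch of that control flow follows. It uses toy stand-in classes rather than SGLang's actual loader or a real PyTorch module, so the names PlainModel, HookedModel, ToyLoader, and load_model are assumptions made for illustration.

```python
class PlainModel:
    """Toy stand-in for a model WITHOUT the optional hook (hypothetical class)."""

    def __init__(self):
        self.weights = {}
        self.training = True

    def load_state_dict(self, state):
        self.weights.update(state)

    def eval(self):
        # Mirrors the convention of returning the model itself from eval().
        self.training = False
        return self


class HookedModel(PlainModel):
    """Toy model that opts in to post-load processing (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.post_loaded = False

    def post_load_weights(self):
        # Placeholder for work that must happen after weights are in place.
        self.post_loaded = True


class ToyLoader:
    """Hypothetical loader; load_model is an instance method, as in the diagram."""

    def load_model(self, model, state):
        model.load_state_dict(state)
        # Call the hook only if the model provides one.
        hook = getattr(model, "post_load_weights", None)
        if callable(hook):
            hook()
        return model.eval()


loader = ToyLoader()
hooked = loader.load_model(HookedModel(), {"w": 1})
plain = loader.load_model(PlainModel(), {"w": 1})
```

Looking the hook up with getattr keeps models that do not define it working unchanged, which matches the "alt" branch in the diagram.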

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

A rabbit hops through tags anew,
From 0.7.0 to .1 we flew.
Triton hums at 1.0 bright,
Loader whispers weights at night.
Slurm trims a path — carrot cheers! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title "Adding support for arm64 containers" succinctly and accurately captures the primary purpose of the pull request, which revolves around updating container base images, build instructions, and CI workflows to enable arm64 support. It is a clear, single-sentence description that focuses on the most significant change without extraneous details. |

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
dockerfiles/README.md (1)

13-16: Wrap the bare URL to satisfy markdownlint.

markdownlint (MD034) is flagging the bare link added on Line 14. Please wrap it in angle brackets or convert it into Markdown link syntax to keep the docs lint-clean.
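
As an illustration of the fix being requested, the sketch below wraps bare URLs in angle brackets. It is a heuristic, not markdownlint's own logic: the function name wrap_bare_urls and the regular expression are assumptions, and it does not account for URLs inside code spans or fenced blocks.

```python
import re

# Match http(s) URLs that are not already preceded by "<", "(" or "[",
# i.e. not already autolinked or used as a Markdown link target.
_BARE_URL = re.compile(r"(?<![<(\[])\bhttps?://[^\s<>()\[\]]+")


def wrap_bare_urls(line: str) -> str:
    """Return line with bare URLs rewritten as <url> autolinks."""

    def _wrap(match: re.Match) -> str:
        url = match.group(0)
        trailing = ""
        # Keep sentence punctuation outside the angle brackets.
        while url and url[-1] in ".,;:!?":
            trailing = url[-1] + trailing
            url = url[:-1]
        return f"<{url}>{trailing}"

    return _BARE_URL.sub(_wrap, line)


fixed = wrap_bare_urls("See https://example.com/guide.")
untouched = wrap_bare_urls("See <https://example.com/guide> or [docs](https://example.com).")
```

Here example.com is a placeholder; the actual URL flagged in the README is not shown in this review.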

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec5fb53 and 81cc258.

📒 Files selected for processing (13)
  • .github/workflows/gpu_tests.yml (2 hunks)
  • .github/workflows/tests.yml (1 hunks)
  • cluster_configs/example-local.yaml (1 hunks)
  • cluster_configs/example-slurm.yaml (1 hunks)
  • dockerfiles/Dockerfile.nemo-rl (1 hunks)
  • dockerfiles/Dockerfile.nemo-skills (2 hunks)
  • dockerfiles/Dockerfile.sglang (0 hunks)
  • dockerfiles/README.md (1 hunks)
  • dockerfiles/sglang.patch (0 hunks)
  • docs/basics/index.md (1 hunks)
  • nemo_skills/__init__.py (1 hunks)
  • tests/gpu-tests/test-local.yaml (1 hunks)
  • tests/slurm-tests/run_all.sh (0 hunks)
💤 Files with no reviewable changes (3)
  • tests/slurm-tests/run_all.sh
  • dockerfiles/Dockerfile.sglang
  • dockerfiles/sglang.patch
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
dockerfiles/README.md

14-14: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: gpu-tests-qwen
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (3)
tests/gpu-tests/test-local.yaml (1)

20-20: Confirm upstream sglang image still keeps our Nemo patches.

Switching from the custom igitman/nemo-skills-sglang build to lmsysorg/sglang:v0.5.3rc1-cu126 removes the baked-in Nemo-Skills tweaks (post-load hook / instance method change). Please double-check that whatever mechanism we now rely on still patches the container at runtime so we don’t lose that functionality.

.github/workflows/gpu_tests.yml (1)

55-55: Ensure igitman/nemo-skills:0.7.1 exists on the GPU runners.

These self-hosted jobs never build the image; they only docker run it (here in the cleanup step, elsewhere for the tests). If igitman/nemo-skills:0.7.1 isn't already pushed to the registry, those jobs will start failing once this merges. Please confirm the tag is published or extend the workflow to build it.

dockerfiles/Dockerfile.nemo-rl (1)

52-52: Double-check the rollback commit still includes our expected tooling.

By defaulting to 9301d36cbf847212430b84a27cfe6990f773b7cf, we need to be sure that scripts we invoke later (e.g., nemo_rl/utils/prefetch_venvs.py) and the patch we apply still exist and apply cleanly at that revision. Please verify the rollback doesn’t drop those assets.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
dockerfiles/Dockerfile.nemo-skills (1)

34-36: Remove unnecessary directory change.

Line 36 (RUN cd /opt/benchmarks) has no effect since each RUN command executes in a fresh shell. The working directory change doesn't persist to subsequent layers.

Apply this diff:

 RUN git clone https://github.com/allenai/IFBench.git /opt/benchmarks/IFBench --depth=1
 RUN cd /opt/benchmarks/IFBench && sed -i '/^unicodedata[=<>]*.*$/d' requirements.txt && pip install -r requirements.txt
-RUN cd /opt/benchmarks
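
The behaviour behind this comment can be reproduced outside Docker. The sketch below is only an analogy, not how Docker is implemented: it treats each RUN step as a separate shell process started from the same directory, and shows that a cd in one process does not carry over to the next. The helper names run_step and demo are hypothetical, and the example assumes a POSIX sh is available.

```python
import os
import subprocess
import tempfile


def run_step(command: str, workdir: str) -> str:
    """Run one command in a fresh shell that starts in workdir, like one RUN step."""
    result = subprocess.run(
        ["sh", "-c", command],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo() -> tuple[str, str, str]:
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.realpath(tmp)  # resolve symlinks so paths compare cleanly
        # Step 1: the cd succeeds, but the shell exits right afterwards.
        run_step("mkdir -p sub && cd sub", base)
        # Step 2: a new shell starts from the original directory again.
        separate = run_step("pwd -P", base)
        # Chaining with && keeps cd and the command in the same shell.
        chained = run_step("cd sub && pwd -P", base)
        return base, separate, chained


base, separate, chained = demo()
```

In a Dockerfile, the equivalent options are to chain commands with && inside one RUN, as the IFBench line in the diff above already does, or to set WORKDIR, which does persist across instructions.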
🧹 Nitpick comments (1)
dockerfiles/Dockerfile.nemo-skills (1)

45-45: Add apt cache cleanup after package removal.

For consistency with Docker best practices and to reduce image size, add cache cleanup after the apt remove operation.

Apply this diff:

-RUN apt remove -y python3-blinker
+RUN apt-get remove -y python3-blinker && \
+    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81cc258 and 5f689e9.

📒 Files selected for processing (1)
  • dockerfiles/Dockerfile.nemo-skills (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (3)
dockerfiles/Dockerfile.nemo-skills (3)

1-2: LGTM: Base image change aligns with arm64 support goals.

The switch from Debian to Ubuntu 22.04 LTS is appropriate for simplifying Apptainer installation on arm64 architectures.


4-14: LGTM: Clean installation with proper cache cleanup.

The apt cache cleanup follows Docker best practices as discussed in previous reviews. The single RUN command reduces image layers effectively.


41-43: LGTM: Gorilla repo pinned to specific commit.

Pinning to a specific commit (d217799) ensures reproducible builds and prevents unexpected breaking changes from upstream updates.

Comment on lines +18 to +26
# Update package lists and install apptainer for arm64
# https://apptainer.org/docs/admin/1.1/installation.html
RUN apt update && \
    apt install -y software-properties-common && \
    add-apt-repository -y ppa:apptainer/ppa && \
    apt update && apt -y install apptainer && \
    add-apt-repository -y ppa:apptainer/ppa && \
    apt update && apt install -y apptainer-suid && \
    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*

⚠️ Potential issue | 🟡 Minor

Remove duplicate PPA addition and consolidate apt operations.

Line 24 duplicates line 22 by adding the same PPA twice. The repository only needs to be added once, and both packages can be installed in a single operation.

Apply this diff to consolidate the operations:

-# Update package lists and install apptainer for arm64
-# https://apptainer.org/docs/admin/1.1/installation.html
-RUN apt update && \
-    apt install -y software-properties-common && \
-    add-apt-repository -y ppa:apptainer/ppa && \
-    apt update && apt -y install apptainer && \
-    add-apt-repository -y ppa:apptainer/ppa && \
-    apt update && apt install -y apptainer-suid && \
-    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*
+# Update package lists and install apptainer for arm64
+# https://apptainer.org/docs/admin/1.1/installation.html
+RUN apt-get update && \
+    apt-get install -y software-properties-common && \
+    add-apt-repository -y ppa:apptainer/ppa && \
+    apt-get update && \
+    apt-get install -y apptainer apptainer-suid && \
+    rm -rf /var/cache/apt/archives /var/lib/apt/lists/*
🤖 Prompt for AI Agents
In dockerfiles/Dockerfile.nemo-skills around lines 18 to 26, the Dockerfile adds
the same PPA twice and performs multiple separate apt updates/installs; remove
the duplicate add-apt-repository call (keep a single one), keep the whole
sequence in a single RUN command, install apptainer and apptainer-suid in one
apt-get install -y invocation after the PPA has been added
(software-properties-common must still be installed first, since it provides
add-apt-repository), and keep the cleanup (rm -rf /var/cache/apt/archives
/var/lib/apt/lists/*) at the end of that RUN to minimize layers and avoid
redundant repository additions.
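
A check for this kind of duplication could be scripted. The sketch below scans Dockerfile text for add-apt-repository calls that name the same PPA more than once. It is a plain regular-expression scan rather than a real Dockerfile parser, and the function name find_duplicate_ppas is hypothetical.

```python
import re
from collections import Counter

# Capture the "ppa:owner/name" argument, skipping option flags such as -y.
_PPA = re.compile(r"add-apt-repository\s+(?:-\S+\s+)*(ppa:\S+)")


def find_duplicate_ppas(dockerfile_text: str) -> list[str]:
    """Return PPAs that are added more than once, sorted by name."""
    counts = Counter(_PPA.findall(dockerfile_text))
    return sorted(ppa for ppa, n in counts.items() if n > 1)


# The command chain as originally submitted, joined onto one line for brevity.
original = " && ".join([
    "RUN apt update",
    "apt install -y software-properties-common",
    "add-apt-repository -y ppa:apptainer/ppa",
    "apt update && apt -y install apptainer",
    "add-apt-repository -y ppa:apptainer/ppa",
    "apt update && apt install -y apptainer-suid",
])

# The consolidated chain from the suggested diff.
consolidated = " && ".join([
    "RUN apt-get update",
    "apt-get install -y software-properties-common",
    "add-apt-repository -y ppa:apptainer/ppa",
    "apt-get update",
    "apt-get install -y apptainer apptainer-suid",
])

duplicates_before = find_duplicate_ppas(original)
duplicates_after = find_duplicate_ppas(consolidated)
```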

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Kipok changed the title from "Adding support for amr64 containers" to "Adding support for arm64 containers" on Sep 30, 2025
Kipok merged commit 74b8ed8 into main on Sep 30, 2025
6 checks passed
Kipok deleted the igitman/aarch64 branch on September 30, 2025 16:06
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Jan 29, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>