Conversation
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
WalkthroughUpdated container image tags across CI, cluster configs, docs, and tests; modified Dockerfiles (base image, installs, apptainer, commit pin) and removed sglang patch application; adjusted sglang loader behavior; removed one nemo-aligner test invocation; added arm64 build guidance and benchmark setup. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Runner as Slurm Test Runner
participant Job as omr_simple_recipe
note over Runner,Job: tests/slurm-tests/run_all.sh (after change)
Runner->>Job: Run with backend=nemo-rl
Job-->>Runner: Complete
Runner-->>Runner: Final wait
sequenceDiagram
autonumber
participant Client as Caller
participant Loader as ShardedStateLoader (instance)
participant Model as Model
note over Loader: loader is now an instance method and may call post_load_weights
Client->>Loader: load_sharded_state(...)
Loader->>Model: load_state_dict(...)
alt Model has post_load_weights
Loader->>Model: post_load_weights()
Note right of Model: "Post loading weights"
end
Loader-->>Client: return model.eval()
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🧪 Early access (Sonnet 4.5): enabledWe are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience. Note:
Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
dockerfiles/README.md (1)
13-16: Wrap the bare URL to satisfy markdownlint.
markdownlint(MD034) is flagging the bare link added on Line 14. Please wrap it in angle brackets or convert it into Markdown link syntax to keep the docs lint-clean.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
.github/workflows/gpu_tests.yml(2 hunks).github/workflows/tests.yml(1 hunks)cluster_configs/example-local.yaml(1 hunks)cluster_configs/example-slurm.yaml(1 hunks)dockerfiles/Dockerfile.nemo-rl(1 hunks)dockerfiles/Dockerfile.nemo-skills(2 hunks)dockerfiles/Dockerfile.sglang(0 hunks)dockerfiles/README.md(1 hunks)dockerfiles/sglang.patch(0 hunks)docs/basics/index.md(1 hunks)nemo_skills/__init__.py(1 hunks)tests/gpu-tests/test-local.yaml(1 hunks)tests/slurm-tests/run_all.sh(0 hunks)
💤 Files with no reviewable changes (3)
- tests/slurm-tests/run_all.sh
- dockerfiles/Dockerfile.sglang
- dockerfiles/sglang.patch
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
dockerfiles/README.md
14-14: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: gpu-tests-qwen
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (3)
tests/gpu-tests/test-local.yaml (1)
20-20: Confirm upstream sglang image still keeps our Nemo patches.Switching from the custom
igitman/nemo-skills-sglangbuild tolmsysorg/sglang:v0.5.3rc1-cu126removes the baked-in Nemo-Skills tweaks (post-load hook / instance method change). Please double-check that whatever mechanism we now rely on still patches the container at runtime so we don’t lose that functionality..github/workflows/gpu_tests.yml (1)
55-55: Ensure igitman/nemo-skills:0.7.1 exists on the GPU runners.These self-hosted jobs never build the image; they only
docker runit (here in the cleanup step, elsewhere for the tests). Ifigintman/nemo-skills:0.7.1isn’t already pushed to the registry those jobs will start failing once this merges. Please confirm the tag is published or extend the workflow to build it.dockerfiles/Dockerfile.nemo-rl (1)
52-52: Double-check the rollback commit still includes our expected tooling.By defaulting to
9301d36cbf847212430b84a27cfe6990f773b7cf, we need to be sure that scripts we invoke later (e.g.,nemo_rl/utils/prefetch_venvs.py) and the patch we apply still exist and apply cleanly at that revision. Please verify the rollback doesn’t drop those assets.
Signed-off-by: Igor Gitman <igitman@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
dockerfiles/Dockerfile.nemo-skills (1)
34-36: Remove unnecessary directory change.Line 36 (
RUN cd /opt/benchmarks) has no effect since each RUN command executes in a fresh shell. The working directory change doesn't persist to subsequent layers.Apply this diff:
RUN git clone https://github.com/allenai/IFBench.git /opt/benchmarks/IFBench --depth=1 RUN cd /opt/benchmarks/IFBench && sed -i '/^unicodedata[=<>]*.*$/d' requirements.txt && pip install -r requirements.txt -RUN cd /opt/benchmarks
🧹 Nitpick comments (1)
dockerfiles/Dockerfile.nemo-skills (1)
45-45: Add apt cache cleanup after package removal.For consistency with Docker best practices and to reduce image size, add cache cleanup after the
apt removeoperation.Apply this diff:
-RUN apt remove -y python3-blinker +RUN apt-get remove -y python3-blinker && \ + rm -rf /var/cache/apt/archives /var/lib/apt/lists/*
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
dockerfiles/Dockerfile.nemo-skills(2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (3)
dockerfiles/Dockerfile.nemo-skills (3)
1-2: LGTM: Base image change aligns with arm64 support goals.The switch from Debian to Ubuntu 22.04 LTS is appropriate for simplifying Apptainer installation on arm64 architectures.
4-14: LGTM: Clean installation with proper cache cleanup.The apt cache cleanup follows Docker best practices as discussed in previous reviews. The single RUN command reduces image layers effectively.
41-43: LGTM: Gorilla repo pinned to specific commit.Pinning to a specific commit (
d217799) ensures reproducible builds and prevents unexpected breaking changes from upstream updates.
| # Update package lists and install apptainer for arm64 | ||
| # https://apptainer.org/docs/admin/1.1/installation.html | ||
| RUN apt update && \ | ||
| apt install -y software-properties-common && \ | ||
| add-apt-repository -y ppa:apptainer/ppa && \ | ||
| apt update && apt -y install apptainer && \ | ||
| add-apt-repository -y ppa:apptainer/ppa && \ | ||
| apt update && apt install -y apptainer-suid && \ | ||
| rm -rf /var/cache/apt/archives /var/lib/apt/lists/* |
There was a problem hiding this comment.
Remove duplicate PPA addition and consolidate apt operations.
Line 24 duplicates line 22 by adding the same PPA twice. The repository only needs to be added once, and both packages can be installed in a single operation.
Apply this diff to consolidate the operations:
-# Update package lists and install apptainer for arm64
-# https://apptainer.org/docs/admin/1.1/installation.html
-RUN apt update && \
- apt install -y software-properties-common && \
- add-apt-repository -y ppa:apptainer/ppa && \
- apt update && apt -y install apptainer && \
- add-apt-repository -y ppa:apptainer/ppa && \
- apt update && apt install -y apptainer-suid && \
- rm -rf /var/cache/apt/archives /var/lib/apt/lists/*
+# Update package lists and install apptainer for arm64
+# https://apptainer.org/docs/admin/1.1/installation.html
+RUN apt-get update && \
+ apt-get install -y software-properties-common && \
+ add-apt-repository -y ppa:apptainer/ppa && \
+ apt-get update && \
+ apt-get install -y apptainer apptainer-suid && \
+ rm -rf /var/cache/apt/archives /var/lib/apt/lists/*📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Update package lists and install apptainer for arm64 | |
| # https://apptainer.org/docs/admin/1.1/installation.html | |
| RUN apt update && \ | |
| apt install -y software-properties-common && \ | |
| add-apt-repository -y ppa:apptainer/ppa && \ | |
| apt update && apt -y install apptainer && \ | |
| add-apt-repository -y ppa:apptainer/ppa && \ | |
| apt update && apt install -y apptainer-suid && \ | |
| rm -rf /var/cache/apt/archives /var/lib/apt/lists/* | |
| # Update package lists and install apptainer for arm64 | |
| # https://apptainer.org/docs/admin/1.1/installation.html | |
| RUN apt-get update && \ | |
| apt-get install -y software-properties-common && \ | |
| add-apt-repository -y ppa:apptainer/ppa && \ | |
| apt-get update && \ | |
| apt-get install -y apptainer apptainer-suid && \ | |
| rm -rf /var/cache/apt/archives /var/lib/apt/lists/* |
🤖 Prompt for AI Agents
In dockerfiles/Dockerfile.nemo-skills around lines 18 to 26, the Dockerfile adds
the same PPA twice and performs multiple separate apt updates/installs; remove
the duplicate add-apt-repository call (keep a single one), consolidate apt
update and apt install into a single RUN command that installs
software-properties-common, apptainer, and apptainer-suid in one apt -y install
invocation, and keep the cleanup (rm -rf /var/cache/apt/archives
/var/lib/apt/lists/*) at the end of that RUN to minimize layers and avoid
redundant repository additions.
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
New Features
Chores
Documentation
Tests