Merged
12 changes: 10 additions & 2 deletions .github/workflows/gpu_tests.yml
@@ -40,6 +40,10 @@ jobs:
pip install -e .
pip install -r requirements/common-tests.txt
ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval math-500 amc23 aime24
- name: Build Docker image
run: |
cd ${{ github.run_id }}
docker build -t nemo-skills-image -f dockerfiles/Dockerfile.nemo-skills .
- name: Run GPU tests
timeout-minutes: 240
env:
@@ -52,7 +56,7 @@ jobs:
- name: Cleanup
if: always()
run: |
docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.7.1 bash -c 'rm -rf /tmp/nemo-skills-tests /home/azureuser/.nemo_run/'
docker run --rm -v /tmp:/tmp -v /home:/home nemo-skills-image bash -c 'rm -rf /tmp/nemo-skills-tests /home/azureuser/.nemo_run/'
docker ps -a -q | xargs -r docker stop

gpu-tests-qwen:
@@ -79,6 +83,10 @@ jobs:
pip install -e .
pip install -r requirements/common-tests.txt
ns prepare_data gsm8k human-eval mbpp algebra222 mmlu ifeval math-500 amc23 aime24
- name: Build Docker image
run: |
cd ${{ github.run_id }}
docker build -t nemo-skills-image -f dockerfiles/Dockerfile.nemo-skills .
- name: Run GPU tests
timeout-minutes: 240
env:
@@ -91,5 +99,5 @@ jobs:
- name: Cleanup
if: always()
run: |
docker run --rm -v /tmp:/tmp -v /home:/home igitman/nemo-skills:0.7.1 bash -c 'rm -rf /tmp/nemo-skills-tests /home/azureuser/.nemo_run/'
docker run --rm -v /tmp:/tmp -v /home:/home nemo-skills-image bash -c 'rm -rf /tmp/nemo-skills-tests /home/azureuser/.nemo_run/'
docker ps -a -q | xargs -r docker stop
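The cleanup line works even when no containers are left because of `xargs -r`. A daemon-free sketch of that behavior, with `echo` standing in for `docker stop` (`-r`, a.k.a. `--no-run-if-empty`, is a GNU findutils extension):

```shell
# `xargs -r` skips the command entirely when stdin is empty, so the
# cleanup step succeeds with zero containers instead of running
# `docker stop` with no arguments and failing.
empty=$(printf '' | xargs -r echo docker stop)
some=$(printf 'abc123\ndef456\n' | xargs -r echo docker stop)
echo "no containers -> [$empty]"   # no containers -> []
echo "two containers -> [$some]"   # two containers -> [docker stop abc123 def456]
```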
22 changes: 3 additions & 19 deletions .github/workflows/tests.yml
@@ -25,37 +25,21 @@ jobs:
with:
python-version: "3.10"
cache: pip
- name: Detect Docker changes
id: changes
uses: dorny/paths-filter@v3
with:
filters: |
docker:
- 'dockerfiles/Dockerfile.sandbox'
- 'dockerfiles/Dockerfile.nemo-skills'
- 'nemo_skills/code_execution/local_sandbox/**'
- 'requirements/**'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -e .[dev]
- name: Build Images
if: steps.changes.outputs.docker == 'true'
run: |
# these tags need to match the ones in tests/gpu-tests/test-local.yaml
docker build -t igitman/nemo-skills:0.7.1 -f dockerfiles/Dockerfile.nemo-skills .
docker build -t igitman/nemo-skills-sandbox:0.7.1 -f dockerfiles/Dockerfile.sandbox .
- name: Pull Images
if: steps.changes.outputs.docker != 'true'
run: |
docker pull igitman/nemo-skills:0.7.1
docker pull igitman/nemo-skills-sandbox:0.7.1
docker build -t nemo-skills-image -f dockerfiles/Dockerfile.nemo-skills .
docker build -t nemo-skills-sandbox-image -f dockerfiles/Dockerfile.sandbox .
- name: Run all tests
env:
NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
run: |
docker run --rm --network=host igitman/nemo-skills-sandbox:0.7.1 &
docker run --rm --network=host nemo-skills-sandbox-image &
sleep 10
set -o pipefail # this will make sure next line returns non-0 exit code if tests fail
ns prepare_data gsm8k math-500
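The `set -o pipefail` above is needed because the test invocation is piped; without it, a pipeline reports only the last stage's exit status. A minimal bash demonstration:

```shell
# Exit status of a failing command piped through `cat`, with and without
# pipefail. The `if` guards keep the demo safe under `set -e`.
set +o pipefail
if sh -c 'exit 3' | cat; then a=0; else a=$?; fi
echo "without pipefail: $a"   # without pipefail: 0 (failure masked by cat)
set -o pipefail
if sh -c 'exit 3' | cat; then b=0; else b=$?; fi
echo "with pipefail: $b"      # with pipefail: 3 (failure propagates)
set +o pipefail               # restore the default
```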
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -1,2 +1,4 @@
recursive-include nemo_skills *.yaml
recursive-include nemo_skills *.txt
graft dockerfiles
graft requirements
16 changes: 8 additions & 8 deletions cluster_configs/example-local.yaml
@@ -18,12 +18,12 @@ containers:
trtllm: nvcr.io/nvidia/tensorrt-llm/release:1.0.0
vllm: vllm/vllm-openai:v0.10.1.1
sglang: lmsysorg/sglang:v0.5.3rc1-cu126
nemo: igitman/nemo-skills-nemo:0.7.0
megatron: igitman/nemo-skills-megatron:0.7.0
sandbox: igitman/nemo-skills-sandbox:0.7.1
nemo-skills: igitman/nemo-skills:0.7.1
verl: igitman/nemo-skills-verl:0.7.0
nemo-rl: igitman/nemo-skills-nemo-rl:0.7.1
# dockerfile: for now can only specify relative to repo root
megatron: dockerfile:dockerfiles/Dockerfile.megatron
sandbox: dockerfile:dockerfiles/Dockerfile.sandbox
nemo-skills: dockerfile:dockerfiles/Dockerfile.nemo-skills
verl: dockerfile:dockerfiles/Dockerfile.verl
nemo-rl: dockerfile:dockerfiles/Dockerfile.nemo-rl

# add required mounts for models/data here
# the code is mounted automatically inside /nemo_run/code
@@ -34,8 +34,8 @@ containers:
# - /mnt/datadrive/models:/models
# - /mnt/datadrive/data:/data
# - /home/<username>/workspace:/workspace
# you can also override container libraries by directly mounting over them. E.g. to override NeMo-Aligner do
# - <...>/NeMo-Aligner:/opt/NeMo-Aligner
# you can also override container libraries by directly mounting over them. E.g. to override NeMo-RL do
# - <...>/NeMo-RL:/opt/NeMo-RL

# define any environment variables. Note that HF_HOME is required by default and needs to be a mounted path!
# env_vars:
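The `dockerfile:` prefix in the config above tells nemo-skills to build the image locally before launching the job. If you prefer to pre-build everything yourself, a dry-run sketch of collecting the build commands from such a config (the `<key>-image` tag naming is an illustrative assumption; the real resolution happens inside nemo-skills):

```shell
# Collect a `docker build` command for every `dockerfile:`-prefixed
# container. Commands are printed, not executed; pipe to `sh` to build.
config=$(mktemp)
cat > "$config" <<'EOF'
containers:
  sandbox: dockerfile:dockerfiles/Dockerfile.sandbox
  nemo-skills: dockerfile:dockerfiles/Dockerfile.nemo-skills
  vllm: vllm/vllm-openai:v0.10.1.1
EOF
cmds=$(grep ': dockerfile:' "$config" | while read -r line; do
  name=${line%%:*}            # container key, e.g. nemo-skills
  path=${line#*dockerfile:}   # path after the dockerfile: prefix
  printf 'docker build -t %s-image -f %s .\n' "$name" "$path"
done)
printf '%s\n' "$cmds"
rm -f "$config"
```

Plain registry references like the `vllm` entry are left untouched; only `dockerfile:` values are turned into builds.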
11 changes: 2 additions & 9 deletions cluster_configs/example-slurm.yaml
@@ -15,15 +15,8 @@
executor: slurm

containers:
trtllm: nvcr.io/nvidia/tensorrt-llm/release:1.0.0
vllm: vllm/vllm-openai:v0.10.1.1
sglang: lmsysorg/sglang:v0.5.3rc1-cu126
nemo: igitman/nemo-skills-nemo:0.7.0
megatron: igitman/nemo-skills-megatron:0.7.0
sandbox: igitman/nemo-skills-sandbox:0.7.1
nemo-skills: igitman/nemo-skills:0.7.1
verl: igitman/nemo-skills-verl:0.7.0
nemo-rl: igitman/nemo-skills-nemo-rl:0.7.1
# follow steps in https://nvidia-nemo.github.io/Skills/basics/#slurm-inference
# to complete this section

job_name_prefix: "nemo_skills:"

169 changes: 0 additions & 169 deletions dockerfiles/Dockerfile.nemo

This file was deleted.

17 changes: 9 additions & 8 deletions dockerfiles/Dockerfile.nemo-skills
@@ -36,14 +36,6 @@ RUN cd /opt/gorilla/berkeley-function-call-leaderboard && pip install -e .

RUN apt remove -y python3-blinker

RUN mkdir -p /opt/NeMo-Skills/requirements
COPY pyproject.toml README.md /opt/NeMo-Skills/
COPY nemo_skills /opt/NeMo-Skills/nemo_skills/
COPY requirements /opt/NeMo-Skills/requirements/
# installing sdp in container only
RUN pip install git+https://github.com/NVIDIA/NeMo-speech-data-processor@29b9b1ec0ceaf3ffa441c1d01297371b3f8e11d2
RUN cd /opt/NeMo-Skills && pip install -e .

# ifbench
RUN git clone https://github.com/allenai/IFBench.git /opt/benchmarks/IFBench --depth=1
RUN cd /opt/benchmarks/IFBench && pip install -r requirements.txt
@@ -55,3 +47,12 @@ RUN cd /opt/benchmarks/IFBench && git apply ifbench.patch
RUN pip install langdetect absl-py immutabledict nltk ipython && \
python -c "import nltk; from spacy.cli import download; nltk.download('punkt'); nltk.download('punkt_tab'); \
nltk.download('stopwords'); nltk.download('averaged_perceptron_tagger_eng'); download('en_core_web_sm')"

# we aren't copying main nemo_skills folder as it will always be mounted from host
# but we do want to install all requirements in the container directly
RUN mkdir -p /opt/NeMo-Skills/requirements
COPY pyproject.toml README.md /opt/NeMo-Skills/
COPY requirements /opt/NeMo-Skills/requirements/
# installing sdp in container only
RUN pip install git+https://github.com/NVIDIA/NeMo-speech-data-processor@29b9b1ec0ceaf3ffa441c1d01297371b3f8e11d2
RUN pip install -r /opt/NeMo-Skills/requirements/main.txt
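Because the image now carries only the requirements, the `nemo_skills` sources have to be mounted in at runtime. A dry-run sketch of such an invocation; the `/opt/NeMo-Skills` mount target and the editable install are assumptions based on the COPY destinations above:

```shell
# Build (and print) a docker run command for an image that expects the
# source tree mounted from the host. Printed rather than executed so it
# can be inspected first; drop the echo to actually run it.
repo=$PWD   # host checkout of NeMo-Skills
cmd="docker run --rm -v $repo:/opt/NeMo-Skills nemo-skills-image bash -c 'pip install -e /opt/NeMo-Skills && ns --help'"
echo "$cmd"
```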
Comment on lines +51 to +58

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that all container usage contexts properly mount the nemo_skills code.

This is a significant architectural change: the image no longer contains the nemo_skills source code and instead expects it to be mounted from the host at runtime. While this is excellent for local development workflows (allowing live code changes), it introduces a dependency on proper mount configuration.

Ensure that:

  1. All places that use this container (CI workflows, documentation examples, cluster configs) properly mount the code directory
  2. Error messages are clear if the container is run without the required mount
  3. Documentation clearly explains this requirement

🏁 Script executed:

```shell
#!/bin/bash
# Verify that container usage includes proper code mounting

echo "=== Checking workflow files for code mounting ==="
rg -n -B2 -A2 'nemo-skills-image|nemo-skills-sandbox-image' .github/workflows/

echo -e "\n=== Checking if resolve_container_image or related code handles mounting ==="
rg -n -B3 -A3 'resolve_container_image' nemo_skills/pipeline/

echo -e "\n=== Checking documentation for mount requirements ==="
rg -n -B2 -A2 'dockerfile:dockerfiles/Dockerfile.nemo-skills' docs/
```

Length of output: 6127


Let me examine the core Docker execution code and mount handling:


---

🏁 Script executed:

```shell
#!/bin/bash
# Look at the DockerExecutor class and how it handles mounting

echo "=== Finding DockerExecutor class ==="
rg -n 'class DockerExecutor' nemo_skills/

echo -e "\n=== Looking at DockerExecutor instantiation in exp.py ==="
rg -n -B5 -A10 'DockerExecutor(' nemo_skills/pipeline/utils/exp.py

echo -e "\n=== Looking at mounts.py file ==="
rg -n 'def get_mounts_from_config' nemo_skills/

echo -e "\n=== Checking how docker executor actually runs containers ==="
rg -n 'docker run' nemo_skills/
```

Length of output: 460



---

🏁 Script executed:

```shell
#!/bin/bash
# Let's look at actual mount configurations

echo "=== Looking at get_mounts_from_config in mounts.py ==="
rg -n -A30 'def get_mounts_from_config' nemo_skills/pipeline/utils/mounts.py

echo -e "\n=== Looking at start_local_sandbox.sh docker run command ==="
cat nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh | head -50

echo -e "\n=== Checking for nemo_skills source mount patterns ==="
rg -n 'nemo_skills' nemo_skills/pipeline/utils/mounts.py | head -20
```

Length of output: 2872



---

🏁 Script executed:

```shell
#!/bin/bash
# Check the Dockerfile to see what paths it expects at runtime

echo "=== Full Dockerfile.nemo-skills ==="
cat dockerfiles/Dockerfile.nemo-skills

echo -e "\n=== Looking for default cluster configs ==="
fd -n 'cluster.*\.yaml|config.*\.yaml' docs/ examples/ | head -20

echo -e "\n=== Checking if nemo_skills is installed as a package ==="
rg -n 'pip install.*nemo.skills|setup.py|pyproject.toml' dockerfiles/Dockerfile.nemo-skills
```

Length of output: 2899



---

🏁 Script executed:

```shell
#!/bin/bash
# Find cluster configs and understand mount setup

echo "=== Look for cluster config examples ==="
find docs examples -name "*.yaml" -o -name "*.yml" 2>/dev/null | head -20

echo -e "\n=== Check exp.py for local executor mount setup ==="
rg -n -B10 -A20 'executor.*local' nemo_skills/pipeline/utils/exp.py | head -60

echo -e "\n=== Check where nemo_skills code is imported at runtime ==="
rg -n 'from nemo_skills|import nemo_skills' nemo_skills/pipeline/ | head -10
```

Length of output: 3691



---

🏁 Script executed:

```shell
#!/bin/bash
# Final check: look for documentation about the mounting requirement

echo "=== Check README for mounting instructions ==="
rg -n -B2 -A2 'mount|volume|docker.*run' README.md | head -40

echo -e "\n=== Look for any setup or quickstart docs ==="
find docs -name "*.md" -type f | xargs grep -l "docker.*mount\|volume" 2>/dev/null | head -5

echo -e "\n=== Check if there's a getting started doc ==="
ls -la docs/*.md docs/*/index.md 2>/dev/null | head -20
```

Length of output: 572



---

🏁 Script executed:

```shell
#!/bin/bash
# Quick check: does pyproject.toml get installed as a package in the container?

echo "=== Check if pyproject.toml is used to install nemo_skills as package ==="
rg -n 'pip install.*pyproject|pip install -e' dockerfiles/Dockerfile.nemo-skills

echo -e "\n=== Check what's in main.txt requirements ==="
head -30 requirements/main.txt

echo -e "\n=== Check if nemo_skills is even listed in requirements ==="
grep -i 'nemo.skills\|nemo-skills' requirements/main.txt
```

Length of output: 1426


Fix the incomplete architectural transition: source code mounting not implemented in workflows or documentation.

This is a breaking change that's incompletely implemented:

  1. Dockerfile expects mounting (line 51 comment) but the code isn't in the container
  2. Workflows don't mount the source code:
    • tests.yml: docker run --rm --network=host nemo-skills-sandbox-image (no mounts)
    • gpu_tests.yml: docker run --rm -v /tmp:/tmp -v /home:/home nemo-skills-image (mounts /tmp and /home only)
  3. No automatic mounting mechanism exists in DockerExecutor: `volumes=mounts` only includes user-specified mounts from the cluster config
  4. No documentation explains this requirement

Required fixes:

  • Add source code mount to CI workflows (e.g., -v $(pwd):/opt/NeMo-Skills)
  • Document the mounting requirement in README and docs
  • Consider adding default mounting for local executor or clear error handling if mount is missing
🤖 Prompt for AI Agents
In dockerfiles/Dockerfile.nemo-skills around lines 51 to 58, the Dockerfile and
project assume the NeMo-Skills source will be mounted into the container but CI
workflows, the DockerExecutor, and documentation were not updated to provide or
require that mount; update CI workflow files (tests.yml and gpu_tests.yml) to
mount the repository into the container (e.g., -v $(pwd):/opt/NeMo-Skills or
equivalent CI-safe path), update DockerExecutor to either add a sensible default
source mount for local runs or detect missing mount and fail early with a clear
error, and add a short note in README/docs explaining that the container expects
the host source to be mounted at /opt/NeMo-Skills and how to run CI/locally with
the mount; ensure the mount path used in workflows matches the path referenced
in the Dockerfile.

2 changes: 1 addition & 1 deletion dockerfiles/README.md
Expand Up @@ -4,7 +4,7 @@ Some dockerfiles are directly included in this folder and for some others the in

The dockerfiles can be built using the standard docker build command. e.g.,
```shell
docker build -t igitman/nemo-skills:0.7.1 -f dockerfiles/Dockerfile.nemo-skills .
docker build -t nemo-skills-image:0.7.1 -f dockerfiles/Dockerfile.nemo-skills .
```

In addition, we provide a utility script which provides sane build defaults
33 changes: 32 additions & 1 deletion docs/basics/index.md
@@ -98,9 +98,12 @@ config might look like
executor: local

containers:
# some containers are public and we pull them
trtllm: nvcr.io/nvidia/tensorrt-llm/release:1.0.0
vllm: vllm/vllm-openai:v0.10.1.1
nemo: igitman/nemo-skills-nemo:0.7.0
# some containers are custom and we will build them locally before running the job
# you can always pre-build them as well
nemo-skills: dockerfile:dockerfiles/Dockerfile.nemo-skills
# ... there are some more containers defined here

env_vars:
@@ -172,6 +175,34 @@ leverage a Slurm cluster[^2]. Let's setup our cluster config for that case by ru
This time pick `slurm` for the config type and fill out all other required information
(such as ssh access, account, partition, etc.).

!!! note
If you're an NVIDIA employee, we have pre-configured cluster configs for internal usage with pre-built sqsh
containers available at <https://gitlab-master.nvidia.com/igitman/nemo-skills-configs>. You can most likely
skip the step below and reuse one of the existing configurations.

You will also need to build `.sqsh` files for all containers, or upload all `dockerfile:...` containers to
a registry (e.g. Docker Hub) and reference the uploaded versions. To build the sqsh files you can use the following commands:

1. Build images locally and upload to some container registry. E.g.
```bash
docker build -t gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1 -f dockerfiles/Dockerfile.nemo-skills .
docker push gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
```
2. Start an interactive shell, e.g. with the following (assuming there is a "cpu" partition)
```bash
srun -A <account> --partition cpu --job-name build-sqsh --time=1:00:00 --exclusive --pty /bin/bash -l
```
3. Import the image, e.g.:
```bash
enroot import -o /path/to/nemo-skills-image.sqsh --docker://gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
```
4. Specify this image path in your cluster config
```yaml
containers:
nemo-skills: /path/to/nemo-skills-image.sqsh
```
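The four steps above can be collected into one dry-run script. The registry and image names are the docs' own examples and the sqsh path is the same placeholder, so substitute your own values; commands are printed rather than run:

```shell
# Dry-run sketch of the sqsh build flow: build + push locally, then
# import on the cluster (the import runs inside the srun shell of step 2).
img="gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1"
sqsh="/path/to/nemo-skills-image.sqsh"   # placeholder path from the docs
build_cmd="docker build -t $img -f dockerfiles/Dockerfile.nemo-skills ."
push_cmd="docker push $img"
import_cmd="enroot import -o $sqsh docker://$img"
printf '%s\n' "$build_cmd" "$push_cmd" "$import_cmd"
```

The resulting `.sqsh` path is what goes under `containers:` in the cluster config (step 4).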
Comment on lines +178 to +204

⚠️ Potential issue | 🟡 Minor

Fix markdown linting issues in the documentation.

The new section provides helpful guidance, but there are a couple of formatting issues:

  1. Line 180: The bare URL should be formatted as a proper markdown link or wrapped in angle brackets
  2. Line 204: The closing code fence is missing a language specifier (should be ```yaml instead of just ```)

Apply these fixes:

For line 180 - replace the bare URL with a proper markdown link:

-    containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely
+    containers available at [GitLab](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs). You can most likely

For line 204 - the code block starting earlier should close with a language identifier. Based on the content showing yaml examples, verify the code fence at line 204 properly closes the yaml block.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

180-180: Bare URL used

(MD034, no-bare-urls)


204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/basics/index.md around lines 178-204, fix two markdown lint issues:
replace the bare URL on line 180 with a proper markdown link (e.g.
[pre-configured
configs](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs) or wrap
it in angle brackets) and ensure the code block that demonstrates the YAML
cluster config is closed correctly by using a matching fenced code block with
the yaml language identifier (change the trailing ``` to ```yaml).


Now that we have a slurm config set up, we can try running some jobs. Generally, you will need to upload models / data
to the cluster manually and then reference a proper mounted path. But for small-scale things we can also leverage the
[code packaging](./code-packaging.md) functionality that nemo-skills provides. Whenever you run any of the ns commands
2 changes: 1 addition & 1 deletion docs/basics/sandbox.md
@@ -18,7 +18,7 @@ Most of the time, the pipeline scripts will launch sandbox automatically when re
it manually, you can use the following command

```bash
docker run --rm --network=host igitman/nemo-skills-sandbox:0.7.1
./nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh
```
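The tests workflow launches the sandbox the same way, in the background, and sleeps before first use. The pattern looks like this; `sleep 60` stands in for the actual script so the sketch runs without docker:

```shell
# Background-launch pattern from the test workflow: start the sandbox,
# remember its pid, give it time to come up, and stop it when done.
start_sandbox() { sleep 60; }    # stand-in for start_local_sandbox.sh
start_sandbox &
pid=$!
sleep 1                          # CI uses `sleep 10` before first use
kill -0 "$pid"                   # exit status 0 means it is still up
echo "sandbox running as pid $pid"
kill "$pid" 2>/dev/null
wait "$pid" 2>/dev/null || true  # reap; non-zero status from the kill is fine
echo "sandbox stopped"
```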

If docker is not available, you can still run a sandbox (although less efficient version) like this