Switch to building containers on-the-fly for local runs by Kipok · Pull Request #969 · NVIDIA-NeMo/Skills

Kipok · 2025-10-17T22:40:49Z

Summary by CodeRabbit

New Features
- Local container build support: tooling and CI now build images from repository Dockerfiles and accept dockerfile-based image specs.
Documentation
- Updated local and Slurm guides with step‑by‑step local image build/import instructions and revised sandbox startup guidance.
Chores
- CI and cluster templates moved away from preset remote image tags toward locally-built image references; packaging now includes build assets.
Bug Fix / UX
- Mount existence check relaxed for non‑Slurm flows; tests default to a safer fallback image.

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai · 2025-10-17T22:41:04Z

Walkthrough

Workflows, configs, and Python utilities now prefer repository Dockerfile-based local builds over prebuilt public images; a Dockerfile resolver/builder was added, one large NeMo Dockerfile removed, Dockerfiles/manifest/docs/tests updated, and CI/local image names switched to local image names.

Changes

Cohort / File(s)	Summary
GitHub Workflows `.github/workflows/gpu_tests.yml`, `.github/workflows/tests.yml`	CI jobs now build local images (`nemo-skills-image`, `nemo-skills-sandbox-image`) from repo Dockerfiles, removed Docker-change detection/pull paths, and updated cleanup/test steps to reference local image names.
Dockerfiles & manifest `dockerfiles/Dockerfile.nemo` (deleted), `dockerfiles/Dockerfile.nemo-skills`, `dockerfiles/README.md`, `MANIFEST.in`	Deleted `Dockerfile.nemo`; `Dockerfile.nemo-skills` reworked to expect host-mounted `nemo_skills` and install requirements in-container; README example tag updated to `nemo-skills-image`; `MANIFEST.in` grafts `dockerfiles` and `requirements`.
Cluster configs & tests `cluster_configs/example-local.yaml`, `cluster_configs/example-slurm.yaml`, `tests/gpu-tests/test-local.yaml`	Local configs switched image refs from public tags to `dockerfile:dockerfiles/Dockerfile.*`; SLURM example removed many predefined images and added guidance comments; test config updated to use dockerfile refs.
Runtime mapping & setup `nemo_skills/__init__.py`, `nemo_skills/pipeline/setup.py`	`_containers` mapping changed to `dockerfile:dockerfiles/...` entries; `setup.py` now conditionally includes containers only for local configs, writes YAML with slurm guidance, and prompts/performs pull/build for local executor.
Docker image resolver & exports `nemo_skills/pipeline/utils/docker_images.py` (new), `nemo_skills/pipeline/utils/__init__.py`	New module resolves `dockerfile:` specs, validates paths, computes digests, builds images via Docker CLI, and returns resolved image names; `resolve_container_image` re-exported in utils.
Executor integration `nemo_skills/pipeline/utils/exp.py`	Local executor path now calls `resolve_container_image()` so DockerExecutor receives resolved (built) image names.
Mount checks `nemo_skills/pipeline/utils/mounts.py`	`check_remote_mount_directories` no longer performs local existence checks for non-slurm paths (returns early); slurm/remote checks unchanged.
Tests helper `tests/conftest.py`	Default image_name fallback changed from config `containers["nemo-skills"]` to literal `"bash"`; command wrapper uses `bash -c '{command}'`.
Docs & sandbox `docs/basics/index.md`, `docs/basics/sandbox.md`	Docs updated to show locally-built container examples, .sqsh workflows, and a local start script for sandbox instead of Docker run examples.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant Setup as setup.py
    participant PullBuild as pull_docker_containers()
    participant Resolver as resolve_container_image()
    participant Docker as Docker CLI
    participant Executor as DockerExecutor

    User->>Setup: request create local config
    Setup->>User: include containers with "dockerfile:..." refs
    User-->>Setup: confirm/prompt to pull/build
    Setup->>PullBuild: pull_docker_containers(containers)
    activate PullBuild
    loop per container
        alt container starts with "dockerfile:"
            PullBuild->>Resolver: resolve_container_image(container, config)
            Resolver->>Resolver: validate path & compute digest
            Resolver->>Docker: docker build -f <path> -t <local-name>
            Docker-->>Resolver: build result (tag)
            Resolver-->>PullBuild: resolved image name
        else
            PullBuild->>Docker: docker pull <image>
            Docker-->>PullBuild: image ready
        end
    end
    deactivate PullBuild
    PullBuild->>Executor: provide resolved images
    Executor->>Docker: run containers (mount host code)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble Dockerfiles at dawn,

I stitch images from the repo lawn.
Host code mounted, builds unfold,
Local tags now brave and bold.
Hop, run, and log — new stories told.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "Switch to building containers on-the-fly for local runs" accurately and clearly describes the main change in the pull request. The changeset systematically transitions from using pre-built public Docker images (igitman/...) to building container images locally via Dockerfiles, as evidenced by changes across workflow files, configuration files, Python code for container resolution, and documentation updates. The title is concise, specific, and directly related to the core objective of the PR, making it clear to someone scanning the commit history what the primary change entails.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch igitman/containers

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

nemo_skills/pipeline/setup.py (1)

39-52: Consider more specific exception handling, but acceptable for CLI context.

The function now correctly handles both dockerfile-based and registry-based containers. The broad exception catching on lines 44 and 51 could potentially hide unexpected errors, but for a user-facing CLI command that aims to continue processing even if some containers fail, this is a reasonable trade-off.

If you want to be more precise, consider catching specific exceptions like subprocess.SubprocessError, ValueError, or RuntimeError instead of bare Exception.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c52273c and 84da1ef.

📒 Files selected for processing (16)

.github/workflows/gpu_tests.yml (4 hunks)
.github/workflows/tests.yml (1 hunks)
MANIFEST.in (1 hunks)
cluster_configs/example-local.yaml (2 hunks)
cluster_configs/example-slurm.yaml (1 hunks)
dockerfiles/Dockerfile.nemo (0 hunks)
dockerfiles/Dockerfile.nemo-skills (1 hunks)
dockerfiles/README.md (1 hunks)
docs/basics/index.md (2 hunks)
docs/basics/sandbox.md (1 hunks)
nemo_skills/__init__.py (1 hunks)
nemo_skills/pipeline/setup.py (4 hunks)
nemo_skills/pipeline/utils/__init__.py (1 hunks)
nemo_skills/pipeline/utils/docker_images.py (1 hunks)
nemo_skills/pipeline/utils/exp.py (2 hunks)
tests/gpu-tests/test-local.yaml (1 hunks)

💤 Files with no reviewable changes (1)

dockerfiles/Dockerfile.nemo

🧰 Additional context used

🧬 Code graph analysis (4)

nemo_skills/pipeline/utils/__init__.py (1)

nemo_skills/pipeline/utils/docker_images.py (1)

resolve_container_image (87-97)

nemo_skills/pipeline/utils/docker_images.py (2)

nemo_skills/utils.py (1)

get_logger_name (131-135)

nemo_skills/pipeline/utils/declarative.py (1)

run (343-484)

nemo_skills/pipeline/setup.py (1)

nemo_skills/pipeline/utils/docker_images.py (1)

resolve_container_image (87-97)

nemo_skills/pipeline/utils/exp.py (1)

nemo_skills/pipeline/utils/docker_images.py (1)

resolve_container_image (87-97)

🪛 markdownlint-cli2 (0.18.1)

docs/basics/index.md

180-180: Bare URL used

(MD034, no-bare-urls)

204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.0)

nemo_skills/pipeline/utils/docker_images.py

37-37: Avoid specifying long messages outside the exception class

(TRY003)

43-43: Avoid specifying long messages outside the exception class

(TRY003)

46-48: Avoid specifying long messages outside the exception class

(TRY003)

50-50: Avoid specifying long messages outside the exception class

(TRY003)

65-65: subprocess call: check for execution of untrusted input

(S603)

66-74: Starting a process with a partial executable path

(S607)

78-80: Avoid specifying long messages outside the exception class

(TRY003)

82-82: Avoid specifying long messages outside the exception class

(TRY003)

92-92: Avoid specifying long messages outside the exception class

(TRY003)

96-96: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/pipeline/setup.py

44-44: Do not catch blind exception: Exception

(BLE001)

49-49: subprocess call: check for execution of untrusted input

(S603)

49-49: Starting a process with a partial executable path

(S607)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (24)

dockerfiles/README.md (1)

7-7: LGTM! Documentation updated to reflect local builds.

The example now uses a generic local image tag instead of a public registry reference, which aligns with the PR's shift to on-the-fly container builds.

docs/basics/sandbox.md (1)

21-21: LGTM! Simplified sandbox launch command.

The update references a local start script instead of a Docker run command, making the process more straightforward for users and consistent with the new local-build approach.

MANIFEST.in (1)

3-4: LGTM! Essential directories included for local builds.

Adding dockerfiles and requirements to the distribution is necessary for the new dockerfile-based local build workflow. This ensures users have access to the files needed to build images on-the-fly.

cluster_configs/example-slurm.yaml (1)

18-19: LGTM! Guidance-based approach for SLURM container configuration.

The file now directs users to documentation instead of providing explicit container image definitions. This aligns with the new dockerfile-based approach and keeps the example config more maintainable.

nemo_skills/pipeline/utils/docker_images.py (5)

15-27: LGTM! Module setup is correct.

The imports, logger setup, and constants are appropriate. The _REPO_ROOT computation correctly navigates up three directory levels from the file location to reach the repository root.

29-31: LGTM! Robust sanitization for Docker image names.

The function conservatively sanitizes image components to lowercase alphanumeric characters and hyphens, which is safe for Docker naming requirements.

34-52: LGTM! Secure path validation with proper containment checks.

The function correctly validates Dockerfile paths with important security checks:

Rejects absolute paths

Resolves symlinks and relative components

Ensures paths stay within the repository root

Validates file existence and type

The static analysis hints about TRY003 are stylistic preferences and don't affect the security or correctness of the implementation.

55-84: Verify: Docker image caching behavior based on Dockerfile content only.

The image digest is computed solely from the Dockerfile contents (line 59), not from the build context or source files. This means:

Images are rebuilt only when the Dockerfile itself changes

Changes to source code or other context files won't trigger rebuilds

The code will be mounted at runtime instead

Confirm this is the intended caching strategy for local development workflows.

The static analysis warnings about subprocess execution (S603/S607) are expected for Docker CLI usage and don't represent security concerns here since the Dockerfile path is validated and the build context is restricted to the repository root.

87-97: LGTM! Clear entry point with appropriate validation.

The function correctly:

Passes through non-dockerfile container specifications unchanged

Enforces that dockerfile-based builds only work with the local executor

Validates the dockerfile path is provided

Delegates to the build function

The static analysis hints are stylistic and don't affect functionality.

nemo_skills/pipeline/utils/__init__.py (1)

34-34: LGTM! Proper public API exposure.

The resolve_container_image function is correctly exported from the utils module, making it available for use across the pipeline codebase.

nemo_skills/pipeline/utils/exp.py (2)

37-37: LGTM! Import added for container resolution.

The import statement is correctly placed and will be used in the local executor path.

200-202: LGTM! Seamless integration of dockerfile-based image resolution.

The container resolution is correctly integrated into the local executor flow. The resolved image reference is used for the DockerExecutor, enabling on-the-fly image builds when dockerfile specifications are provided.

.github/workflows/gpu_tests.yml (3)

43-46: LGTM! Docker image build step added for GPU tests.

The workflow now builds the nemo-skills-image locally from dockerfiles/Dockerfile.nemo-skills before running tests. This aligns with the PR's objective to build containers on-the-fly.

86-89: LGTM! Consistent Docker build step for Qwen tests.

The build step matches the Llama tests configuration, ensuring both GPU test jobs follow the same local build approach.

59-59: LGTM! Cleanup steps updated to use local image reference.

The cleanup commands now reference nemo-skills-image instead of the previous public registry image, maintaining consistency with the new local build workflow.

Also applies to: 102-102

cluster_configs/example-local.yaml (1)

21-26: LGTM! Clear transition to dockerfile-based container definitions.

The switch from public images to local dockerfile references is well-documented with helpful comments. This aligns with the PR's objective to build containers on-the-fly for local runs.

tests/gpu-tests/test-local.yaml (1)

21-25: LGTM! Consistent with the new dockerfile-based approach.

The container references are properly updated to use dockerfile specifications, matching the pattern established in other config files.

.github/workflows/tests.yml (1)

44-46: Verify comment accuracy about matching tags.

The comment states "these tags need to match the ones in tests/gpu-tests/test-local.yaml", but test-local.yaml uses dockerfile: references rather than direct image tags. The tags here (nemo-skills-image, nemo-skills-sandbox-image) appear to be build-time names for local use in this workflow.

Consider clarifying the comment to explain these are local build tags for CI, not references that need to match the config file.

nemo_skills/__init__.py (1)

22-26: LGTM! Consistent container reference updates.

The _containers mapping now properly uses dockerfile-based references, aligning with the broader PR changes to support local builds.

docs/basics/index.md (1)

101-106: LGTM! Clear documentation of the new build approach.

The updated comments accurately reflect that some containers are now built locally from Dockerfiles rather than pulled from registries.

nemo_skills/pipeline/setup.py (4)

25-25: LGTM! Proper import for dockerfile-based image resolution.

The import of resolve_container_image enables the new build-from-dockerfile functionality for local containers.

93-96: LGTM! Appropriate separation of concerns between executor types.

Only including the containers mapping for local configs makes sense, as slurm configs require different handling (sqsh files or registry images). The config is properly initialized with just the executor, and containers are added conditionally.

215-225: LGTM! Clean YAML generation with helpful slurm guidance.

The approach of generating YAML content as a string and then replacing the executor line with a helpful comment block for slurm configs is a pragmatic solution that provides clear guidance to users.

227-246: LGTM! Clear messaging adapted to executor type.

The conditional messaging and prompt text properly reflect the different workflows for local (pull/build containers) versus slurm (manual sqsh setup) configurations.

coderabbitai · 2025-10-17T22:47:05Z

dockerfiles/Dockerfile.nemo-skills

+# we aren't copying main nemo_skills folder as it will always be mounted from host
+# but we do want to install all requirements in the container directly
+RUN mkdir -p /opt/NeMo-Skills/requirements
+COPY pyproject.toml README.md /opt/NeMo-Skills/
+COPY requirements /opt/NeMo-Skills/requirements/
+# installing sdp in container only
+RUN pip install git+https://github.com/NVIDIA/NeMo-speech-data-processor@29b9b1ec0ceaf3ffa441c1d01297371b3f8e11d2
+RUN pip install -r /opt/NeMo-Skills/requirements/main.txt


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that all container usage contexts properly mount the nemo_skills code.

This is a significant architectural change: the image no longer contains the nemo_skills source code and instead expects it to be mounted from the host at runtime. While this is excellent for local development workflows (allowing live code changes), it introduces a dependency on proper mount configuration.

Ensure that:

All places that use this container (CI workflows, documentation examples, cluster configs) properly mount the code directory

Error messages are clear if the container is run without the required mount

Documentation clearly explains this requirement

🏁 Script executed:

#!/bin/bash # Verify that container usage includes proper code mounting echo "=== Checking workflow files for code mounting ===" rg -n -B2 -A2 'nemo-skills-image|nemo-skills-sandbox-image' .github/workflows/ echo -e "\n=== Checking if resolve_container_image or related code handles mounting ===" rg -n -B3 -A3 'resolve_container_image' nemo_skills/pipeline/ echo -e "\n=== Checking documentation for mount requirements ===" rg -n -B2 -A2 'dockerfile:dockerfiles/Dockerfile.nemo-skills' docs/

Length of output: 6127

Let me examine the core Docker execution code and mount handling:

--- 🏁 Script executed: ```shell #!/bin/bash # Look at the DockerExecutor class and how it handles mounting echo "=== Finding DockerExecutor class ===" rg -n 'class DockerExecutor' nemo_skills/ echo -e "\n=== Looking at DockerExecutor instantiation in exp.py ===" rg -n -B5 -A10 'DockerExecutor(' nemo_skills/pipeline/utils/exp.py echo -e "\n=== Looking at mounts.py file ===" rg -n 'def get_mounts_from_config' nemo_skills/ echo -e "\n=== Checking how docker executor actually runs containers ===" rg -n 'docker run' nemo_skills/

Length of output: 460

--- 🏁 Script executed: ```shell #!/bin/bash # Let's look at actual mount configurations echo "=== Looking at get_mounts_from_config in mounts.py ===" rg -n -A30 'def get_mounts_from_config' nemo_skills/pipeline/utils/mounts.py echo -e "\n=== Looking at start_local_sandbox.sh docker run command ===" cat nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh | head -50 echo -e "\n=== Checking for nemo_skills source mount patterns ===" rg -n 'nemo_skills' nemo_skills/pipeline/utils/mounts.py | head -20

Length of output: 2872

--- 🏁 Script executed: ```shell #!/bin/bash # Check the Dockerfile to see what paths it expects at runtime echo "=== Full Dockerfile.nemo-skills ===" cat dockerfiles/Dockerfile.nemo-skills echo -e "\n=== Looking for default cluster configs ===" fd -n 'cluster.*\.yaml|config.*\.yaml' docs/ examples/ | head -20 echo -e "\n=== Checking if nemo_skills is installed as a package ===" rg -n 'pip install.*nemo.skills|setup.py|pyproject.toml' dockerfiles/Dockerfile.nemo-skills

Length of output: 2899

--- 🏁 Script executed: ```shell #!/bin/bash # Find cluster configs and understand mount setup echo "=== Look for cluster config examples ===" find docs examples -name "*.yaml" -o -name "*.yml" 2>/dev/null | head -20 echo -e "\n=== Check exp.py for local executor mount setup ===" rg -n -B10 -A20 'executor.*local' nemo_skills/pipeline/utils/exp.py | head -60 echo -e "\n=== Check where nemo_skills code is imported at runtime ===" rg -n 'from nemo_skills|import nemo_skills' nemo_skills/pipeline/ | head -10

Length of output: 3691

--- 🏁 Script executed: ```shell #!/bin/bash # Final check: look for documentation about the mounting requirement echo "=== Check README for mounting instructions ===" rg -n -B2 -A2 'mount|volume|docker.*run' README.md | head -40 echo -e "\n=== Look for any setup or quickstart docs ===" find docs -name "*.md" -type f | xargs grep -l "docker.*mount\|volume" 2>/dev/null | head -5 echo -e "\n=== Check if there's a getting started doc ===" ls -la docs/*.md docs/*/index.md 2>/dev/null | head -20

Length of output: 572

--- 🏁 Script executed: ```shell #!/bin/bash # Quick check: does pyproject.toml get installed as a package in the container? echo "=== Check if pyproject.toml is used to install nemo_skills as package ===" rg -n 'pip install.*pyproject|pip install -e' dockerfiles/Dockerfile.nemo-skills echo -e "\n=== Check what's in main.txt requirements ===" head -30 requirements/main.txt echo -e "\n=== Check if nemo_skills is even listed in requirements ===" grep -i 'nemo.skills\|nemo-skills' requirements/main.txt

Length of output: 1426

Fix the incomplete architectural transition: source code mounting not implemented in workflows or documentation.

This is a breaking change that's incompletely implemented:

Dockerfile expects mounting (line 51 comment) but the code isn't in the container

Workflows don't mount the source code:

tests.yml: docker run --rm --network=host nemo-skills-sandbox-image (no mounts)

gpu_tests.yml: docker run --rm -v /tmp:/tmp -v /home:/home nemo-skills-image (mounts /tmp and /home only)

No automatic mounting mechanism exists in DockerExecutor — volumes=mounts only includes user-specified mounts from cluster config

No documentation explains this requirement

Required fixes:

Add source code mount to CI workflows (e.g., -v $(pwd):/opt/NeMo-Skills)

Document the mounting requirement in README and docs

Consider adding default mounting for local executor or clear error handling if mount is missing

🤖 Prompt for AI Agents

In dockerfiles/Dockerfile.nemo-skills around lines 51 to 58, the Dockerfile and project assume the NeMo-Skills source will be mounted into the container but CI workflows, the DockerExecutor, and documentation were not updated to provide or require that mount; update CI workflow files (tests.yml and gpu_tests.yml) to mount the repository into the container (e.g., -v $(pwd):/opt/NeMo-Skills or equivalent CI-safe path), update DockerExecutor to either add a sensible default source mount for local runs or detect missing mount and fail early with a clear error, and add a short note in README/docs explaining that the container expects the host source to be mounted at /opt/NeMo-Skills and how to run CI/locally with the mount; ensure the mount path used in workflows matches the path referenced in the Dockerfile.

coderabbitai · 2025-10-17T22:47:05Z

docs/basics/index.md

+!!! note
+    If you're an NVIDIA employee, we have a pre-configured cluster configs for internal usage with pre-built sqsh
+    containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely
+    skip the step below and reuse one of the existing configurations.
+
+You will also need to build .sqsh files for all containers or upload all `dockerfile:...` containers to
+some registry (e.g. dockerhub) and reference the uploaded versions. To build sqsh files you can use the following commands
+
+1. Build images locally and upload to some container registry. E.g.
+   ```bash
+   docker build -t gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1 -f dockerfiles/Dockerfile.nemo-skills .
+   docker push gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
+   ```
+2. Start an interactive shell, e.g. with the following (assuming there is a "cpu" partition)
+   ```bash
+   srun -A <account> --partition cpu --job-name build-sqsh --time=1:00:00 --exclusive --pty /bin/bash -l
+   ```
+3. Import the image, e.g.:
+   ```bash
+   enroot import -o /path/to/nemo-skills-image.sqsh --docker://gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
+   ```
+4. Specify this image path in your cluster config
+   ```yaml
+   containers:
+     nemo-skills: /path/to/nemo-skills-image.sqsh
+   ```
+```


⚠️ Potential issue | 🟡 Minor

Fix markdown linting issues in the documentation.

The new section provides helpful guidance, but there are a couple of formatting issues:

Line 180: The bare URL should be formatted as a proper markdown link or wrapped in angle brackets

Line 204: The closing code fence is missing a language specifier (should be ```yaml instead of just ```)

Apply these fixes:

For line 180 - replace the bare URL with a proper markdown link:

- containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely + containers available at [GitLab](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs). You can most likely

For line 204 - the code block starting earlier should close with a language identifier. Based on the content showing yaml examples, verify the code fence at line 204 properly closes the yaml block.

🧰 Tools

🪛 markdownlint-cli2 (0.18.1)

180-180: Bare URL used

(MD034, no-bare-urls)

204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

In docs/basics/index.md around lines 178-204, fix two markdown lint issues: replace the bare URL on line 180 with a proper markdown link (e.g. [pre-configured configs](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs) or wrap it in angle brackets) and ensure the code block that demonstrates the YAML cluster config is closed correctly by using a matching fenced code block with the yaml language identifier (change the trailing ``` to ```yaml).

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

.github/workflows/tests.yml (1)
42-43: Improve readiness verification for the background sandbox container.

The workflow starts a background container and waits a fixed 10 seconds before running tests. This sleep-based approach is fragile and can lead to intermittent test failures if the container takes longer to initialize or fails silently.

Consider adding a health check or readiness probe instead of relying on a fixed delay.

Example approach using a health check loop:
-        docker run --rm --network=host nemo-skills-sandbox-image &
-        sleep 10
+        docker run --rm --network=host --name nemo-skills-sandbox nemo-skills-sandbox-image &
+        # Wait for container to be ready (verify with health check or port)
+        for i in {1..30}; do
+          if docker exec nemo-skills-sandbox healthcheck-command &>/dev/null; then
+            echo "Container is ready"
+            break
+          fi
+          if [ $i -eq 30 ]; then echo "Container failed to start"; exit 1; fi
+          sleep 1
+        done
Alternatively, if the container exposes a port/service, verify connectivity before proceeding with tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b64ba26 and f139e70.

📒 Files selected for processing (2)

.github/workflows/tests.yml (1 hunks)
nemo_skills/pipeline/utils/mounts.py (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (3)

nemo_skills/pipeline/utils/mounts.py (2)

363-363: LGTM: Docstring accurately reflects function behavior.

The updated docstring correctly describes that this function checks directory existence rather than creating directories.

375-377: I need to gather more context about executor types and whether this behavior is intentional. Let me search for more information:

I need to understand the executor types better and check if this behavior change affects container mount handling:

Now let me check the order of operations and understand when these functions are called:

Based on my analysis of the codebase and verification of Docker bind mount behavior, I can now provide my final assessment:

Confirm the mount validation skip requires verification and poses a risk.

Docker's --mount flag does not automatically create directories if the source path doesn't exist and produces an error instead: "bind source path does not exist". The code change skips validation for non-slurm executors (lines 375-377), which conflicts with Docker's actual runtime behavior.

From the script output, I observed:

create_remote_directory() (lines 299-302) still attempts local directory creation via LocalTunnel for non-slurm executors

check_remote_mount_directories() (lines 375-377) now skips validation entirely for the same executor types

This asymmetry means directories may be created but never validated to exist before container startup

This creates a gap where configuration errors won't surface until the container attempts to mount, resulting in confusing runtime failures instead of clear validation-time errors during local development.

Recommendation: Verify this behavior change is intentional for the dockerfile-based local workflow. If directories are genuinely not required locally, add documentation explaining why. If this is unintended, the validation should not be skipped.

.github/workflows/tests.yml (1)

35-36: Line 34's comment is inaccurate; image tags don't need to match test-local.yaml format.

Dockerfiles exist and appear structurally sound. However, the comment on line 34 ("these tags need to match the ones in tests/gpu-tests/test-local.yaml") is misleading. The workflow uses Docker image tags ("nemo-skills-image", "nemo-skills-sandbox-image"), while test-local.yaml uses Apptainer/Singularity format ("dockerfile:dockerfiles/..."). These are different test frameworks and their naming conventions don't need to match.

Additionally, line 42's container readiness check relies on an arbitrary 10-second sleep with no health verification. If the sandbox service requires more time to start, tests will fail without clear diagnostics. Consider replacing this with a health check or explicit readiness validation.

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f139e70 and b823b48.

📒 Files selected for processing (1)

tests/conftest.py (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

coderabbitai · 2025-10-18T00:38:56Z

tests/conftest.py

    if image_name is None:
-        image_name = config["containers"]["nemo-skills"]
+        image_name = "bash"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that "bash" is the correct default image for tests.

Changing the default from config["containers"]["nemo-skills"] to the literal string "bash" is concerning. The public bash Docker image is a minimal shell environment without NeMo Skills dependencies, which would cause test failures.

If the intent is to remove the default entirely (forcing callers to specify images), consider raising an error instead. If a default is still needed, it should integrate with the new dockerfile-based resolution system mentioned in the PR (e.g., nemo_skills/pipeline/utils/docker_images.py).

Run the following script to verify all callers now pass explicit image_name:

🏁 Script executed:

#!/bin/bash # Description: Find all calls to docker_run without explicit image_name parameter # Search for docker_run calls that might rely on the default rg -n -A 2 -B 2 'docker_run\(' --type=py

Length of output: 1172

--- 🏁 Script executed: ```shell #!/bin/bash # Get full conftest.py to understand docker_run implementation and old defaults cat tests/conftest.py

Length of output: 2384

🏁 Script executed:

#!/bin/bash # Search for docker_images utilities integration and config references rg -n 'docker_images|config\[.*containers' --type=py -C 3

Length of output: 16462

🏁 Script executed:

#!/bin/bash # Check if "bash" image is used as default elsewhere or if it's a custom build rg -n '"bash"' --type=py tests/

Length of output: 48

Fix incomplete refactoring: docker_run default image must integrate with docker_images utilities or remove the default entirely.

The change from config["containers"]["nemo-skills"] to hardcoded "bash" is a breaking refactor that:

Leaves 3 callers broken: docker_rm() (line 66), docker_rm_and_mkdir() (line 72), and test code (line 181) all rely on the default and will now silently use a minimal bash image instead of the full NeMo Skills environment.

Ignores the docker_images infrastructure: The PR introduces nemo_skills/pipeline/utils/docker_images.py with resolve_container_image(), which is imported throughout the main codebase (setup.py, exp.py, etc.), but conftest.py doesn't use it—just hardcodes "bash".

Breaks the abstraction: The production pipeline is config-driven (cluster_config["containers"]["nemo-skills"]), but tests are now hardcoded to a public minimal image.

Either integrate conftest.py with the docker_images utilities for proper container resolution, or remove the default and require all callers to pass explicit image_name. Hardcoding "bash" is not maintainable.

🤖 Prompt for AI Agents

In tests/conftest.py around lines 27-28, the default image_name was hardcoded to "bash", which breaks callers and bypasses the docker_images resolution; replace this by either (A) invoking the existing docker_images.resolve_container_image or reading cluster_config["containers"]["nemo-skills"] to compute the default image (importing the resolver at top), or (B) remove the implicit default and raise/require callers to pass image_name (update docker_rm, docker_rm_and_mkdir and tests to supply it). Ensure imports are added and all callers updated if you choose option B.

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

tests/conftest.py (1)

27-28: The hardcoded "bash" default breaks callers and bypasses docker_images resolution.

This issue was already flagged in the previous review. The change from config["containers"]["nemo-skills"] to "bash" leaves docker_rm() (line 66) and docker_rm_and_mkdir() (line 72) using a minimal bash image instead of the NeMo Skills environment, and ignores the PR's docker_images infrastructure with resolve_container_image().

🧹 Nitpick comments (1)

tests/conftest.py (1)
42-42: Consider using absolute path /bin/bash for better robustness.

The change from /bin/bash -c to bash -c relies on PATH resolution instead of an explicit absolute path. While functionally equivalent in standard containers, using the absolute path is more explicit and portable.

Apply this diff to restore the absolute path:
-    full_command = f"bash -c '{command}'"
+    full_command = f"/bin/bash -c '{command}'"

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b823b48 and a451bcc.

📒 Files selected for processing (1)

tests/conftest.py (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Kipok added 7 commits October 17, 2025 12:24

Add logic to build containers on-the-fly

9c64f66

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Simplify and add docs

04a3f61

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fixes

4be4547

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update container references

1ee91cd

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into igitman/containers

02b2610

Update setup to build containers

e50ce6d

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add license

84da1ef

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok requested a review from activatedgeek October 17, 2025 22:41

coderabbitai bot reviewed Oct 17, 2025

View reviewed changes

Kipok added 3 commits October 17, 2025 16:35

Merge branch 'main' into igitman/containers

2e17547

Fix mount

b64ba26

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Skip mount check locally

f139e70

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai bot reviewed Oct 18, 2025

View reviewed changes

Update default image

b823b48

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai bot reviewed Oct 18, 2025

View reviewed changes

Fix bash command

a451bcc

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai bot reviewed Oct 18, 2025

View reviewed changes

Merge branch 'main' into igitman/containers

e3d14ef

activatedgeek approved these changes Oct 20, 2025

View reviewed changes

Kipok merged commit 617e893 into main Oct 20, 2025
7 checks passed

Kipok deleted the igitman/containers branch October 20, 2025 16:49

SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 22, 2025

Switch to building containers on-the-fly for local runs (NVIDIA-NeMo#969

1bf5057

) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Oct 29, 2025

Switch to building containers on-the-fly for local runs (#969)

0376d82

Signed-off-by: Igor Gitman <igitman@nvidia.com>

coderabbitai bot mentioned this pull request Dec 4, 2025

Port ICPC changes to IOI #1046

Merged

This was referenced Dec 18, 2025

Unable to run multi-node scripts using Enroot images NVIDIA-NeMo/RL#1657

Closed

Revert "Port ICPC changes to IOI (#1046)" #1130

Closed

coderabbitai bot mentioned this pull request Jan 15, 2026

Update nemo-rl to latest #1087

Merged

coderabbitai bot mentioned this pull request Jan 29, 2026

Upgrade containers #1198

Merged

coderabbitai bot mentioned this pull request Feb 13, 2026

Fix CI disk space for Docker image builds #1241

Merged

coderabbitai bot mentioned this pull request Feb 25, 2026

New slurm customization parameters (account, containers) #1209

Merged

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

Switch to building containers on-the-fly for local runs (#969)

0cee22f

Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch to building containers on-the-fly for local runs#969

Switch to building containers on-the-fly for local runs#969
Kipok merged 13 commits intomainfrom
igitman/containers

Kipok commented Oct 17, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 17, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 17, 2025

Uh oh!

coderabbitai bot Oct 17, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Oct 18, 2025

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Kipok commented Oct 17, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Oct 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Kipok commented Oct 17, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 17, 2025 •

edited

Loading