Skip to content

Switch to building containers on-the-fly for local runs#969

Merged
Kipok merged 13 commits intomainfrom
igitman/containers
Oct 20, 2025
Merged

Switch to building containers on-the-fly for local runs#969
Kipok merged 13 commits intomainfrom
igitman/containers

Conversation

@Kipok
Copy link
Collaborator

@Kipok Kipok commented Oct 17, 2025

Summary by CodeRabbit

  • New Features

    • Local container build support: tooling and CI now build images from repository Dockerfiles and accept dockerfile-based image specs.
  • Documentation

    • Updated local and Slurm guides with step‑by‑step local image build/import instructions and revised sandbox startup guidance.
  • Chores

    • CI and cluster templates moved away from preset remote image tags toward locally-built image references; packaging now includes build assets.
  • Bug Fix / UX

    • Mount existence check relaxed for non‑Slurm flows; tests default to a safer fallback image.

Kipok added 7 commits October 17, 2025 12:24
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok Kipok requested a review from activatedgeek October 17, 2025 22:41
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 17, 2025

Walkthrough

Workflows, configs, and Python utilities now prefer repository Dockerfile-based local builds over prebuilt public images; a Dockerfile resolver/builder was added, one large NeMo Dockerfile removed, Dockerfiles/manifest/docs/tests updated, and CI/local image names switched to local image names.

Changes

Cohort / File(s) Summary
GitHub Workflows
.github/workflows/gpu_tests.yml, .github/workflows/tests.yml
CI jobs now build local images (nemo-skills-image, nemo-skills-sandbox-image) from repo Dockerfiles, removed Docker-change detection/pull paths, and updated cleanup/test steps to reference local image names.
Dockerfiles & manifest
dockerfiles/Dockerfile.nemo (deleted), dockerfiles/Dockerfile.nemo-skills, dockerfiles/README.md, MANIFEST.in
Deleted Dockerfile.nemo; Dockerfile.nemo-skills reworked to expect host-mounted nemo_skills and install requirements in-container; README example tag updated to nemo-skills-image; MANIFEST.in grafts dockerfiles and requirements.
Cluster configs & tests
cluster_configs/example-local.yaml, cluster_configs/example-slurm.yaml, tests/gpu-tests/test-local.yaml
Local configs switched image refs from public tags to dockerfile:dockerfiles/Dockerfile.*; SLURM example removed many predefined images and added guidance comments; test config updated to use dockerfile refs.
Runtime mapping & setup
nemo_skills/__init__.py, nemo_skills/pipeline/setup.py
_containers mapping changed to dockerfile:dockerfiles/... entries; setup.py now conditionally includes containers only for local configs, writes YAML with slurm guidance, and prompts/performs pull/build for local executor.
Docker image resolver & exports
nemo_skills/pipeline/utils/docker_images.py (new), nemo_skills/pipeline/utils/__init__.py
New module resolves dockerfile: specs, validates paths, computes digests, builds images via Docker CLI, and returns resolved image names; resolve_container_image re-exported in utils.
Executor integration
nemo_skills/pipeline/utils/exp.py
Local executor path now calls resolve_container_image() so DockerExecutor receives resolved (built) image names.
Mount checks
nemo_skills/pipeline/utils/mounts.py
check_remote_mount_directories no longer performs local existence checks for non-slurm paths (returns early); slurm/remote checks unchanged.
Tests helper
tests/conftest.py
Default image_name fallback changed from config containers["nemo-skills"] to literal "bash"; command wrapper uses bash -c '{command}'.
Docs & sandbox
docs/basics/index.md, docs/basics/sandbox.md
Docs updated to show locally-built container examples, .sqsh workflows, and a local start script for sandbox instead of Docker run examples.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User
    participant Setup as setup.py
    participant PullBuild as pull_docker_containers()
    participant Resolver as resolve_container_image()
    participant Docker as Docker CLI
    participant Executor as DockerExecutor

    User->>Setup: request create local config
    Setup->>User: include containers with "dockerfile:..." refs
    User-->>Setup: confirm/prompt to pull/build
    Setup->>PullBuild: pull_docker_containers(containers)
    activate PullBuild
    loop per container
        alt container starts with "dockerfile:"
            PullBuild->>Resolver: resolve_container_image(container, config)
            Resolver->>Resolver: validate path & compute digest
            Resolver->>Docker: docker build -f <path> -t <local-name>
            Docker-->>Resolver: build result (tag)
            Resolver-->>PullBuild: resolved image name
        else
            PullBuild->>Docker: docker pull <image>
            Docker-->>PullBuild: image ready
        end
    end
    deactivate PullBuild
    PullBuild->>Executor: provide resolved images
    Executor->>Docker: run containers (mount host code)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble Dockerfiles at dawn,

I stitch images from the repo lawn.
Host code mounted, builds unfold,
Local tags now brave and bold.
Hop, run, and log — new stories told.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "Switch to building containers on-the-fly for local runs" accurately and clearly describes the main change in the pull request. The changeset systematically transitions from using pre-built public Docker images (igitman/...) to building container images locally via Dockerfiles, as evidenced by changes across workflow files, configuration files, Python code for container resolution, and documentation updates. The title is concise, specific, and directly related to the core objective of the PR, making it clear to someone scanning the commit history what the primary change entails.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch igitman/containers

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
nemo_skills/pipeline/setup.py (1)

39-52: Consider more specific exception handling, but acceptable for CLI context.

The function now correctly handles both dockerfile-based and registry-based containers. The broad exception catching on lines 44 and 51 could potentially hide unexpected errors, but for a user-facing CLI command that aims to continue processing even if some containers fail, this is a reasonable trade-off.

If you want to be more precise, consider catching specific exceptions like subprocess.SubprocessError, ValueError, or RuntimeError instead of bare Exception.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c52273c and 84da1ef.

📒 Files selected for processing (16)
  • .github/workflows/gpu_tests.yml (4 hunks)
  • .github/workflows/tests.yml (1 hunks)
  • MANIFEST.in (1 hunks)
  • cluster_configs/example-local.yaml (2 hunks)
  • cluster_configs/example-slurm.yaml (1 hunks)
  • dockerfiles/Dockerfile.nemo (0 hunks)
  • dockerfiles/Dockerfile.nemo-skills (1 hunks)
  • dockerfiles/README.md (1 hunks)
  • docs/basics/index.md (2 hunks)
  • docs/basics/sandbox.md (1 hunks)
  • nemo_skills/__init__.py (1 hunks)
  • nemo_skills/pipeline/setup.py (4 hunks)
  • nemo_skills/pipeline/utils/__init__.py (1 hunks)
  • nemo_skills/pipeline/utils/docker_images.py (1 hunks)
  • nemo_skills/pipeline/utils/exp.py (2 hunks)
  • tests/gpu-tests/test-local.yaml (1 hunks)
💤 Files with no reviewable changes (1)
  • dockerfiles/Dockerfile.nemo
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/pipeline/utils/__init__.py (1)
nemo_skills/pipeline/utils/docker_images.py (1)
  • resolve_container_image (87-97)
nemo_skills/pipeline/utils/docker_images.py (2)
nemo_skills/utils.py (1)
  • get_logger_name (131-135)
nemo_skills/pipeline/utils/declarative.py (1)
  • run (343-484)
nemo_skills/pipeline/setup.py (1)
nemo_skills/pipeline/utils/docker_images.py (1)
  • resolve_container_image (87-97)
nemo_skills/pipeline/utils/exp.py (1)
nemo_skills/pipeline/utils/docker_images.py (1)
  • resolve_container_image (87-97)
🪛 markdownlint-cli2 (0.18.1)
docs/basics/index.md

180-180: Bare URL used

(MD034, no-bare-urls)


204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.0)
nemo_skills/pipeline/utils/docker_images.py

37-37: Avoid specifying long messages outside the exception class

(TRY003)


43-43: Avoid specifying long messages outside the exception class

(TRY003)


46-48: Avoid specifying long messages outside the exception class

(TRY003)


50-50: Avoid specifying long messages outside the exception class

(TRY003)


65-65: subprocess call: check for execution of untrusted input

(S603)


66-74: Starting a process with a partial executable path

(S607)


78-80: Avoid specifying long messages outside the exception class

(TRY003)


82-82: Avoid specifying long messages outside the exception class

(TRY003)


92-92: Avoid specifying long messages outside the exception class

(TRY003)


96-96: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/pipeline/setup.py

44-44: Do not catch blind exception: Exception

(BLE001)


49-49: subprocess call: check for execution of untrusted input

(S603)


49-49: Starting a process with a partial executable path

(S607)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (24)
dockerfiles/README.md (1)

7-7: LGTM! Documentation updated to reflect local builds.

The example now uses a generic local image tag instead of a public registry reference, which aligns with the PR's shift to on-the-fly container builds.

docs/basics/sandbox.md (1)

21-21: LGTM! Simplified sandbox launch command.

The update references a local start script instead of a Docker run command, making the process more straightforward for users and consistent with the new local-build approach.

MANIFEST.in (1)

3-4: LGTM! Essential directories included for local builds.

Adding dockerfiles and requirements to the distribution is necessary for the new dockerfile-based local build workflow. This ensures users have access to the files needed to build images on-the-fly.

cluster_configs/example-slurm.yaml (1)

18-19: LGTM! Guidance-based approach for SLURM container configuration.

The file now directs users to documentation instead of providing explicit container image definitions. This aligns with the new dockerfile-based approach and keeps the example config more maintainable.

nemo_skills/pipeline/utils/docker_images.py (5)

15-27: LGTM! Module setup is correct.

The imports, logger setup, and constants are appropriate. The _REPO_ROOT computation correctly navigates up three directory levels from the file location to reach the repository root.


29-31: LGTM! Robust sanitization for Docker image names.

The function conservatively sanitizes image components to lowercase alphanumeric characters and hyphens, which is safe for Docker naming requirements.


34-52: LGTM! Secure path validation with proper containment checks.

The function correctly validates Dockerfile paths with important security checks:

  • Rejects absolute paths
  • Resolves symlinks and relative components
  • Ensures paths stay within the repository root
  • Validates file existence and type

The static analysis hints about TRY003 are stylistic preferences and don't affect the security or correctness of the implementation.


55-84: Verify: Docker image caching behavior based on Dockerfile content only.

The image digest is computed solely from the Dockerfile contents (line 59), not from the build context or source files. This means:

  • Images are rebuilt only when the Dockerfile itself changes
  • Changes to source code or other context files won't trigger rebuilds
  • The code will be mounted at runtime instead

Confirm this is the intended caching strategy for local development workflows.

The static analysis warnings about subprocess execution (S603/S607) are expected for Docker CLI usage and don't represent security concerns here since the Dockerfile path is validated and the build context is restricted to the repository root.


87-97: LGTM! Clear entry point with appropriate validation.

The function correctly:

  • Passes through non-dockerfile container specifications unchanged
  • Enforces that dockerfile-based builds only work with the local executor
  • Validates the dockerfile path is provided
  • Delegates to the build function

The static analysis hints are stylistic and don't affect functionality.

nemo_skills/pipeline/utils/__init__.py (1)

34-34: LGTM! Proper public API exposure.

The resolve_container_image function is correctly exported from the utils module, making it available for use across the pipeline codebase.

nemo_skills/pipeline/utils/exp.py (2)

37-37: LGTM! Import added for container resolution.

The import statement is correctly placed and will be used in the local executor path.


200-202: LGTM! Seamless integration of dockerfile-based image resolution.

The container resolution is correctly integrated into the local executor flow. The resolved image reference is used for the DockerExecutor, enabling on-the-fly image builds when dockerfile specifications are provided.

.github/workflows/gpu_tests.yml (3)

43-46: LGTM! Docker image build step added for GPU tests.

The workflow now builds the nemo-skills-image locally from dockerfiles/Dockerfile.nemo-skills before running tests. This aligns with the PR's objective to build containers on-the-fly.


86-89: LGTM! Consistent Docker build step for Qwen tests.

The build step matches the Llama tests configuration, ensuring both GPU test jobs follow the same local build approach.


59-59: LGTM! Cleanup steps updated to use local image reference.

The cleanup commands now reference nemo-skills-image instead of the previous public registry image, maintaining consistency with the new local build workflow.

Also applies to: 102-102

cluster_configs/example-local.yaml (1)

21-26: LGTM! Clear transition to dockerfile-based container definitions.

The switch from public images to local dockerfile references is well-documented with helpful comments. This aligns with the PR's objective to build containers on-the-fly for local runs.

tests/gpu-tests/test-local.yaml (1)

21-25: LGTM! Consistent with the new dockerfile-based approach.

The container references are properly updated to use dockerfile specifications, matching the pattern established in other config files.

.github/workflows/tests.yml (1)

44-46: Verify comment accuracy about matching tags.

The comment states "these tags need to match the ones in tests/gpu-tests/test-local.yaml", but test-local.yaml uses dockerfile: references rather than direct image tags. The tags here (nemo-skills-image, nemo-skills-sandbox-image) appear to be build-time names for local use in this workflow.

Consider clarifying the comment to explain these are local build tags for CI, not references that need to match the config file.

nemo_skills/__init__.py (1)

22-26: LGTM! Consistent container reference updates.

The _containers mapping now properly uses dockerfile-based references, aligning with the broader PR changes to support local builds.

docs/basics/index.md (1)

101-106: LGTM! Clear documentation of the new build approach.

The updated comments accurately reflect that some containers are now built locally from Dockerfiles rather than pulled from registries.

nemo_skills/pipeline/setup.py (4)

25-25: LGTM! Proper import for dockerfile-based image resolution.

The import of resolve_container_image enables the new build-from-dockerfile functionality for local containers.


93-96: LGTM! Appropriate separation of concerns between executor types.

Only including the containers mapping for local configs makes sense, as slurm configs require different handling (sqsh files or registry images). The config is properly initialized with just the executor, and containers are added conditionally.


215-225: LGTM! Clean YAML generation with helpful slurm guidance.

The approach of generating YAML content as a string and then replacing the executor line with a helpful comment block for slurm configs is a pragmatic solution that provides clear guidance to users.


227-246: LGTM! Clear messaging adapted to executor type.

The conditional messaging and prompt text properly reflect the different workflows for local (pull/build containers) versus slurm (manual sqsh setup) configurations.

Comment on lines +51 to +58
# we aren't copying main nemo_skills folder as it will always be mounted from host
# but we do want to install all requirements in the container directly
RUN mkdir -p /opt/NeMo-Skills/requirements
COPY pyproject.toml README.md /opt/NeMo-Skills/
COPY requirements /opt/NeMo-Skills/requirements/
# installing sdp in container only
RUN pip install git+https://github.com/NVIDIA/NeMo-speech-data-processor@29b9b1ec0ceaf3ffa441c1d01297371b3f8e11d2
RUN pip install -r /opt/NeMo-Skills/requirements/main.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that all container usage contexts properly mount the nemo_skills code.

This is a significant architectural change: the image no longer contains the nemo_skills source code and instead expects it to be mounted from the host at runtime. While this is excellent for local development workflows (allowing live code changes), it introduces a dependency on proper mount configuration.

Ensure that:

  1. All places that use this container (CI workflows, documentation examples, cluster configs) properly mount the code directory
  2. Error messages are clear if the container is run without the required mount
  3. Documentation clearly explains this requirement

🏁 Script executed:

#!/bin/bash
# Verify that container usage includes proper code mounting

echo "=== Checking workflow files for code mounting ==="
rg -n -B2 -A2 'nemo-skills-image|nemo-skills-sandbox-image' .github/workflows/

echo -e "\n=== Checking if resolve_container_image or related code handles mounting ==="
rg -n -B3 -A3 'resolve_container_image' nemo_skills/pipeline/

echo -e "\n=== Checking documentation for mount requirements ==="
rg -n -B2 -A2 'dockerfile:dockerfiles/Dockerfile.nemo-skills' docs/

Length of output: 6127


Let me examine the core Docker execution code and mount handling:


---

🏁 Script executed:

```shell
#!/bin/bash
# Look at the DockerExecutor class and how it handles mounting

echo "=== Finding DockerExecutor class ==="
rg -n 'class DockerExecutor' nemo_skills/

echo -e "\n=== Looking at DockerExecutor instantiation in exp.py ==="
rg -n -B5 -A10 'DockerExecutor(' nemo_skills/pipeline/utils/exp.py

echo -e "\n=== Looking at mounts.py file ==="
rg -n 'def get_mounts_from_config' nemo_skills/

echo -e "\n=== Checking how docker executor actually runs containers ==="
rg -n 'docker run' nemo_skills/

Length of output: 460



---

🏁 Script executed:

```shell
#!/bin/bash
# Let's look at actual mount configurations

echo "=== Looking at get_mounts_from_config in mounts.py ==="
rg -n -A30 'def get_mounts_from_config' nemo_skills/pipeline/utils/mounts.py

echo -e "\n=== Looking at start_local_sandbox.sh docker run command ==="
cat nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh | head -50

echo -e "\n=== Checking for nemo_skills source mount patterns ==="
rg -n 'nemo_skills' nemo_skills/pipeline/utils/mounts.py | head -20

Length of output: 2872



---

🏁 Script executed:

```shell
#!/bin/bash
# Check the Dockerfile to see what paths it expects at runtime

echo "=== Full Dockerfile.nemo-skills ==="
cat dockerfiles/Dockerfile.nemo-skills

echo -e "\n=== Looking for default cluster configs ==="
fd -n 'cluster.*\.yaml|config.*\.yaml' docs/ examples/ | head -20

echo -e "\n=== Checking if nemo_skills is installed as a package ==="
rg -n 'pip install.*nemo.skills|setup.py|pyproject.toml' dockerfiles/Dockerfile.nemo-skills

Length of output: 2899



---

🏁 Script executed:

```shell
#!/bin/bash
# Find cluster configs and understand mount setup

echo "=== Look for cluster config examples ==="
find docs examples -name "*.yaml" -o -name "*.yml" 2>/dev/null | head -20

echo -e "\n=== Check exp.py for local executor mount setup ==="
rg -n -B10 -A20 'executor.*local' nemo_skills/pipeline/utils/exp.py | head -60

echo -e "\n=== Check where nemo_skills code is imported at runtime ==="
rg -n 'from nemo_skills|import nemo_skills' nemo_skills/pipeline/ | head -10

Length of output: 3691



---

🏁 Script executed:

```shell
#!/bin/bash
# Final check: look for documentation about the mounting requirement

echo "=== Check README for mounting instructions ==="
rg -n -B2 -A2 'mount|volume|docker.*run' README.md | head -40

echo -e "\n=== Look for any setup or quickstart docs ==="
find docs -name "*.md" -type f | xargs grep -l "docker.*mount\|volume" 2>/dev/null | head -5

echo -e "\n=== Check if there's a getting started doc ==="
ls -la docs/*.md docs/*/index.md 2>/dev/null | head -20

Length of output: 572



---

🏁 Script executed:

```shell
#!/bin/bash
# Quick check: does pyproject.toml get installed as a package in the container?

echo "=== Check if pyproject.toml is used to install nemo_skills as package ==="
rg -n 'pip install.*pyproject|pip install -e' dockerfiles/Dockerfile.nemo-skills

echo -e "\n=== Check what's in main.txt requirements ==="
head -30 requirements/main.txt

echo -e "\n=== Check if nemo_skills is even listed in requirements ==="
grep -i 'nemo.skills\|nemo-skills' requirements/main.txt

Length of output: 1426


Fix the incomplete architectural transition: source code mounting not implemented in workflows or documentation.

This is a breaking change that's incompletely implemented:

  1. Dockerfile expects mounting (line 51 comment) but the code isn't in the container
  2. Workflows don't mount the source code:
    • tests.yml: docker run --rm --network=host nemo-skills-sandbox-image (no mounts)
    • gpu_tests.yml: docker run --rm -v /tmp:/tmp -v /home:/home nemo-skills-image (mounts /tmp and /home only)
  3. No automatic mounting mechanism exists in DockerExecutorvolumes=mounts only includes user-specified mounts from cluster config
  4. No documentation explains this requirement

Required fixes:

  • Add source code mount to CI workflows (e.g., -v $(pwd):/opt/NeMo-Skills)
  • Document the mounting requirement in README and docs
  • Consider adding default mounting for local executor or clear error handling if mount is missing
🤖 Prompt for AI Agents
In dockerfiles/Dockerfile.nemo-skills around lines 51 to 58, the Dockerfile and
project assume the NeMo-Skills source will be mounted into the container but CI
workflows, the DockerExecutor, and documentation were not updated to provide or
require that mount; update CI workflow files (tests.yml and gpu_tests.yml) to
mount the repository into the container (e.g., -v $(pwd):/opt/NeMo-Skills or
equivalent CI-safe path), update DockerExecutor to either add a sensible default
source mount for local runs or detect missing mount and fail early with a clear
error, and add a short note in README/docs explaining that the container expects
the host source to be mounted at /opt/NeMo-Skills and how to run CI/locally with
the mount; ensure the mount path used in workflows matches the path referenced
in the Dockerfile.

Comment on lines +178 to +204
!!! note
If you're an NVIDIA employee, we have a pre-configured cluster configs for internal usage with pre-built sqsh
containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely
skip the step below and reuse one of the existing configurations.

You will also need to build .sqsh files for all containers or upload all `dockerfile:...` containers to
some registry (e.g. dockerhub) and reference the uploaded versions. To build sqsh files you can use the following commands

1. Build images locally and upload to some container registry. E.g.
```bash
docker build -t gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1 -f dockerfiles/Dockerfile.nemo-skills .
docker push gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
```
2. Start an interactive shell, e.g. with the following (assuming there is a "cpu" partition)
```bash
srun -A <account> --partition cpu --job-name build-sqsh --time=1:00:00 --exclusive --pty /bin/bash -l
```
3. Import the image, e.g.:
```bash
enroot import -o /path/to/nemo-skills-image.sqsh --docker://gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
```
4. Specify this image path in your cluster config
```yaml
containers:
nemo-skills: /path/to/nemo-skills-image.sqsh
```
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix markdown linting issues in the documentation.

The new section provides helpful guidance, but there are a couple of formatting issues:

  1. Line 180: The bare URL should be formatted as a proper markdown link or wrapped in angle brackets
  2. Line 204: The closing code fence is missing a language specifier (should be ```yaml instead of just ```)

Apply these fixes:

For line 180 - replace the bare URL with a proper markdown link:

-    containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely
+    containers available at [GitLab](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs). You can most likely

For line 204 - the code block starting earlier should close with a language identifier. Based on the content showing yaml examples, verify the code fence at line 204 properly closes the yaml block.

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

180-180: Bare URL used

(MD034, no-bare-urls)


204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/basics/index.md around lines 178-204, fix two markdown lint issues:
replace the bare URL on line 180 with a proper markdown link (e.g.
[pre-configured
configs](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs) or wrap
it in angle brackets) and ensure the code block that demonstrates the YAML
cluster config is closed correctly by using a matching fenced code block with
the yaml language identifier (change the trailing ``` to ```yaml).

Kipok added 3 commits October 17, 2025 16:35
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/tests.yml (1)

42-43: Improve readiness verification for the background sandbox container.

The workflow starts a background container and waits a fixed 10 seconds before running tests. This sleep-based approach is fragile and can lead to intermittent test failures if the container takes longer to initialize or fails silently.

Consider adding a health check or readiness probe instead of relying on a fixed delay.

Example approach using a health check loop:

-        docker run --rm --network=host nemo-skills-sandbox-image &
-        sleep 10
+        docker run --rm --network=host --name nemo-skills-sandbox nemo-skills-sandbox-image &
+        # Wait for container to be ready (verify with health check or port)
+        for i in {1..30}; do
+          if docker exec nemo-skills-sandbox healthcheck-command &>/dev/null; then
+            echo "Container is ready"
+            break
+          fi
+          if [ $i -eq 30 ]; then echo "Container failed to start"; exit 1; fi
+          sleep 1
+        done

Alternatively, if the container exposes a port/service, verify connectivity before proceeding with tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b64ba26 and f139e70.

📒 Files selected for processing (2)
  • .github/workflows/tests.yml (1 hunks)
  • nemo_skills/pipeline/utils/mounts.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (3)
nemo_skills/pipeline/utils/mounts.py (2)

363-363: LGTM: Docstring accurately reflects function behavior.

The updated docstring correctly describes that this function checks directory existence rather than creating directories.


375-377: I need to gather more context about executor types and whether this behavior is intentional. Let me search for more information:

I need to understand the executor types better and check if this behavior change affects container mount handling:

Now let me check the order of operations and understand when these functions are called:

Based on my analysis of the codebase and verification of Docker bind mount behavior, I can now provide my final assessment:

Confirm the mount validation skip requires verification and poses a risk.

Docker's --mount flag does not automatically create directories if the source path doesn't exist and produces an error instead: "bind source path does not exist". The code change skips validation for non-slurm executors (lines 375-377), which conflicts with Docker's actual runtime behavior.

From the script output, I observed:

  • create_remote_directory() (lines 299-302) still attempts local directory creation via LocalTunnel for non-slurm executors
  • check_remote_mount_directories() (lines 375-377) now skips validation entirely for the same executor types
  • This asymmetry means directories may be created but never validated to exist before container startup

This creates a gap where configuration errors won't surface until the container attempts to mount, resulting in confusing runtime failures instead of clear validation-time errors during local development.

Recommendation: Verify this behavior change is intentional for the dockerfile-based local workflow. If directories are genuinely not required locally, add documentation explaining why. If this is unintended, the validation should not be skipped.

.github/workflows/tests.yml (1)

35-36: Line 34's comment is inaccurate; image tags don't need to match test-local.yaml format.

Dockerfiles exist and appear structurally sound. However, the comment on line 34 ("these tags need to match the ones in tests/gpu-tests/test-local.yaml") is misleading. The workflow uses Docker image tags ("nemo-skills-image", "nemo-skills-sandbox-image"), while test-local.yaml uses Apptainer/Singularity format ("dockerfile:dockerfiles/..."). These are different test frameworks and their naming conventions don't need to match.

Additionally, line 42's container readiness check relies on an arbitrary 10-second sleep with no health verification. If the sandbox service requires more time to start, tests will fail without clear diagnostics. Consider replacing this with a health check or explicit readiness validation.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f139e70 and b823b48.

📒 Files selected for processing (1)
  • tests/conftest.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit

Comment on lines 27 to +28
if image_name is None:
image_name = config["containers"]["nemo-skills"]
image_name = "bash"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that "bash" is the correct default image for tests.

Changing the default from config["containers"]["nemo-skills"] to the literal string "bash" is concerning. The public bash Docker image is a minimal shell environment without NeMo Skills dependencies, which would cause test failures.

If the intent is to remove the default entirely (forcing callers to specify images), consider raising an error instead. If a default is still needed, it should integrate with the new dockerfile-based resolution system mentioned in the PR (e.g., nemo_skills/pipeline/utils/docker_images.py).

Run the following script to verify all callers now pass explicit image_name:


🏁 Script executed:

#!/bin/bash
# Description: Find all calls to docker_run without explicit image_name parameter

# Search for docker_run calls that might rely on the default
rg -n -A 2 -B 2 'docker_run\(' --type=py

Length of output: 1172



---

🏁 Script executed:

```shell
#!/bin/bash
# Get full conftest.py to understand docker_run implementation and old defaults

cat tests/conftest.py

Length of output: 2384


🏁 Script executed:

#!/bin/bash
# Search for docker_images utilities integration and config references

rg -n 'docker_images|config\[.*containers' --type=py -C 3

Length of output: 16462


🏁 Script executed:

#!/bin/bash
# Check if "bash" image is used as default elsewhere or if it's a custom build

rg -n '"bash"' --type=py tests/

Length of output: 48


Fix incomplete refactoring: docker_run default image must integrate with docker_images utilities or remove the default entirely.

The change from config["containers"]["nemo-skills"] to hardcoded "bash" is a breaking refactor that:

  1. Leaves 3 callers broken: docker_rm() (line 66), docker_rm_and_mkdir() (line 72), and test code (line 181) all rely on the default and will now silently use a minimal bash image instead of the full NeMo Skills environment.

  2. Ignores the docker_images infrastructure: The PR introduces nemo_skills/pipeline/utils/docker_images.py with resolve_container_image(), which is imported throughout the main codebase (setup.py, exp.py, etc.), but conftest.py doesn't use it—just hardcodes "bash".

  3. Breaks the abstraction: The production pipeline is config-driven (cluster_config["containers"]["nemo-skills"]), but tests are now hardcoded to a public minimal image.

Either integrate conftest.py with the docker_images utilities for proper container resolution, or remove the default and require all callers to pass explicit image_name. Hardcoding "bash" is not maintainable.

🤖 Prompt for AI Agents
In tests/conftest.py around lines 27-28, the default image_name was hardcoded to
"bash", which breaks callers and bypasses the docker_images resolution; replace
this by either (A) invoking the existing docker_images.resolve_container_image
or reading cluster_config["containers"]["nemo-skills"] to compute the default
image (importing the resolver at top), or (B) remove the implicit default and
raise/require callers to pass image_name (update docker_rm, docker_rm_and_mkdir
and tests to supply it). Ensure imports are added and all callers updated if you
choose option B.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/conftest.py (1)

27-28: The hardcoded "bash" default breaks callers and bypasses docker_images resolution.

This issue was already flagged in the previous review. The change from config["containers"]["nemo-skills"] to "bash" leaves docker_rm() (line 66) and docker_rm_and_mkdir() (line 72) using a minimal bash image instead of the NeMo Skills environment, and ignores the PR's docker_images infrastructure with resolve_container_image().

🧹 Nitpick comments (1)
tests/conftest.py (1)

42-42: Consider using absolute path /bin/bash for better robustness.

The change from /bin/bash -c to bash -c relies on PATH resolution instead of an explicit absolute path. While functionally equivalent in standard containers, using the absolute path is more explicit and portable.

Apply this diff to restore the absolute path:

-    full_command = f"bash -c '{command}'"
+    full_command = f"/bin/bash -c '{command}'"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b823b48 and a451bcc.

📒 Files selected for processing (1)
  • tests/conftest.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

@Kipok Kipok merged commit 617e893 into main Oct 20, 2025
7 checks passed
@Kipok Kipok deleted the igitman/containers branch October 20, 2025 16:49
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 22, 2025
)

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Dec 4, 2025
@coderabbitai coderabbitai bot mentioned this pull request Jan 15, 2026
@coderabbitai coderabbitai bot mentioned this pull request Jan 29, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants