Switch to building containers on-the-fly for local runs#969
Conversation
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Walkthrough

Workflows, configs, and Python utilities now prefer repository Dockerfile-based local builds over prebuilt public images; a Dockerfile resolver/builder was added, one large NeMo Dockerfile was removed, Dockerfiles, the manifest, docs, and tests were updated, and CI configs now reference local image names.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant User
    participant Setup as setup.py
    participant PullBuild as pull_docker_containers()
    participant Resolver as resolve_container_image()
    participant Docker as Docker CLI
    participant Executor as DockerExecutor
    User->>Setup: request create local config
    Setup->>User: include containers with "dockerfile:..." refs
    User-->>Setup: confirm/prompt to pull/build
    Setup->>PullBuild: pull_docker_containers(containers)
    activate PullBuild
    loop per container
        alt container starts with "dockerfile:"
            PullBuild->>Resolver: resolve_container_image(container, config)
            Resolver->>Resolver: validate path & compute digest
            Resolver->>Docker: docker build -f <path> -t <local-name>
            Docker-->>Resolver: build result (tag)
            Resolver-->>PullBuild: resolved image name
        else
            PullBuild->>Docker: docker pull <image>
            Docker-->>PullBuild: image ready
        end
    end
    deactivate PullBuild
    PullBuild->>Executor: provide resolved images
    Executor->>Docker: run containers (mount host code)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ❌ 1 failed check (warning), ✅ 2 passed checks.
Actionable comments posted: 2
🧹 Nitpick comments (1)
nemo_skills/pipeline/setup.py (1)
39-52: Consider more specific exception handling, but acceptable for CLI context.

The function now correctly handles both dockerfile-based and registry-based containers. The broad exception catching on lines 44 and 51 could potentially hide unexpected errors, but for a user-facing CLI command that aims to continue processing even if some containers fail, this is a reasonable trade-off.

If you want to be more precise, consider catching specific exceptions like `subprocess.SubprocessError`, `ValueError`, or `RuntimeError` instead of a bare `Exception`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- .github/workflows/gpu_tests.yml (4 hunks)
- .github/workflows/tests.yml (1 hunks)
- MANIFEST.in (1 hunks)
- cluster_configs/example-local.yaml (2 hunks)
- cluster_configs/example-slurm.yaml (1 hunks)
- dockerfiles/Dockerfile.nemo (0 hunks)
- dockerfiles/Dockerfile.nemo-skills (1 hunks)
- dockerfiles/README.md (1 hunks)
- docs/basics/index.md (2 hunks)
- docs/basics/sandbox.md (1 hunks)
- nemo_skills/__init__.py (1 hunks)
- nemo_skills/pipeline/setup.py (4 hunks)
- nemo_skills/pipeline/utils/__init__.py (1 hunks)
- nemo_skills/pipeline/utils/docker_images.py (1 hunks)
- nemo_skills/pipeline/utils/exp.py (2 hunks)
- tests/gpu-tests/test-local.yaml (1 hunks)
💤 Files with no reviewable changes (1)
- dockerfiles/Dockerfile.nemo
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/pipeline/utils/__init__.py (1)
- nemo_skills/pipeline/utils/docker_images.py (1): `resolve_container_image` (87-97)

nemo_skills/pipeline/utils/docker_images.py (2)
- nemo_skills/utils.py (1): `get_logger_name` (131-135)
- nemo_skills/pipeline/utils/declarative.py (1): `run` (343-484)

nemo_skills/pipeline/setup.py (1)
- nemo_skills/pipeline/utils/docker_images.py (1): `resolve_container_image` (87-97)

nemo_skills/pipeline/utils/exp.py (1)
- nemo_skills/pipeline/utils/docker_images.py (1): `resolve_container_image` (87-97)
🪛 markdownlint-cli2 (0.18.1)
docs/basics/index.md
180-180: Bare URL used
(MD034, no-bare-urls)
204-204: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.0)
nemo_skills/pipeline/utils/docker_images.py
37-37: Avoid specifying long messages outside the exception class
(TRY003)
43-43: Avoid specifying long messages outside the exception class
(TRY003)
46-48: Avoid specifying long messages outside the exception class
(TRY003)
50-50: Avoid specifying long messages outside the exception class
(TRY003)
65-65: subprocess call: check for execution of untrusted input
(S603)
66-74: Starting a process with a partial executable path
(S607)
78-80: Avoid specifying long messages outside the exception class
(TRY003)
82-82: Avoid specifying long messages outside the exception class
(TRY003)
92-92: Avoid specifying long messages outside the exception class
(TRY003)
96-96: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/pipeline/setup.py
44-44: Do not catch blind exception: Exception
(BLE001)
49-49: subprocess call: check for execution of untrusted input
(S603)
49-49: Starting a process with a partial executable path
(S607)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (24)
dockerfiles/README.md (1)
7-7: LGTM! Documentation updated to reflect local builds.

The example now uses a generic local image tag instead of a public registry reference, which aligns with the PR's shift to on-the-fly container builds.
docs/basics/sandbox.md (1)
21-21: LGTM! Simplified sandbox launch command.

The update references a local start script instead of a Docker run command, making the process more straightforward for users and consistent with the new local-build approach.
MANIFEST.in (1)
3-4: LGTM! Essential directories included for local builds.Adding
dockerfilesandrequirementsto the distribution is necessary for the new dockerfile-based local build workflow. This ensures users have access to the files needed to build images on-the-fly.cluster_configs/example-slurm.yaml (1)
18-19: LGTM! Guidance-based approach for SLURM container configuration.The file now directs users to documentation instead of providing explicit container image definitions. This aligns with the new dockerfile-based approach and keeps the example config more maintainable.
nemo_skills/pipeline/utils/docker_images.py (5)
15-27: LGTM! Module setup is correct.

The imports, logger setup, and constants are appropriate. The `_REPO_ROOT` computation correctly navigates up three directory levels from the file location to reach the repository root.

29-31: LGTM! Robust sanitization for Docker image names.

The function conservatively sanitizes image components to lowercase alphanumeric characters and hyphens, which is safe for Docker naming requirements.
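A minimal sketch of that kind of conservative sanitization (the function name and the exact replacement rules here are illustrative assumptions, not the real helper in `docker_images.py`):

```python
import re


def sanitize_image_component(value: str) -> str:
    """Reduce an arbitrary string to lowercase alphanumerics and hyphens,
    which is always valid inside a Docker image name component."""
    value = value.lower()
    # Collapse every run of disallowed characters into a single hyphen.
    value = re.sub(r"[^a-z0-9-]+", "-", value)
    # Docker name components may not start or end with a separator.
    return value.strip("-")
```

For example, a Dockerfile filename like `Dockerfile.nemo-skills` becomes a usable tag fragment without dots or uppercase letters.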
34-52: LGTM! Secure path validation with proper containment checks.The function correctly validates Dockerfile paths with important security checks:
- Rejects absolute paths
- Resolves symlinks and relative components
- Ensures paths stay within the repository root
- Validates file existence and type
The static analysis hints about TRY003 are stylistic preferences and don't affect the security or correctness of the implementation.
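The containment checks listed above can be sketched as follows. This is an assumed shape, not the actual `docker_images.py` implementation; `Path.is_relative_to` requires Python 3.9+.

```python
from pathlib import Path


def validate_dockerfile_path(repo_root: Path, dockerfile: str) -> Path:
    """Reject absolute paths, resolve symlinks and '..' components, and
    ensure the result stays inside the repository root."""
    if Path(dockerfile).is_absolute():
        raise ValueError(f"dockerfile path must be relative: {dockerfile}")
    resolved = (repo_root / dockerfile).resolve()
    if not resolved.is_relative_to(repo_root.resolve()):
        raise ValueError(f"dockerfile path escapes repository: {dockerfile}")
    if not resolved.is_file():
        raise ValueError(f"not a file: {resolved}")
    return resolved
```

Resolving before the containment check is what defeats `../` traversal and symlink escapes; checking the raw string alone would not.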
55-84: Verify: Docker image caching behavior based on Dockerfile content only.

The image digest is computed solely from the Dockerfile contents (line 59), not from the build context or source files. This means:
- Images are rebuilt only when the Dockerfile itself changes
- Changes to source code or other context files won't trigger rebuilds
- The code will be mounted at runtime instead
Confirm this is the intended caching strategy for local development workflows.
The static analysis warnings about subprocess execution (S603/S607) are expected for Docker CLI usage and don't represent security concerns here since the Dockerfile path is validated and the build context is restricted to the repository root.
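The content-only caching strategy described above can be illustrated like this (function name, tag prefix, and digest length are assumptions for the sketch, not the real code):

```python
import hashlib
from pathlib import Path


def local_image_tag(dockerfile: Path, prefix: str = "nemo-skills-local") -> str:
    """Derive an image tag from the Dockerfile *contents* only: the image
    is rebuilt when the Dockerfile changes, while source-code changes are
    picked up via the runtime mount instead of a rebuild."""
    digest = hashlib.sha256(dockerfile.read_bytes()).hexdigest()[:12]
    return f"{prefix}:{digest}"
```

Because the tag is deterministic for identical Dockerfile contents, `docker build -t <tag>` becomes a no-op cache hit until the Dockerfile itself is edited.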
87-97: LGTM! Clear entry point with appropriate validation.

The function correctly:
- Passes through non-dockerfile container specifications unchanged
- Enforces that dockerfile-based builds only work with the local executor
- Validates the dockerfile path is provided
- Delegates to the build function
The static analysis hints are stylistic and don't affect functionality.
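The dispatch described above can be sketched as follows (the function and the injected `builder` are illustrative stand-ins, not the real `resolve_container_image` signature):

```python
def resolve_container_image_sketch(container: str, executor: str,
                                   builder=lambda path: f"built:{path}") -> str:
    """Registry refs pass through untouched; dockerfile refs are only
    valid for the local executor and are built on the fly."""
    if not container.startswith("dockerfile:"):
        return container
    if executor != "local":
        raise ValueError("dockerfile: containers require the local executor")
    path = container[len("dockerfile:"):]
    if not path:
        raise ValueError("dockerfile: reference is missing a path")
    return builder(path)
```

Passing non-dockerfile specifications through unchanged keeps existing registry-based cluster configs working without modification.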
nemo_skills/pipeline/utils/__init__.py (1)
34-34: LGTM! Proper public API exposure.

The `resolve_container_image` function is correctly exported from the utils module, making it available for use across the pipeline codebase.

nemo_skills/pipeline/utils/exp.py (2)
37-37: LGTM! Import added for container resolution.

The import statement is correctly placed and will be used in the local executor path.
200-202: LGTM! Seamless integration of dockerfile-based image resolution.

The container resolution is correctly integrated into the local executor flow. The resolved image reference is used for the `DockerExecutor`, enabling on-the-fly image builds when dockerfile specifications are provided.

.github/workflows/gpu_tests.yml (3)
43-46: LGTM! Docker image build step added for GPU tests.

The workflow now builds the `nemo-skills-image` locally from `dockerfiles/Dockerfile.nemo-skills` before running tests. This aligns with the PR's objective to build containers on-the-fly.
86-89: LGTM! Consistent Docker build step for Qwen tests.

The build step matches the Llama tests configuration, ensuring both GPU test jobs follow the same local build approach.
59-59: LGTM! Cleanup steps updated to use local image reference.

The cleanup commands now reference `nemo-skills-image` instead of the previous public registry image, maintaining consistency with the new local build workflow.

Also applies to: 102-102
cluster_configs/example-local.yaml (1)
21-26: LGTM! Clear transition to dockerfile-based container definitions.

The switch from public images to local dockerfile references is well-documented with helpful comments. This aligns with the PR's objective to build containers on-the-fly for local runs.
tests/gpu-tests/test-local.yaml (1)
21-25: LGTM! Consistent with the new dockerfile-based approach.

The container references are properly updated to use dockerfile specifications, matching the pattern established in other config files.
.github/workflows/tests.yml (1)
44-46: Verify comment accuracy about matching tags.

The comment states "these tags need to match the ones in tests/gpu-tests/test-local.yaml", but test-local.yaml uses `dockerfile:` references rather than direct image tags. The tags here (`nemo-skills-image`, `nemo-skills-sandbox-image`) appear to be build-time names for local use in this workflow.

Consider clarifying the comment to explain these are local build tags for CI, not references that need to match the config file.
nemo_skills/__init__.py (1)
22-26: LGTM! Consistent container reference updates.

The `_containers` mapping now properly uses dockerfile-based references, aligning with the broader PR changes to support local builds.

docs/basics/index.md (1)
101-106: LGTM! Clear documentation of the new build approach.

The updated comments accurately reflect that some containers are now built locally from Dockerfiles rather than pulled from registries.
nemo_skills/pipeline/setup.py (4)
25-25: LGTM! Proper import for dockerfile-based image resolution.

The import of `resolve_container_image` enables the new build-from-dockerfile functionality for local containers.
93-96: LGTM! Appropriate separation of concerns between executor types.

Only including the containers mapping for local configs makes sense, as slurm configs require different handling (sqsh files or registry images). The config is properly initialized with just the executor, and containers are added conditionally.
215-225: LGTM! Clean YAML generation with helpful slurm guidance.

The approach of generating YAML content as a string and then replacing the executor line with a helpful comment block for slurm configs is a pragmatic solution that provides clear guidance to users.
227-246: LGTM! Clear messaging adapted to executor type.

The conditional messaging and prompt text properly reflect the different workflows for local (pull/build containers) versus slurm (manual sqsh setup) configurations.
dockerfiles/Dockerfile.nemo-skills:

```dockerfile
# we aren't copying main nemo_skills folder as it will always be mounted from host
# but we do want to install all requirements in the container directly
RUN mkdir -p /opt/NeMo-Skills/requirements
COPY pyproject.toml README.md /opt/NeMo-Skills/
COPY requirements /opt/NeMo-Skills/requirements/
# installing sdp in container only
RUN pip install git+https://github.com/NVIDIA/NeMo-speech-data-processor@29b9b1ec0ceaf3ffa441c1d01297371b3f8e11d2
RUN pip install -r /opt/NeMo-Skills/requirements/main.txt
```
🧩 Analysis chain
Verify that all container usage contexts properly mount the nemo_skills code.
This is a significant architectural change: the image no longer contains the nemo_skills source code and instead expects it to be mounted from the host at runtime. While this is excellent for local development workflows (allowing live code changes), it introduces a dependency on proper mount configuration.
Ensure that:
- All places that use this container (CI workflows, documentation examples, cluster configs) properly mount the code directory
- Error messages are clear if the container is run without the required mount
- Documentation clearly explains this requirement
🏁 Script executed:

```shell
#!/bin/bash
# Verify that container usage includes proper code mounting
echo "=== Checking workflow files for code mounting ==="
rg -n -B2 -A2 'nemo-skills-image|nemo-skills-sandbox-image' .github/workflows/
echo -e "\n=== Checking if resolve_container_image or related code handles mounting ==="
rg -n -B3 -A3 'resolve_container_image' nemo_skills/pipeline/
echo -e "\n=== Checking documentation for mount requirements ==="
rg -n -B2 -A2 'dockerfile:dockerfiles/Dockerfile.nemo-skills' docs/
```

Length of output: 6127
Let me examine the core Docker execution code and mount handling:
---
🏁 Script executed:
```shell
#!/bin/bash
# Look at the DockerExecutor class and how it handles mounting
echo "=== Finding DockerExecutor class ==="
rg -n 'class DockerExecutor' nemo_skills/
echo -e "\n=== Looking at DockerExecutor instantiation in exp.py ==="
rg -n -B5 -A10 'DockerExecutor(' nemo_skills/pipeline/utils/exp.py
echo -e "\n=== Looking at mounts.py file ==="
rg -n 'def get_mounts_from_config' nemo_skills/
echo -e "\n=== Checking how docker executor actually runs containers ==="
rg -n 'docker run' nemo_skills/
```

Length of output: 460
---
🏁 Script executed:
```shell
#!/bin/bash
# Let's look at actual mount configurations
echo "=== Looking at get_mounts_from_config in mounts.py ==="
rg -n -A30 'def get_mounts_from_config' nemo_skills/pipeline/utils/mounts.py
echo -e "\n=== Looking at start_local_sandbox.sh docker run command ==="
cat nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh | head -50
echo -e "\n=== Checking for nemo_skills source mount patterns ==="
rg -n 'nemo_skills' nemo_skills/pipeline/utils/mounts.py | head -20
```

Length of output: 2872
---
🏁 Script executed:
```shell
#!/bin/bash
# Check the Dockerfile to see what paths it expects at runtime
echo "=== Full Dockerfile.nemo-skills ==="
cat dockerfiles/Dockerfile.nemo-skills
echo -e "\n=== Looking for default cluster configs ==="
fd -n 'cluster.*\.yaml|config.*\.yaml' docs/ examples/ | head -20
echo -e "\n=== Checking if nemo_skills is installed as a package ==="
rg -n 'pip install.*nemo.skills|setup.py|pyproject.toml' dockerfiles/Dockerfile.nemo-skills
```

Length of output: 2899
---
🏁 Script executed:
```shell
#!/bin/bash
# Find cluster configs and understand mount setup
echo "=== Look for cluster config examples ==="
find docs examples -name "*.yaml" -o -name "*.yml" 2>/dev/null | head -20
echo -e "\n=== Check exp.py for local executor mount setup ==="
rg -n -B10 -A20 'executor.*local' nemo_skills/pipeline/utils/exp.py | head -60
echo -e "\n=== Check where nemo_skills code is imported at runtime ==="
rg -n 'from nemo_skills|import nemo_skills' nemo_skills/pipeline/ | head -10
```

Length of output: 3691
---
🏁 Script executed:
```shell
#!/bin/bash
# Final check: look for documentation about the mounting requirement
echo "=== Check README for mounting instructions ==="
rg -n -B2 -A2 'mount|volume|docker.*run' README.md | head -40
echo -e "\n=== Look for any setup or quickstart docs ==="
find docs -name "*.md" -type f | xargs grep -l "docker.*mount\|volume" 2>/dev/null | head -5
echo -e "\n=== Check if there's a getting started doc ==="
ls -la docs/*.md docs/*/index.md 2>/dev/null | head -20
```

Length of output: 572
---
🏁 Script executed:
```shell
#!/bin/bash
# Quick check: does pyproject.toml get installed as a package in the container?
echo "=== Check if pyproject.toml is used to install nemo_skills as package ==="
rg -n 'pip install.*pyproject|pip install -e' dockerfiles/Dockerfile.nemo-skills
echo -e "\n=== Check what's in main.txt requirements ==="
head -30 requirements/main.txt
echo -e "\n=== Check if nemo_skills is even listed in requirements ==="
grep -i 'nemo.skills\|nemo-skills' requirements/main.txt
```

Length of output: 1426
Fix the incomplete architectural transition: source code mounting is not implemented in workflows or documentation.

This is a breaking change that's incompletely implemented:

- The Dockerfile expects mounting (line 51 comment) but the code isn't in the container
- Workflows don't mount the source code:
  - tests.yml: `docker run --rm --network=host nemo-skills-sandbox-image` (no mounts)
  - gpu_tests.yml: `docker run --rm -v /tmp:/tmp -v /home:/home nemo-skills-image` (mounts /tmp and /home only)
- No automatic mounting mechanism exists in `DockerExecutor`; `volumes=mounts` only includes user-specified mounts from the cluster config
- No documentation explains this requirement

Required fixes:

- Add a source code mount to CI workflows (e.g., `-v $(pwd):/opt/NeMo-Skills`)
- Document the mounting requirement in the README and docs
- Consider adding default mounting for the local executor or clear error handling if the mount is missing
🤖 Prompt for AI Agents
In dockerfiles/Dockerfile.nemo-skills around lines 51 to 58, the Dockerfile and
project assume the NeMo-Skills source will be mounted into the container but CI
workflows, the DockerExecutor, and documentation were not updated to provide or
require that mount; update CI workflow files (tests.yml and gpu_tests.yml) to
mount the repository into the container (e.g., -v $(pwd):/opt/NeMo-Skills or
equivalent CI-safe path), update DockerExecutor to either add a sensible default
source mount for local runs or detect missing mount and fail early with a clear
error, and add a short note in README/docs explaining that the container expects
the host source to be mounted at /opt/NeMo-Skills and how to run CI/locally with
the mount; ensure the mount path used in workflows matches the path referenced
in the Dockerfile.
docs/basics/index.md:

!!! note
    If you're an NVIDIA employee, we have pre-configured cluster configs for internal usage with pre-built sqsh
    containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely
    skip the step below and reuse one of the existing configurations.

You will also need to build .sqsh files for all containers or upload all `dockerfile:...` containers to
some registry (e.g. dockerhub) and reference the uploaded versions. To build sqsh files you can use the following commands.

1. Build images locally and upload to some container registry. E.g.
   ```bash
   docker build -t gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1 -f dockerfiles/Dockerfile.nemo-skills .
   docker push gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
   ```
2. Start an interactive shell, e.g. with the following (assuming there is a "cpu" partition)
   ```bash
   srun -A <account> --partition cpu --job-name build-sqsh --time=1:00:00 --exclusive --pty /bin/bash -l
   ```
3. Import the image, e.g.:
   ```bash
   enroot import -o /path/to/nemo-skills-image.sqsh --docker://gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1
   ```
4. Specify this image path in your cluster config
   ```yaml
   containers:
     nemo-skills: /path/to/nemo-skills-image.sqsh
   ```
Fix markdown linting issues in the documentation.
The new section provides helpful guidance, but there are a couple of formatting issues:
- Line 180: The bare URL should be formatted as a proper markdown link or wrapped in angle brackets
- Line 204: The closing code fence is missing a language specifier (should be `` ```yaml `` instead of just `` ``` ``)
Apply these fixes:
For line 180 - replace the bare URL with a proper markdown link:

```diff
- containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely
+ containers available at [GitLab](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs). You can most likely
```

For line 204 - the code block starting earlier should close with a language identifier. Based on the content showing yaml examples, verify the code fence at line 204 properly closes the yaml block.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
180-180: Bare URL used
(MD034, no-bare-urls)
204-204: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/basics/index.md around lines 178-204, fix two markdown lint issues:
replace the bare URL on line 180 with a proper markdown link (e.g.
[pre-configured
configs](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs) or wrap
it in angle brackets) and ensure the code block that demonstrates the YAML
cluster config is closed correctly by using a matching fenced code block with
the yaml language identifier (change the trailing ``` to ```yaml).
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (1)
.github/workflows/tests.yml (1)
42-43: Improve readiness verification for the background sandbox container.

The workflow starts a background container and waits a fixed 10 seconds before running tests. This sleep-based approach is fragile and can lead to intermittent test failures if the container takes longer to initialize or fails silently.
Consider adding a health check or readiness probe instead of relying on a fixed delay.
Example approach using a health check loop:
- docker run --rm --network=host nemo-skills-sandbox-image & - sleep 10 + docker run --rm --network=host --name nemo-skills-sandbox nemo-skills-sandbox-image & + # Wait for container to be ready (verify with health check or port) + for i in {1..30}; do + if docker exec nemo-skills-sandbox healthcheck-command &>/dev/null; then + echo "Container is ready" + break + fi + if [ $i -eq 30 ]; then echo "Container failed to start"; exit 1; fi + sleep 1 + doneAlternatively, if the container exposes a port/service, verify connectivity before proceeding with tests.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/tests.yml (1 hunks)
- nemo_skills/pipeline/utils/mounts.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (3)
nemo_skills/pipeline/utils/mounts.py (2)
363-363: LGTM: Docstring accurately reflects function behavior.

The updated docstring correctly describes that this function checks directory existence rather than creating directories.
375-377: Based on my analysis of the codebase and verification of Docker bind mount behavior, I can now provide my final assessment:
Confirm the mount validation skip requires verification and poses a risk.
Docker's `--mount` flag does not automatically create directories if the source path doesn't exist and produces an error instead: "bind source path does not exist". The code change skips validation for non-slurm executors (lines 375-377), which conflicts with Docker's actual runtime behavior.

From the script output, I observed:

- `create_remote_directory()` (lines 299-302) still attempts local directory creation via LocalTunnel for non-slurm executors
- `check_remote_mount_directories()` (lines 375-377) now skips validation entirely for the same executor types
- This asymmetry means directories may be created but never validated to exist before container startup

This creates a gap where configuration errors won't surface until the container attempts to mount, resulting in confusing runtime failures instead of clear validation-time errors during local development.
Recommendation: Verify this behavior change is intentional for the dockerfile-based local workflow. If directories are genuinely not required locally, add documentation explaining why. If this is unintended, the validation should not be skipped.
.github/workflows/tests.yml (1)
35-36: Line 34's comment is inaccurate; image tags don't need to match the test-local.yaml format.

Dockerfiles exist and appear structurally sound. However, the comment on line 34 ("these tags need to match the ones in tests/gpu-tests/test-local.yaml") is misleading. The workflow uses Docker image tags (`nemo-skills-image`, `nemo-skills-sandbox-image`), while test-local.yaml uses the `dockerfile:dockerfiles/...` reference format. These are different test frameworks and their naming conventions don't need to match.
Additionally, line 42's container readiness check relies on an arbitrary 10-second sleep with no health verification. If the sandbox service requires more time to start, tests will fail without clear diagnostics. Consider replacing this with a health check or explicit readiness validation.
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/conftest.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
tests/conftest.py:

```diff
  if image_name is None:
-     image_name = config["containers"]["nemo-skills"]
+     image_name = "bash"
```
🧩 Analysis chain
Verify that "bash" is the correct default image for tests.
Changing the default from config["containers"]["nemo-skills"] to the literal string "bash" is concerning. The public bash Docker image is a minimal shell environment without NeMo Skills dependencies, which would cause test failures.
If the intent is to remove the default entirely (forcing callers to specify images), consider raising an error instead. If a default is still needed, it should integrate with the new dockerfile-based resolution system mentioned in the PR (e.g., nemo_skills/pipeline/utils/docker_images.py).
Run the following script to verify all callers now pass explicit image_name:
🏁 Script executed:

```shell
#!/bin/bash
# Description: Find all calls to docker_run without explicit image_name parameter
# Search for docker_run calls that might rely on the default
rg -n -A 2 -B 2 'docker_run\(' --type=py
```

Length of output: 1172
---
🏁 Script executed:
```shell
#!/bin/bash
# Get full conftest.py to understand docker_run implementation and old defaults
cat tests/conftest.py
```

Length of output: 2384
🏁 Script executed:

```shell
#!/bin/bash
# Search for docker_images utilities integration and config references
rg -n 'docker_images|config\[.*containers' --type=py -C 3
```

Length of output: 16462
🏁 Script executed:

```shell
#!/bin/bash
# Check if "bash" image is used as default elsewhere or if it's a custom build
rg -n '"bash"' --type=py tests/
```

Length of output: 48
Fix incomplete refactoring: the docker_run default image must integrate with the docker_images utilities, or the default must be removed entirely.

The change from `config["containers"]["nemo-skills"]` to a hardcoded `"bash"` is a breaking refactor that:

- Leaves 3 callers broken: `docker_rm()` (line 66), `docker_rm_and_mkdir()` (line 72), and test code (line 181) all rely on the default and will now silently use a minimal bash image instead of the full NeMo Skills environment.
- Ignores the docker_images infrastructure: The PR introduces `nemo_skills/pipeline/utils/docker_images.py` with `resolve_container_image()`, which is imported throughout the main codebase (setup.py, exp.py, etc.), but conftest.py doesn't use it and just hardcodes `"bash"`.
- Breaks the abstraction: The production pipeline is config-driven (`cluster_config["containers"]["nemo-skills"]`), but tests are now hardcoded to a public minimal image.

Either integrate conftest.py with the docker_images utilities for proper container resolution, or remove the default and require all callers to pass an explicit image_name. Hardcoding `"bash"` is not maintainable.
🤖 Prompt for AI Agents
In tests/conftest.py around lines 27-28, the default image_name was hardcoded to
"bash", which breaks callers and bypasses the docker_images resolution; replace
this by either (A) invoking the existing docker_images.resolve_container_image
or reading cluster_config["containers"]["nemo-skills"] to compute the default
image (importing the resolver at top), or (B) remove the implicit default and
raise/require callers to pass image_name (update docker_rm, docker_rm_and_mkdir
and tests to supply it). Ensure imports are added and all callers updated if you
choose option B.
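A hedged sketch of option (A): the helper name is hypothetical, and the resolver's `(container, config)` call shape is assumed from the PR's sequence diagram rather than confirmed from the source.

```python
def default_test_image(config, resolver, image_name=None):
    """Derive the test default from the cluster config via the resolver
    instead of hardcoding "bash"."""
    if image_name is None:
        image_name = config["containers"]["nemo-skills"]
    # resolver builds dockerfile: refs locally and passes registry refs through
    return resolver(image_name, config)
```

With this shape, `docker_rm()` and `docker_rm_and_mkdir()` keep working unchanged while still exercising the same image-resolution path as production code.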
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/conftest.py (1)
27-28: The hardcoded "bash" default breaks callers and bypasses docker_images resolution.

This issue was already flagged in the previous review. The change from `config["containers"]["nemo-skills"]` to `"bash"` leaves `docker_rm()` (line 66) and `docker_rm_and_mkdir()` (line 72) using a minimal bash image instead of the NeMo Skills environment, and ignores the PR's docker_images infrastructure with `resolve_container_image()`.
🧹 Nitpick comments (1)
tests/conftest.py (1)
42-42: Consider using the absolute path `/bin/bash` for better robustness.

The change from `/bin/bash -c` to `bash -c` relies on PATH resolution instead of an explicit absolute path. While functionally equivalent in standard containers, using the absolute path is more explicit and portable.

Apply this diff to restore the absolute path:

```diff
- full_command = f"bash -c '{command}'"
+ full_command = f"/bin/bash -c '{command}'"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- tests/conftest.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
New Features
Documentation
Chores
Bug Fix / UX