Switch to building containers on-the-fly for local runs #969
Changes from all commits
9c64f66
04a3f61
4be4547
1ee91cd
02b2610
e50ce6d
84da1ef
2e17547
b64ba26
f139e70
b823b48
a451bcc
e3d14ef
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,4 @@ | ||
| recursive-include nemo_skills *.yaml | ||
| recursive-include nemo_skills *.txt | ||
| graft dockerfiles | ||
| graft requirements |
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -98,9 +98,12 @@ config might look like | |
| executor: local | ||
|
|
||
| containers: | ||
| # some containers are public and we pull them | ||
| trtllm: nvcr.io/nvidia/tensorrt-llm/release:1.0.0 | ||
| vllm: vllm/vllm-openai:v0.10.1.1 | ||
| nemo: igitman/nemo-skills-nemo:0.7.0 | ||
| # some containers are custom and we will build them locally before running the job | ||
| # you can always pre-build them as well | ||
| nemo-skills: dockerfile:dockerfiles/Dockerfile.nemo-skills | ||
| # ... there are some more containers defined here | ||
|
|
||
| env_vars: | ||
|
|
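A minimal sketch of how the two kinds of entries in the `containers:` section above could be told apart: plain image references are pulled, while `dockerfile:` entries are built locally first. The helper name `resolve_container` and the `locally-built-<name>` tag are hypothetical and are not taken from this PR; only the `dockerfile:` prefix and the example image names come from the config shown.

```python
from pathlib import Path

DOCKERFILE_PREFIX = "dockerfile:"  # prefix used in the cluster config above


def resolve_container(name: str, spec: str, repo_root: str = ".") -> dict:
    """Classify one `containers:` entry as 'pull' or 'build' (hypothetical helper)."""
    if spec.startswith(DOCKERFILE_PREFIX):
        dockerfile = Path(repo_root) / spec[len(DOCKERFILE_PREFIX):]
        tag = f"locally-built-{name}"  # assumed tag scheme, for illustration only
        return {
            "action": "build",
            "image": tag,
            "command": ["docker", "build", "-t", tag, "-f", str(dockerfile), repo_root],
        }
    # Anything without the prefix is treated as a registry image reference.
    return {"action": "pull", "image": spec, "command": ["docker", "pull", spec]}


containers = {
    "vllm": "vllm/vllm-openai:v0.10.1.1",
    "nemo-skills": "dockerfile:dockerfiles/Dockerfile.nemo-skills",
}
plan = {name: resolve_container(name, spec) for name, spec in containers.items()}
for name, step in plan.items():
    print(name, step["action"], " ".join(step["command"]))
```

The sketch only builds the command lists; running them (for example with `subprocess.run`) is left out so the example has no side effects.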
@@ -172,6 +175,34 @@ leverage a Slurm cluster[^2]. Let's setup our cluster config for that case by ru | |
| This time pick `slurm` for the config type and fill out all other required information | ||
| (such as ssh access, account, partition, etc.). | ||
|
|
||
| !!! note | ||
| If you're an NVIDIA employee, we have pre-configured cluster configs for internal usage with pre-built sqsh | ||
| containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely | ||
| skip the step below and reuse one of the existing configurations. | ||
|
|
||
| You will also need to build .sqsh files for all containers or upload all `dockerfile:...` containers to | ||
| some registry (e.g. dockerhub) and reference the uploaded versions. To build sqsh files you can use the following commands | ||
|
|
||
| 1. Build images locally and upload to some container registry. E.g. | ||
| ```bash | ||
| docker build -t gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1 -f dockerfiles/Dockerfile.nemo-skills . | ||
| docker push gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1 | ||
| ``` | ||
| 2. Start an interactive shell, e.g. with the following (assuming there is a "cpu" partition) | ||
| ```bash | ||
| srun -A <account> --partition cpu --job-name build-sqsh --time=1:00:00 --exclusive --pty /bin/bash -l | ||
| ``` | ||
| 3. Import the image, e.g.: | ||
| ```bash | ||
| enroot import -o /path/to/nemo-skills-image.sqsh docker://gitlab-master.nvidia.com#igitman/nemo-skills-containers:nemo-skills-0.7.1 | ||
| ``` | ||
| 4. Specify this image path in your cluster config | ||
| ```yaml | ||
| containers: | ||
| nemo-skills: /path/to/nemo-skills-image.sqsh | ||
| ``` | ||
| ``` | ||
|
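The build, push, and import steps in the hunk above can be sketched as a small command generator. This is an illustration under stated assumptions, not code from the PR: the helper names `to_enroot_uri` and `sqsh_commands` are hypothetical, and the registry detection is a simple heuristic (the first path component counts as a registry if it contains a dot or a port, or is `localhost`). It relies on enroot's documented `docker://[USER@][REGISTRY#]IMAGE[:TAG]` URI form, in which the registry is separated from the image by `#`.

```python
import shlex


def to_enroot_uri(image: str) -> str:
    """Convert 'registry/path:tag' into enroot's docker://REGISTRY#IMAGE[:TAG] form."""
    first, sep, rest = image.partition("/")
    # Heuristic (assumption): a hostname-like first component is a registry.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return f"docker://{first}#{rest}"
    return f"docker://{image}"


def sqsh_commands(image: str, dockerfile: str, sqsh_path: str) -> list:
    """Return the build, push, and import commands as argument lists."""
    return [
        ["docker", "build", "-t", image, "-f", dockerfile, "."],
        ["docker", "push", image],
        # Meant to be run on the cluster, e.g. inside the interactive srun shell.
        ["enroot", "import", "-o", sqsh_path, to_enroot_uri(image)],
    ]


image = "gitlab-master.nvidia.com/igitman/nemo-skills-containers:nemo-skills-0.7.1"
commands = sqsh_commands(image, "dockerfiles/Dockerfile.nemo-skills", "/path/to/nemo-skills-image.sqsh")
for cmd in commands:
    print(shlex.join(cmd))
```

`shlex.join` quotes the enroot URI when printing because it contains `#`, which keeps the printed command safe to paste into a shell.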
Comment on lines +178 to +204 (Contributor)
Fix markdown linting issues in the documentation.
The new section provides helpful guidance, but there are a couple of formatting issues. Apply these fixes:
For line 180, replace the bare URL with a proper markdown link:
- containers available at https://gitlab-master.nvidia.com/igitman/nemo-skills-configs. You can most likely
+ containers available at [GitLab](https://gitlab-master.nvidia.com/igitman/nemo-skills-configs). You can most likely
For line 204, the code block starting earlier should close with a language identifier. Based on the content showing yaml examples, verify the code fence at line 204 properly closes the yaml block.
Tools: markdownlint-cli2 (0.18.1)
- 180-180: Bare URL used (MD034, no-bare-urls)
- 204-204: Fenced code blocks should have a language specified (MD040, fenced-code-language)
||
|
|
||
| Now that we have a slurm config setup, we can try running some jobs. Generally, you will need to upload models / data | ||
| on cluster manually and then reference a proper mounted path. But for small-scale things we can also leverage the | ||
| [code packaging](./code-packaging.md) functionality that nemo-skills provide. Whenever you run any of the ns commands | ||
|
|
||
🧩 Analysis chain
Verify that all container usage contexts properly mount the nemo_skills code.
This is a significant architectural change: the image no longer contains the nemo_skills source code and instead expects it to be mounted from the host at runtime. While this is excellent for local development workflows (allowing live code changes), it introduces a dependency on proper mount configuration.
Ensure that:
Analysis scripts executed (outputs omitted), including an examination of the core Docker execution code and mount handling.
Fix the incomplete architectural transition: source code mounting not implemented in workflows or documentation.
This is a breaking change that's incompletely implemented:
- `tests.yml`: `docker run --rm --network=host nemo-skills-sandbox-image` (no mounts)
- `gpu_tests.yml`: `docker run --rm -v /tmp:/tmp -v /home:/home nemo-skills-image` (mounts /tmp and /home only)
- `DockerExecutor`: `volumes=mounts` only includes user-specified mounts from cluster config

Required fixes:
- ... `-v $(pwd):/opt/NeMo-Skills`)
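One way the missing mount described in this comment could be handled is to append the repository mount to whatever the cluster config already specifies. The sketch below is an assumption-laden illustration, not the project's `DockerExecutor` code: `with_source_mount` and `docker_run_args` are hypothetical helpers, and `/opt/NeMo-Skills` is taken from the `-v $(pwd):/opt/NeMo-Skills` fragment in the review.

```python
def with_source_mount(mounts, host_repo, container_path="/opt/NeMo-Skills"):
    """Return mounts plus host_repo:container_path, unless that target is already mounted."""
    # Each mount is assumed to be a 'host:container[:mode]' string.
    targets = {m.split(":")[1] for m in mounts if ":" in m}
    if container_path in targets:
        return list(mounts)
    return [*mounts, f"{host_repo}:{container_path}"]


def docker_run_args(image, mounts):
    """Build a `docker run` argument list with one -v flag per mount."""
    args = ["docker", "run", "--rm"]
    for mount in mounts:
        args += ["-v", mount]
    return args + [image]


user_mounts = ["/tmp:/tmp", "/home:/home"]  # as in the gpu_tests.yml example
mounts = with_source_mount(user_mounts, "/work/NeMo-Skills")  # hypothetical host path
run_args = docker_run_args("nemo-skills-image", mounts)
print(" ".join(run_args))
```

Calling `with_source_mount` on its own output leaves the list unchanged, so it could be applied in several code paths without duplicating the mount.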