
add container_env to SlurmExecutor for sbatch workloads #3034

Open
sudostock wants to merge 3 commits into main from afilby/container-env

Conversation

@sudostock
Contributor

@sudostock sudostock commented Mar 30, 2026

NeMo Run automatically sets --container-env for srun workloads but not sbatch. This means env var overrides that conflict with container defaults are silently dropped for sbatch jobs.

Changes:

  • Set container_env on SlurmExecutor using PERF_ENV_VARS keys, ensuring performance-critical vars always override container defaults
  • Add container_env parameter to slurm_executor() for callers that need additional keys
  • Add --container_env CLI flag to setup_experiment.py for key-only forwarding when the value is already present in the job environment

Background:
By default, Slurm uses --export=ALL, which passes the entire submission-time environment into the job environment. Pyxis/Enroot likewise passes the job environment into the container, except when the container image already sets a variable, in which case the image's value wins unless the user requests an override with the --container-env flag.

This leaves a gap: if any variable in PERF_ENV_VARS is already set in the container image, the value we set is silently dropped. There is also no easy way to override something like NCCL_DEBUG=INFO when the container sets it to 'WARN'.
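
The precedence described in this background can be modeled with a small sketch. It is a simplified illustration of the behavior as stated in this PR, not Pyxis or Enroot code; `resolve_container_env` and its parameters are hypothetical names chosen for the example.

```python
from typing import Dict, Iterable


def resolve_container_env(
    job_env: Dict[str, str],
    image_env: Dict[str, str],
    container_env: Iterable[str] = (),
) -> Dict[str, str]:
    """Model which value a process inside the container would see.

    Assumption: this follows only the behavior described in the PR text.
    """
    override = set(container_env)
    resolved = dict(image_env)  # start from the image defaults
    for name, value in job_env.items():
        # A job value fills in a missing name; it replaces an image default
        # only when the name is explicitly listed for override.
        if name not in image_env or name in override:
            resolved[name] = value
    return resolved


job_env = {"NCCL_DEBUG": "INFO", "MY_FLAG": "1"}
image_env = {"NCCL_DEBUG": "WARN"}

# Without an override list, the image default for NCCL_DEBUG is kept.
without_override = resolve_container_env(job_env, image_env)
# Listing the name lets the job value replace the image default.
with_override = resolve_container_env(job_env, image_env, ["NCCL_DEBUG"])
```

In this model the job's NCCL_DEBUG only reaches the container when its name is listed, which is the gap the PR closes for sbatch jobs.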

I proposed fixing this directly in NeMo Run (#394) to resolve the inconsistency between the srun and sbatch paths. In offline discussions, the recommendation was to fix it in NeMo/Mbridge instead.

Summary by CodeRabbit

Release Notes

  • New Features

    • Users can now specify environment variables via command-line that should override container image defaults when running performance experiments, enabling better customization of execution environments.
  • Tests

    • Added comprehensive test coverage for environment variable propagation in container execution workflows, ensuring proper handling, deduplication, and consistent ordering of variables.

@copy-pr-bot

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@sudostock sudostock force-pushed the afilby/container-env branch from 5d93fda to 6cbe786 Compare March 30, 2026 23:22
Additionally adds a flag for users to flag their own vars that need
override.

Signed-off-by: Alex Filby <afilby@nvidia.com>
@sudostock sudostock force-pushed the afilby/container-env branch from 6cbe786 to c6ebec2 Compare March 31, 2026 22:56
Signed-off-by: Alex Filby <afilby@nvidia.com>
@sudostock sudostock marked this pull request as ready for review April 1, 2026 00:38
@sudostock sudostock requested review from a team, erhoo82 and malay-nagda as code owners April 1, 2026 00:38
@coderabbitai
Contributor

coderabbitai bot commented Apr 1, 2026

📝 Walkthrough

A new container_env CLI argument and parameter enable users to specify environment variable names that should override container image values. This parameter threads through argument_parser.py, setup_experiment.py, and slurm_executor(), where it is merged with default performance environment variables for container execution.
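
A minimal sketch of how a comma-separated `--container_env` flag could be parsed with argparse. The actual definition in argument_parser.py is not shown on this page, so the `parse_name_list` helper and the help text are assumptions made for illustration.

```python
import argparse
from typing import List


def parse_name_list(raw: str) -> List[str]:
    """Split a comma-separated string into names, dropping blanks."""
    return [name.strip() for name in raw.split(",") if name.strip()]


parser = argparse.ArgumentParser()
parser.add_argument(
    "--container_env",
    type=parse_name_list,  # hypothetical helper, applied to the raw string
    default=[],            # an empty list when the flag is not given
    help="Comma-separated env var names that should override image defaults.",
)

args = parser.parse_args(["--container_env", "NCCL_DEBUG, UPSTREAM_SET_VAR"])
no_flag = parser.parse_args([])
```

Only names are accepted here, not values, which matches the key-only forwarding described in the PR: the value is expected to already be present in the job environment.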

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CLI Argument Definition (scripts/performance/argument_parser.py) | Added --container_env CLI argument to accept comma-separated environment variable names with default empty list. |
| Parameter Threading (scripts/performance/setup_experiment.py) | Extended main() signature with container_env parameter and forwarded it from CLI args to slurm_executor(). |
| Executor Implementation (scripts/performance/utils/executors.py) | Added container_env parameter to slurm_executor() and merged it with PERF_ENV_VARS keys using deduplication and sorted ordering. |
| Test Suite (tests/unit_tests/scripts/performance/test_executors.py) | Added three unit tests validating that default performance variables, custom environment variables, and user-supplied container_env values are all present in the executor's container_env. |
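
The merge with deduplication and sorted ordering can be sketched as follows. The expression is the one that appears in the diff under review; the `merged_container_env` wrapper and the contents of `PERF_ENV_VARS` are stand-ins, since the real dictionary is not shown on this page.

```python
from typing import Dict, Iterable, List

# Stand-in values; the real PERF_ENV_VARS contents are an assumption here.
PERF_ENV_VARS: Dict[str, str] = {"EXAMPLE_PERF_VAR": "1", "NCCL_DEBUG": "WARN"}


def merged_container_env(
    perf_env: Dict[str, str], container_env: Iterable[str]
) -> List[str]:
    """Union the env var names, drop duplicates, and return a stable order."""
    return sorted(set(perf_env.keys()) | set(container_env))


# NCCL_DEBUG appears in both inputs but only once in the result.
result = merged_container_env(PERF_ENV_VARS, ["UPSTREAM_SET_VAR", "NCCL_DEBUG"])
```

Sorting makes the resulting list deterministic, so repeated runs with the same inputs produce the same executor configuration.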

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ❓ Inconclusive | Test file existence cannot be verified in repository, and PR description lacks documented test execution results or CI/CD verification. | Verify test file exists in PR, confirm test execution results are documented, and clarify if change qualifies as major requiring explicit test result documentation. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding container_env parameter to SlurmExecutor for sbatch workloads, which is the core focus of all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
tests/unit_tests/scripts/performance/test_executors.py (1)

39-70: Add pytest marker for unit test categorization.

Per coding guidelines, unit tests should use pytest markers to categorize tests. Add @pytest.mark.unit to these tests.

Also, the file is missing a trailing newline at the end (line 70).

♻️ Proposed fix
+import pytest
+
 @pytest.mark.skipif(not HAS_NEMO_RUN, reason="nemo_run not installed")
+@pytest.mark.unit
 def test_container_env_includes_perf_vars(tmp_path):
     """PERF_ENV_VARS keys must appear in container_env so they override container defaults."""
     ...


 @pytest.mark.skipif(not HAS_NEMO_RUN, reason="nemo_run not installed")
+@pytest.mark.unit
 def test_custom_env_vars_in_container_env(tmp_path):
     """Vars passed via custom_env_vars must also appear in container_env."""
     ...


 @pytest.mark.skipif(not HAS_NEMO_RUN, reason="nemo_run not installed")
+@pytest.mark.unit
 def test_container_env_param_forwarded(tmp_path):
     """Keys passed via the container_env parameter must appear in container_env."""
     ...
     assert "UPSTREAM_SET_VAR" in executor.container_env
+

As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)" for tests/**/*.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/scripts/performance/test_executors.py` around lines 39 - 70,
Add the pytest unit marker to each test function here by prepending
`@pytest.mark.unit` above test_container_env_includes_perf_vars,
test_custom_env_vars_in_container_env, and test_container_env_param_forwarded
(they already have `@pytest.mark.skipif`); ensure the marker is combined with the
existing skipif (one per test) so tests are categorized as unit tests, and also
add a trailing newline at the end of the file.
scripts/performance/utils/executors.py (1)

63-63: Mutable default argument flagged by static analysis.

Ruff B006 flags container_env: List[str] = [] as a mutable default. However, this follows the existing pattern on lines 60-62 (custom_mounts, custom_env_vars, custom_srun_args). The code is safe here because container_env is only read (passed to set()), not mutated.

Consider addressing all mutable defaults in a separate refactor if desired, but this is consistent with the current codebase style.

♻️ Optional fix to align with best practices
 def slurm_executor(
     gpu: str,
     account: str,
     partition: str,
     log_dir: str,
     nodes: int,
     num_gpus_per_node: int,
     time_limit: str = "00:30:00",
     container_image: str = "nvcr.io/nvidia/nemo:dev",
-    custom_mounts: List[str] = [],
-    custom_env_vars: Dict[str, str] = {},
-    custom_srun_args: List[str] = [],
-    container_env: List[str] = [],
+    custom_mounts: List[str] | None = None,
+    custom_env_vars: Dict[str, str] | None = None,
+    custom_srun_args: List[str] | None = None,
+    container_env: List[str] | None = None,
     hf_token: str = None,
     ...
 ) -> run.SlurmExecutor:
+    custom_mounts = custom_mounts if custom_mounts is not None else []
+    custom_env_vars = custom_env_vars if custom_env_vars is not None else {}
+    custom_srun_args = custom_srun_args if custom_srun_args is not None else []
+    container_env = container_env if container_env is not None else []
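
For context on why Ruff B006 exists, here is a small standalone sketch contrasting a shared mutable default with the None-coercion pattern from the optional fix. The function names are hypothetical and unrelated to slurm_executor(), which only reads its defaults and so does not exhibit this problem.

```python
from typing import List, Optional


def shared_default(names: List[str] = []) -> List[str]:
    # The default list is created once, at function definition time,
    # so every call without an argument mutates the same object.
    names.append("X")
    return names


def fresh_default(names: Optional[List[str]] = None) -> List[str]:
    # Coerce None to a new list on each call; copy a caller-supplied list
    # so the caller's object is not mutated either.
    names = list(names) if names is not None else []
    names.append("X")
    return names


first = shared_default()
second = shared_default()   # same list object as `first`, now with two items
third = fresh_default()
fourth = fresh_default()    # independent list, one item
```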
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/executors.py` at line 63, The function parameter
container_env: List[str] = [] uses a mutable default which triggers Ruff B006;
change it to container_env: Optional[List[str]] = None and inside the function
coerce it to an empty list (e.g., container_env = container_env or []) before it
is used (same treatment for the similar parameters custom_mounts,
custom_env_vars, custom_srun_args if you want to fully eliminate mutable
defaults) so callers keep behavior but the default is no longer a shared mutable
object; update import typing to include Optional if needed and ensure any code
using set(container_env) still works after the None-to-list coercion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0747f5ba-2264-4353-b0df-f10c0fd68586

📥 Commits

Reviewing files that changed from the base of the PR and between c8eb587 and 0ea3fc1.

📒 Files selected for processing (4)
  • scripts/performance/argument_parser.py
  • scripts/performance/setup_experiment.py
  • scripts/performance/utils/executors.py
  • tests/unit_tests/scripts/performance/test_executors.py

Otherwise depending on merge order the container_env field ends up still
pointing to PERF_ENV_VARS.

Signed-off-by: Alex Filby <afilby@nvidia.com>
@sudostock
Contributor Author

I also included the perf_env dict copy change from #2847. This prevents a bug that would otherwise depend on the order in which the two PRs are merged, and it pre-empts the Claude review flag raised every time PERF_ENV_VARS is mutated in a change.
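
A rough sketch of the copy-before-update idea, assuming custom_env_vars is merged on top of the defaults. The `build_perf_env` helper and the `PERF_ENV_VARS` contents are hypothetical; the actual change in #2847 is not shown on this page.

```python
from typing import Dict

# Stand-in for the module-level defaults; real contents are an assumption.
PERF_ENV_VARS: Dict[str, str] = {"EXAMPLE_PERF_VAR": "1"}


def build_perf_env(custom_env_vars: Dict[str, str]) -> Dict[str, str]:
    """Return per-call env vars without touching the shared defaults."""
    perf_env = dict(PERF_ENV_VARS)    # shallow copy of the module-level dict
    perf_env.update(custom_env_vars)  # user values win over defaults
    return perf_env


perf_env = build_perf_env({"NCCL_DEBUG": "INFO"})
```

Because the executor derives container_env from perf_env rather than from PERF_ENV_VARS, a user-supplied name such as NCCL_DEBUG ends up in both env_vars and container_env, while the module-level dict stays unchanged between calls.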

@ko3n1g
Contributor

ko3n1g commented Apr 1, 2026

@malay-nagda can you take a look?

     container_mounts=mounts,
-    env_vars=PERF_ENV_VARS,
+    env_vars=perf_env,
+    container_env=sorted(set(perf_env.keys()) | set(container_env)),
Contributor


Is it possible to just set container_env=perf_env here instead of adding an extra user arg?
Users can specify vars to override via custom_env_vars and they will still be picked up by container_env here?

Contributor Author


Yes, as the primary bug is lack of container_env=perf_env. The main downside to not having the standalone flag is for scripting.

Using -E or -ce plus the container_env=perf_env would handle one of the main pain points today with something like NCCL_DEBUG.

I'm fine with removing the flag, we can revisit if the lack of flag causes us enough friction.

@malay-nagda malay-nagda requested a review from ko3n1g April 1, 2026 07:13
@yaoyu-33 yaoyu-33 added the feature, area:training, and needs-review labels Apr 1, 2026

Labels

  • area:training: Training loop, callbacks, and runtime integration
  • feature: New capabilities, enhancements, or enablement work
  • needs-review: PR is ready for code review and waiting on a reviewer


4 participants