add container_env to SlurmExecutor for sbatch workloads by sudostock · Pull Request #3034 · NVIDIA-NeMo/Megatron-Bridge

sudostock · 2026-03-30T23:08:27Z

NeMo Run automatically sets --container-env for srun workloads but not sbatch. This means env var overrides that conflict with container defaults are silently dropped for sbatch jobs.

Changes:

Set container_env on SlurmExecutor using PERF_ENV_VARS keys, ensuring performance-critical vars always override container defaults
Add container_env parameter to slurm_executor() for callers that need additional keys
Add --container_env CLI flag to setup_experiment.py for key-only forwarding when the value is already present in the job environment

Background:
By default, Slurm runs with --export=ALL, which passes the entire submission-time environment into the job environment. Pyxis/Enroot does the same when it starts the container, with one exception: if the container image already sets a variable, the container's value is kept unless the user names that variable in the --container-env flag.

This leaves a gap: if any variable in PERF_ENV_VARS is already set in the container, our value is silently dropped. There is also no easy way to override a container default, for example setting NCCL_DEBUG=INFO when the container sets it to WARN.
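
The precedence rule described above can be modelled in a few lines. This is a simplified sketch, not Pyxis or Enroot code: `resolve_job_env` and its arguments are hypothetical names, and it captures only the rule stated here (the image value wins unless the variable name is listed for override).

```python
def resolve_job_env(host_env, image_env, override_names=()):
    """Model which value a process inside the container sees for each variable.

    Assumption: host variables are exported into the container, but a value
    already set in the image wins unless its name is in override_names
    (the role that --container-env plays).
    """
    resolved = dict(image_env)
    for name, value in host_env.items():
        if name not in image_env or name in override_names:
            resolved[name] = value
    return resolved


host = {"NCCL_DEBUG": "INFO", "MY_FLAG": "1"}
image = {"NCCL_DEBUG": "WARN"}

without_override = resolve_job_env(host, image)
with_override = resolve_job_env(host, image, override_names={"NCCL_DEBUG"})
print(without_override["NCCL_DEBUG"])  # WARN: the image default silently wins
print(with_override["NCCL_DEBUG"])     # INFO: the host value overrides it
```

Variables that the image does not define, such as `MY_FLAG` here, pass through in both cases, which is why the problem only shows up for names the container already sets.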

I proposed fixing this directly in NeMo Run (#394) to remove the inconsistency between the srun and sbatch paths. In offline discussions, the recommendation was to fix it in NeMo/Mbridge instead.

Summary by CodeRabbit

Release Notes

New Features
- Users can now specify environment variables via command-line that should override container image defaults when running performance experiments, enabling better customization of execution environments.
Tests
- Added comprehensive test coverage for environment variable propagation in container execution workflows, ensuring proper handling, deduplication, and consistent ordering of variables.

copy-pr-bot · 2026-03-30T23:08:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-01T00:43:01Z

📝 Walkthrough

A new container_env CLI argument and parameter enable users to specify environment variable names that should override container image values. This parameter threads through argument_parser.py, setup_experiment.py, and slurm_executor(), where it is merged with default performance environment variables for container execution.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| CLI Argument Definition: `scripts/performance/argument_parser.py` | Added `--container_env` CLI argument to accept comma-separated environment variable names with default empty list. |
| Parameter Threading: `scripts/performance/setup_experiment.py` | Extended `main()` signature with `container_env` parameter and forwarded it from CLI args to `slurm_executor()`. |
| Executor Implementation: `scripts/performance/utils/executors.py` | Added `container_env` parameter to `slurm_executor()` and merged it with `PERF_ENV_VARS` keys using deduplication and sorted ordering. |
| Test Suite: `tests/unit_tests/scripts/performance/test_executors.py` | Added three unit tests validating that default performance variables, custom environment variables, and user-supplied `container_env` values are all present in the executor's `container_env`. |
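
As an illustration of the CLI row, a comma-separated flag with an empty-list default could be parsed roughly as follows. This is a sketch under assumptions: `parse_name_list` is a hypothetical helper, and the actual definition in `argument_parser.py` may differ.

```python
import argparse


def parse_name_list(raw: str) -> list[str]:
    """Split a comma-separated string into names, dropping blanks and whitespace."""
    return [name.strip() for name in raw.split(",") if name.strip()]


parser = argparse.ArgumentParser()
parser.add_argument(
    "--container_env",
    type=parse_name_list,  # hypothetical converter, not taken from the PR
    default=[],
    help="Comma-separated env var names whose job-environment values override the image.",
)

args = parser.parse_args(["--container_env", "NCCL_DEBUG,UPSTREAM_SET_VAR"])
print(args.container_env)  # ['NCCL_DEBUG', 'UPSTREAM_SET_VAR']
```

Only names are passed, not values: the flag is meant for variables whose values are already present in the job environment.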

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ❓ Inconclusive | Test file existence cannot be verified in repository, and PR description lacks documented test execution results or CI/CD verification. | Verify test file exists in PR, confirm test execution results are documented, and clarify if change qualifies as major requiring explicit test result documentation. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding container_env parameter to SlurmExecutor for sbatch workloads, which is the core focus of all modified files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

coderabbitai

🧹 Nitpick comments (2)

tests/unit_tests/scripts/performance/test_executors.py (1)

39-70: Add pytest marker for unit test categorization.

Per coding guidelines, unit tests should use pytest markers to categorize tests. Add @pytest.mark.unit to these tests.

Also, the file is missing a trailing newline at the end (line 70).

♻️ Proposed fix

+import pytest
+
 @pytest.mark.skipif(not HAS_NEMO_RUN, reason="nemo_run not installed")
+@pytest.mark.unit
 def test_container_env_includes_perf_vars(tmp_path):
     """PERF_ENV_VARS keys must appear in container_env so they override container defaults."""
     ...


 @pytest.mark.skipif(not HAS_NEMO_RUN, reason="nemo_run not installed")
+@pytest.mark.unit
 def test_custom_env_vars_in_container_env(tmp_path):
     """Vars passed via custom_env_vars must also appear in container_env."""
     ...


 @pytest.mark.skipif(not HAS_NEMO_RUN, reason="nemo_run not installed")
+@pytest.mark.unit
 def test_container_env_param_forwarded(tmp_path):
     """Keys passed via the container_env parameter must appear in container_env."""
     ...
     assert "UPSTREAM_SET_VAR" in executor.container_env
+

As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)" for tests/**/*.py.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/scripts/performance/test_executors.py` around lines 39 - 70,
Add the pytest unit marker to each test function here by prepending
`@pytest.mark.unit` above test_container_env_includes_perf_vars,
test_custom_env_vars_in_container_env, and test_container_env_param_forwarded
(they already have `@pytest.mark.skipif`); ensure the marker is combined with the
existing skipif (one per test) so tests are categorized as unit tests, and also
add a trailing newline at the end of the file.

scripts/performance/utils/executors.py (1)

63-63: Mutable default argument flagged by static analysis.

Ruff B006 flags container_env: List[str] = [] as a mutable default. However, this follows the existing pattern on lines 60-62 (custom_mounts, custom_env_vars, custom_srun_args). The code is safe here because container_env is only read (passed to set()), not mutated.

Consider addressing all mutable defaults in a separate refactor if desired, but this is consistent with the current codebase style.

♻️ Optional fix to align with best practices

 def slurm_executor(
     gpu: str,
     account: str,
     partition: str,
     log_dir: str,
     nodes: int,
     num_gpus_per_node: int,
     time_limit: str = "00:30:00",
     container_image: str = "nvcr.io/nvidia/nemo:dev",
-    custom_mounts: List[str] = [],
-    custom_env_vars: Dict[str, str] = {},
-    custom_srun_args: List[str] = [],
-    container_env: List[str] = [],
+    custom_mounts: List[str] | None = None,
+    custom_env_vars: Dict[str, str] | None = None,
+    custom_srun_args: List[str] | None = None,
+    container_env: List[str] | None = None,
     hf_token: str = None,
     ...
 ) -> run.SlurmExecutor:
+    custom_mounts = custom_mounts if custom_mounts is not None else []
+    custom_env_vars = custom_env_vars if custom_env_vars is not None else {}
+    custom_srun_args = custom_srun_args if custom_srun_args is not None else []
+    container_env = container_env if container_env is not None else []
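
For context on why B006 exists and why the read-only use is harmless, here is a small standalone sketch. The function names are illustrative only and do not come from `executors.py`.

```python
from typing import List, Optional


def append_shared(item: str, bucket: List[str] = []) -> List[str]:
    """Anti-pattern: the default list is created once and shared across calls."""
    bucket.append(item)
    return bucket


def append_fresh(item: str, bucket: Optional[List[str]] = None) -> List[str]:
    """None sentinel: each call without an argument gets its own list."""
    bucket = bucket if bucket is not None else []
    bucket.append(item)
    return bucket


def merged_names(extra: List[str] = []) -> List[str]:
    """Read-only use of a mutable default, as with set(container_env): no leak."""
    return sorted({"NCCL_DEBUG"} | set(extra))


first, second = append_shared("a"), append_shared("b")
print(second)                                # ['a', 'b']: state leaked between calls
print(append_fresh("a"), append_fresh("b"))  # ['a'] ['b']
print(merged_names(), merged_names())        # ['NCCL_DEBUG'] ['NCCL_DEBUG']
```

The third function behaves correctly today, but it stays correct only as long as nobody later mutates the default inside the function body, which is the case for moving to the `None` sentinel.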

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/utils/executors.py` at line 63, The function parameter
container_env: List[str] = [] uses a mutable default which triggers Ruff B006;
change it to container_env: Optional[List[str]] = None and inside the function
coerce it to an empty list (e.g., container_env = container_env or []) before it
is used (same treatment for the similar parameters custom_mounts,
custom_env_vars, custom_srun_args if you want to fully eliminate mutable
defaults) so callers keep behavior but the default is no longer a shared mutable
object; update import typing to include Optional if needed and ensure any code
using set(container_env) still works after the None-to-list coercion.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between c8eb587 and 0ea3fc1.

📒 Files selected for processing (4)

scripts/performance/argument_parser.py
scripts/performance/setup_experiment.py
scripts/performance/utils/executors.py
tests/unit_tests/scripts/performance/test_executors.py

sudostock · 2026-04-01T03:06:08Z

I also ported the perf_env dict-copy change from #2847. This prevents a bug that depends on the order in which the two PRs are merged, and it pre-empts the Claude review flag raised every time PERF_ENV_VARS is mutated in a change.
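
A minimal sketch of the difference a dict copy makes, assuming the change amounts to working on a copy of the module-level defaults. The dict contents and the two helper functions are stand-ins for illustration, not code from #2847.

```python
# Stand-in for the module-level defaults; the real PERF_ENV_VARS has other entries.
PERF_ENV_VARS = {"EXAMPLE_PERF_VAR": "1"}


def build_env_shared(custom_env_vars):
    """Mutates the module-level dict, so overrides leak into later calls."""
    PERF_ENV_VARS.update(custom_env_vars)
    return PERF_ENV_VARS


def build_env_copied(custom_env_vars):
    """Works on a shallow copy, leaving the module-level defaults untouched."""
    perf_env = dict(PERF_ENV_VARS)
    perf_env.update(custom_env_vars)
    return perf_env


copied = build_env_copied({"NCCL_DEBUG": "INFO"})
defaults_after_copy = dict(PERF_ENV_VARS)
print("NCCL_DEBUG" in defaults_after_copy)  # False: defaults are unchanged

build_env_shared({"NCCL_DEBUG": "INFO"})
print("NCCL_DEBUG" in PERF_ENV_VARS)        # True: the shared dict was mutated
```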

ko3n1g · 2026-04-01T07:11:57Z

@malay-nagda can you take a look?

malay-nagda · 2026-04-01T07:12:46Z

scripts/performance/utils/executors.py

        container_mounts=mounts,
-        env_vars=PERF_ENV_VARS,
+        env_vars=perf_env,
+        container_env=sorted(set(perf_env.keys()) | set(container_env)),


Is it possible to just set container_env=perf_env here instead of adding an extra user arg?
Users can still specify vars to override via custom_env_vars, and those would be picked up by container_env here, right?

sudostock · 2026-04-01

Yes, since the primary bug is the missing container_env=perf_env. The main downside of dropping the standalone flag is for scripting.

Using -E or -ce together with container_env=perf_env would handle one of the main pain points today, such as overriding NCCL_DEBUG.

I'm fine with removing the flag; we can revisit if its absence causes enough friction.
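
The trade-off in this thread can be seen in a small sketch of the merge expression from the diff. It assumes that custom_env_vars have already been merged into perf_env before the executor is built; `merge_container_env` is a hypothetical wrapper around that expression and the dict contents are stand-ins.

```python
def merge_container_env(perf_env, container_env=()):
    """Union the env-var names with extra key-only names, deduplicated and sorted."""
    return sorted(set(perf_env.keys()) | set(container_env))


perf_env = {"EXAMPLE_PERF_VAR": "1"}     # stand-in defaults
perf_env.update({"NCCL_DEBUG": "INFO"})  # as if supplied via custom_env_vars

print(merge_container_env(perf_env))
# ['EXAMPLE_PERF_VAR', 'NCCL_DEBUG']
print(merge_container_env(perf_env, ["UPSTREAM_SET_VAR", "NCCL_DEBUG"]))
# ['EXAMPLE_PERF_VAR', 'NCCL_DEBUG', 'UPSTREAM_SET_VAR']
```

Under that assumption, anything passed with a value is covered without the extra argument; the separate list only matters for names such as `UPSTREAM_SET_VAR` whose values are set elsewhere in the job environment.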

sudostock force-pushed the afilby/container-env branch from 5d93fda to 6cbe786 on March 30, 2026 23:22

Set nemo vars in slurm container-env field.

c6ebec2

Additionally adds a flag for users to flag their own vars that need override. Signed-off-by: Alex Filby <afilby@nvidia.com>

sudostock force-pushed the afilby/container-env branch from 6cbe786 to c6ebec2 on March 31, 2026 22:56

Add unit test for container-env.

0ea3fc1

Signed-off-by: Alex Filby <afilby@nvidia.com>

sudostock marked this pull request as ready for review April 1, 2026 00:38

sudostock requested review from a team, erhoo82 and malay-nagda as code owners April 1, 2026 00:38

coderabbitai bot reviewed Apr 1, 2026

Port perf_env change from #2847

653b67b

Otherwise depending on merge order the container_env field ends up still pointing to PERF_ENV_VARS. Signed-off-by: Alex Filby <afilby@nvidia.com>

malay-nagda reviewed Apr 1, 2026

malay-nagda requested a review from ko3n1g April 1, 2026 07:13

yaoyu-33 added the feature, area:training, and needs-review labels Apr 1, 2026

sudostock wants to merge 3 commits into main from afilby/container-env

4 participants
