add container_env to SlurmExecutor for sbatch workloads#3034
add container_env to SlurmExecutor for sbatch workloads#3034
Conversation
5d93fda to
6cbe786
Compare
Additionally adds a flag for users to flag their own vars that need override. Signed-off-by: Alex Filby <afilby@nvidia.com>
6cbe786 to
c6ebec2
Compare
Signed-off-by: Alex Filby <afilby@nvidia.com>
📝 WalkthroughWalkthroughA new Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/unit_tests/scripts/performance/test_executors.py (1)
39-70: Add pytest marker for unit test categorization.Per coding guidelines, unit tests should use pytest markers to categorize tests. Add
@pytest.mark.unitto these tests.Also, the file is missing a trailing newline at the end (line 70).
♻️ Proposed fix
+import pytest + `@pytest.mark.skipif`(not HAS_NEMO_RUN, reason="nemo_run not installed") +@pytest.mark.unit def test_container_env_includes_perf_vars(tmp_path): """PERF_ENV_VARS keys must appear in container_env so they override container defaults.""" ... `@pytest.mark.skipif`(not HAS_NEMO_RUN, reason="nemo_run not installed") +@pytest.mark.unit def test_custom_env_vars_in_container_env(tmp_path): """Vars passed via custom_env_vars must also appear in container_env.""" ... `@pytest.mark.skipif`(not HAS_NEMO_RUN, reason="nemo_run not installed") +@pytest.mark.unit def test_container_env_param_forwarded(tmp_path): """Keys passed via the container_env parameter must appear in container_env.""" ... assert "UPSTREAM_SET_VAR" in executor.container_env +As per coding guidelines: "Use pytest markers to categorize tests (unit, integration, system)" for
tests/**/*.py.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/scripts/performance/test_executors.py` around lines 39 - 70, Add the pytest unit marker to each test function here by prepending `@pytest.mark.unit` above test_container_env_includes_perf_vars, test_custom_env_vars_in_container_env, and test_container_env_param_forwarded (they already have `@pytest.mark.skipif`); ensure the marker is combined with the existing skipif (one per test) so tests are categorized as unit tests, and also add a trailing newline at the end of the file.scripts/performance/utils/executors.py (1)
63-63: Mutable default argument flagged by static analysis.Ruff B006 flags
container_env: List[str] = []as a mutable default. However, this follows the existing pattern on lines 60-62 (custom_mounts,custom_env_vars,custom_srun_args). The code is safe here becausecontainer_envis only read (passed toset()), not mutated.Consider addressing all mutable defaults in a separate refactor if desired, but this is consistent with the current codebase style.
♻️ Optional fix to align with best practices
def slurm_executor( gpu: str, account: str, partition: str, log_dir: str, nodes: int, num_gpus_per_node: int, time_limit: str = "00:30:00", container_image: str = "nvcr.io/nvidia/nemo:dev", - custom_mounts: List[str] = [], - custom_env_vars: Dict[str, str] = {}, - custom_srun_args: List[str] = [], - container_env: List[str] = [], + custom_mounts: List[str] | None = None, + custom_env_vars: Dict[str, str] | None = None, + custom_srun_args: List[str] | None = None, + container_env: List[str] | None = None, hf_token: str = None, ... ) -> run.SlurmExecutor: + custom_mounts = custom_mounts if custom_mounts is not None else [] + custom_env_vars = custom_env_vars if custom_env_vars is not None else {} + custom_srun_args = custom_srun_args if custom_srun_args is not None else [] + container_env = container_env if container_env is not None else []🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/utils/executors.py` at line 63, The function parameter container_env: List[str] = [] uses a mutable default which triggers Ruff B006; change it to container_env: Optional[List[str]] = None and inside the function coerce it to an empty list (e.g., container_env = container_env or []) before it is used (same treatment for the similar parameters custom_mounts, custom_env_vars, custom_srun_args if you want to fully eliminate mutable defaults) so callers keep behavior but the default is no longer a shared mutable object; update import typing to include Optional if needed and ensure any code using set(container_env) still works after the None-to-list coercion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@scripts/performance/utils/executors.py`:
- Line 63: The function parameter container_env: List[str] = [] uses a mutable
default which triggers Ruff B006; change it to container_env:
Optional[List[str]] = None and inside the function coerce it to an empty list
(e.g., container_env = container_env or []) before it is used (same treatment
for the similar parameters custom_mounts, custom_env_vars, custom_srun_args if
you want to fully eliminate mutable defaults) so callers keep behavior but the
default is no longer a shared mutable object; update import typing to include
Optional if needed and ensure any code using set(container_env) still works
after the None-to-list coercion.
In `@tests/unit_tests/scripts/performance/test_executors.py`:
- Around line 39-70: Add the pytest unit marker to each test function here by
prepending `@pytest.mark.unit` above test_container_env_includes_perf_vars,
test_custom_env_vars_in_container_env, and test_container_env_param_forwarded
(they already have `@pytest.mark.skipif`); ensure the marker is combined with the
existing skipif (one per test) so tests are categorized as unit tests, and also
add a trailing newline at the end of the file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0747f5ba-2264-4353-b0df-f10c0fd68586
📒 Files selected for processing (4)
scripts/performance/argument_parser.pyscripts/performance/setup_experiment.pyscripts/performance/utils/executors.pytests/unit_tests/scripts/performance/test_executors.py
Otherwise depending on merge order the container_env field ends up still pointing to PERF_ENV_VARS. Signed-off-by: Alex Filby <afilby@nvidia.com>
|
I also added the perf_env dict copy change from #2847 to prevent a bug depending on order they get merged and to pre-empt the claude review flag every time PERF_ENV_VAR is mutated in a change. |
|
@malay-nagda can you take a look? |
| container_mounts=mounts, | ||
| env_vars=PERF_ENV_VARS, | ||
| env_vars=perf_env, | ||
| container_env=sorted(set(perf_env.keys()) | set(container_env)), |
There was a problem hiding this comment.
Is it possible to just set container_env=perf_env here instead of adding an extra user arg?
Users can specify vars to override via custom_env_vars and they will still be picked up by container_env here?
There was a problem hiding this comment.
Yes, as the primary bug is lack of container_env=perf_env. The main downside to not having the standalone flag is for scripting.
Using -E or -ce plus the container_env=perf_env would handle one of the main pain points today with something like NCCL_DEBUG.
I'm fine with removing the flag, we can revisit if the lack of flag causes us enough friction.
NeMo Run automatically sets --container-env for srun workloads but not sbatch. This means env var overrides that conflict with container defaults are silently dropped for sbatch jobs.
Changes:
Background:
Slurm by default does
--export=ALLwhich passes the entire env at submission time into the job env. Pyxis/Enroot does the same thing. Except when the container already has that variable set in which case it keeps the container version set. Unless the user tells it to override with--container-envflag.Currently there is a gap if any of the values set in PERF_ENV_VARs ever already exists in the container they will silently be dropped. Additionally there's no easy way to override something like
NCCL_DEBUG=INFOwhich is set 'WARN' in the container.I proposed fixing this directly in Nemo Run (#394) to fix the incongruity between the srun and sbatch ops. In offline discussions the recommendation was to fix it in NeMo/Mbridge instead.
Summary by CodeRabbit
Release Notes
New Features
Tests