Update IOI to support interleaving + small improvements (#919)
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Walkthrough

Refactors IOI evaluation into an async, class-based, sandbox-backed evaluator with precompilation and multiprocessing; records per-problem/subtask metrics; propagates per-benchmark sandbox env var overrides through the pipeline into docker; updates the dataset prepare schema (score → subtask_score); and makes UWSGI startup vars environment-aware with a STATEFUL_SANDBOX knob.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Pipeline as Pipeline (eval/exp)
    participant Dataset as Dataset Module (ioi24/ioi25)
    participant Job as Job Prep
    participant Sandbox as Sandbox Runner
    participant Docker as Docker
    User->>Pipeline: start evaluation
    Pipeline->>Dataset: import module / read SANDBOX_ENV_VARS
    Dataset-->>Pipeline: SANDBOX_ENV_VARS (e.g., UWSGI_*, STATEFUL_SANDBOX)
    Pipeline->>Job: incorporate sandbox_env_overrides into job args
    Job->>Sandbox: add_task(..., sandbox_env_overrides)
    Sandbox->>Docker: docker run -e KEY=VAL ... (conditionally include UWSGI_*)
    Docker-->>Sandbox: container starts with provided env
    Sandbox-->>Pipeline: service available / runs evaluator jobs
```

```mermaid
sequenceDiagram
    autonumber
    actor Orchestrator as IOIEvaluator
    participant FS as Filesystem
    participant SB as Sandbox
    participant Pool as Multiprocessing Pool
    Orchestrator->>Orchestrator: _initialize_runtime()
    Orchestrator->>FS: load metadata/tests
    Orchestrator->>SB: _precompile_grader() (cached)
    loop each entry
        Orchestrator->>SB: _evaluate_entry(entry) -> compile/run
        SB-->>Orchestrator: compile/run outputs
        Orchestrator->>Pool: run test batches
        Pool-->>Orchestrator: batch results
        Orchestrator->>Orchestrator: aggregate subtask scores
    end
    Orchestrator-->>Orchestrator: eval_full() returns structured results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- dockerfiles/sandbox/start-with-nginx.sh (1 hunks)
- nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh (1 hunks)
- nemo_skills/dataset/ioi24/__init__.py (1 hunks)
- nemo_skills/dataset/ioi24/prepare.py (2 hunks)
- nemo_skills/dataset/ioi25/__init__.py (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/ioi.py (5 hunks)
- nemo_skills/evaluation/metrics/ioi_metrics.py (3 hunks)
- nemo_skills/pipeline/utils/eval.py (1 hunks)
- nemo_skills/pipeline/utils/exp.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/evaluation/evaluator/ioi.py (3)
- nemo_skills/code_execution/sandbox.py (4): LocalSandbox (337-411), unroll_files (39-42), execute_code (140-254), close (77-79)
- nemo_skills/evaluation/evaluator/base.py (1): BaseEvaluator (26-88)
- nemo_skills/file_utils.py (1): jdump (68-100)

nemo_skills/evaluation/evaluator/__init__.py (1)
- nemo_skills/evaluation/evaluator/ioi.py (1): IOIEvaluator (247-396)
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/ioi.py
67-68: try-except-pass detected, consider logging the exception
(S110)
67-67: Do not catch blind exception: Exception
(BLE001)
70-70: Avoid specifying long messages outside the exception class
(TRY003)
91-91: Probable insecure usage of temporary file or directory: "/tmp/ioi_pre_"
(S108)
121-121: Probable insecure usage of temporary file or directory: "/tmp/ioi_run_"
(S108)
149-149: Abstract raise to an inner function
(TRY301)
149-149: Create your own exception
(TRY002)
149-149: Avoid specifying long messages outside the exception class
(TRY003)
263-263: Comment contains ambiguous ‐ (HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
271-275: Avoid specifying long messages outside the exception class
(TRY003)
350-350: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
```python
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
```
Fix missing eval_status propagation
_evaluate_entry never sets eval_status, so o["eval_status"] raises a KeyError the first time this block runs, aborting the evaluation. Include an eval_status field in the per-entry result (or guard this assignment) so the merge succeeds.
```diff
 return {
     "name": entry["name"],
     "subtask": entry["subtask"],
     "test_case_results": test_case_results,
+    "eval_status": "completed",
 }
```

📝 Committable suggestion
```python
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
```

```python
return {
    "name": entry["name"],
    "subtask": entry["subtask"],
    "test_case_results": test_case_results,
    "eval_status": "completed",
}
```
🧰 Tools
🪛 Ruff (0.13.3)
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 385 to 388, the merge
loop assumes each per-entry result has an "eval_status" key but _evaluate_entry
never sets it, causing a KeyError; update _evaluate_entry to include an
"eval_status" field (e.g., "success"/"failed"/"skipped" or similar) in its
returned dict for every code path, or alternatively change the merge to use a
safe getter (e.g., s["eval_status"] = o.get("eval_status", "<default>")) so the
assignment never raises; ensure the chosen default value is consistent with
downstream consumers.
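As a runnable illustration of the safe-getter alternative (the function name and the "completed" default are assumptions for this sketch, not the repository's actual code), the merge can be guarded like this:

```python
# Hypothetical sketch of the guarded merge: read "eval_status" with a
# default so per-entry results produced before the fix cannot raise a
# KeyError and abort the evaluation.
def merge_outputs(all_samples, outputs, default_status="completed"):
    for sample, out in zip(all_samples, outputs):
        sample["test_case_results"] = out["test_case_results"]
        # dict.get() never raises; missing keys fall back to the default
        sample["eval_status"] = out.get("eval_status", default_status)
    return all_samples
```

Setting eval_status inside _evaluate_entry itself remains the cleaner fix, since downstream consumers then see an explicit status for every code path.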
```python
max_subtasks[sub["subtask"]] = sub["score"]
max_total = sum(max_subtasks.values())
print(f"# {name}: {achieved_total}/{max_total}")
for subtask, achieved in achieved_subtasks.items():
    print(f"  {subtask}: {achieved}/{max_subtasks[subtask]}")
```
Fix max score denominator in print_problem_scores.
Line 103 sets max_subtasks[sub["subtask"]] = sub["score"], but sub["score"] is the achieved score for that submission. We end up printing achieved/achieved even when a subtask wasn’t solved. Please derive the denominator from the per-subtask maximum (e.g., result["max_score"] within test_case_results, falling back to a stored subtask_score) so the report reflects the real target.
🤖 Prompt for AI Agents
In nemo_skills/evaluation/metrics/ioi_metrics.py around lines 103 to 107, the
code currently sets max_subtasks[sub["subtask"]] = sub["score"] which uses the
achieved score as the denominator; change this so the denominator is the
per-subtask maximum instead: when iterating test_case_results pull the max value
from result.get("max_score") (falling back to any stored subtask_score constant
if missing) and assign max_subtasks[subtask] to that maximum (or the max of
existing and this value if multiple results map to the same subtask); then
recompute max_total as the sum of these true maxima so the printed lines show
achieved/max for each subtask and overall.
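A small sketch of the suggested fix (the max_score field and the fallback constant are assumptions taken from the comment above, not verified against the repository):

```python
# Derive per-subtask maxima from each result's "max_score" (falling back
# to a stored subtask_score) instead of the achieved "score".
def collect_max_subtasks(test_case_results, fallback_subtask_score=100):
    max_subtasks = {}
    for sub in test_case_results:
        name = sub["subtask"]
        max_score = sub.get("max_score", fallback_subtask_score)
        # if several results map to the same subtask, keep the largest maximum
        max_subtasks[name] = max(max_subtasks.get(name, 0), max_score)
    return max_subtasks
```

With this, max_total = sum(max_subtasks.values()) reflects the real target even when a subtask scored zero.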
```shell
# defaults that work for most cases.

# shell parameter expansion with default value
: "${UWSGI_PROCESSES:=2}"  # default to 2 processes per worker if unset
```
@gwarmstrong any problem for ipython session affinity with this default?
Can we keep default to 1 if you're anyway overriding this in your benchmarks?
yeah changing this would cause problems for session affinity, agree with @Kipok
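For context on the discussion, the `:=` form being debated both expands and assigns: the script-level default applies only when the benchmark has not already exported the variable (values here are illustrative):

```shell
#!/bin/sh
# ":=" assigns the default only when the variable is unset or empty;
# ":" is a no-op command that merely forces the expansion to happen.
unset UWSGI_PROCESSES
: "${UWSGI_PROCESSES:=1}"
echo "unset case -> $UWSGI_PROCESSES"

UWSGI_PROCESSES=1024
: "${UWSGI_PROCESSES:=1}"
echo "exported case -> $UWSGI_PROCESSES"
```

So keeping the default at 1 is safe for session affinity while benchmarks like IOI can still override it through their own env vars.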
nemo_skills/pipeline/utils/exp.py (outdated)
```python
}
for override in cluster_config.get("module_env_vars", []):
    key, value = override.split("=", 1)
    sandbox_env_updates.setdefault(key, value)
```
if these are just for the sandbox, let's rename ENV_VARS to SANDBOX_ENV_VARS ?
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/pipeline/utils/eval.py (1)
190-193: Normalize ENV_VARS inputs and support a SANDBOX_ENV_VARS alias. Handle dict/list/str forms, validate entries, and prefer the clearer alias SANDBOX_ENV_VARS while keeping ENV_VARS/env_vars for backward compatibility.
Apply this diff:
```diff
-    env_vars_from_module = getattr(benchmark_module, "ENV_VARS", getattr(benchmark_module, "env_vars", []))
-    sandbox_env_overrides = list(env_vars_from_module) if env_vars_from_module else []
+    # Prefer explicit SANDBOX_ENV_VARS; fallback to legacy names
+    env_vars_from_module = (
+        getattr(benchmark_module, "SANDBOX_ENV_VARS", None)
+        or getattr(benchmark_module, "ENV_VARS", None)
+        or getattr(benchmark_module, "env_vars", None)
+        or []
+    )
+    sandbox_env_overrides: list[str] = []
+    if isinstance(env_vars_from_module, dict):
+        sandbox_env_overrides = [f"{k}={v}" for k, v in env_vars_from_module.items()]
+    elif isinstance(env_vars_from_module, (list, tuple, set)):
+        for item in env_vars_from_module:
+            if isinstance(item, str) and "=" in item:
+                sandbox_env_overrides.append(item)
+            else:
+                LOG.warning("Invalid ENV_VARS item %r; expected 'KEY=VALUE'. Skipping.", item)
+    elif isinstance(env_vars_from_module, str):
+        if "=" in env_vars_from_module:
+            sandbox_env_overrides = [env_vars_from_module]
+        else:
+            LOG.warning("Invalid ENV_VARS string %r; expected 'KEY=VALUE'. Skipping.", env_vars_from_module)
```
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/exp.py (1)
369-369: Document new parameter behavior (precedence and reserved keys). Please extend the add_task docstring to describe sandbox_env_overrides (format KEY=VALUE, precedence vs defaults, reserved keys like LISTEN_PORT/NGINX_PORT not overridable).
nemo_skills/pipeline/utils/eval.py (1)
528-536: Validate per-job overrides; guard malformed entries. Keep the "first wins" order but skip bad entries with a warning to avoid a ValueError on split.
Apply this diff:
```diff
-    for b in ordered_benchmarks:
-        for override in benchmarks_dict[b].sandbox_env_overrides:
-            key, value = override.split("=", 1)
-            if key not in env_map:
-                env_map[key] = value
+    for b in ordered_benchmarks:
+        for override in benchmarks_dict[b].sandbox_env_overrides:
+            if "=" not in override:
+                LOG.warning("Invalid sandbox override %r in benchmark %s; skipping.", override, b)
+                continue
+            key, value = override.split("=", 1)
+            if key not in env_map:
+                env_map[key] = value
```

nemo_skills/pipeline/eval.py (1)
365-365: LGTM: forward overrides to task creation. Optional: pass None when sandbox is disabled to make intent explicit.
```python
sandbox_env_overrides = job_sandbox_env_overrides if (job_needs_sandbox or with_sandbox) else None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- nemo_skills/pipeline/eval.py (2 hunks)
- nemo_skills/pipeline/utils/eval.py (5 hunks)
- nemo_skills/pipeline/utils/exp.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (4)

nemo_skills/pipeline/utils/eval.py (3)

105-107: LGTM: per-benchmark sandbox overrides field added. Good use of default_factory for an immutable default.

240-241: LGTM: propagate overrides into BenchmarkArgs. Correctly threads sandbox_env_overrides.

549-549: LGTM: job-level overrides carried in job tuple. Keeps the scheduling surface clean and localized per job.

nemo_skills/pipeline/eval.py (1)

345-346: LGTM: unpack job_sandbox_env_overrides. Tuple shape stays consistent with prepare_eval_commands.
```python
if sandbox_env_overrides:
    for override in sandbox_env_overrides:
        key, value = override.split("=", 1)
        sandbox_env_updates.setdefault(key, value)
current_env_vars = cluster_config.get("env_vars", []).copy()
```
Harden override parsing and make precedence explicit (skip malformed, protect ports).
- Validate KEY=VALUE format; skip and warn on bad entries.
- Keep LISTEN_PORT/NGINX_PORT owned by scheduler; don’t allow overrides.
- Assign overrides for other keys (not setdefault).
Apply this diff:
```diff
-    if sandbox_env_overrides:
-        for override in sandbox_env_overrides:
-            key, value = override.split("=", 1)
-            sandbox_env_updates.setdefault(key, value)
+    if sandbox_env_overrides:
+        for override in sandbox_env_overrides:
+            if "=" not in override:
+                LOG.warning("Invalid sandbox env override %r; expected KEY=VALUE. Skipping.", override)
+                continue
+            key, value = override.split("=", 1)
+            if key in ("LISTEN_PORT", "NGINX_PORT"):
+                LOG.debug("Ignoring override of reserved key %s", key)
+                continue
+            sandbox_env_updates[key] = value
```

Optional: simplify PYTHONPATH handling using normalized env vars (outside the selected lines):

```python
# Prefer normalized env vars over raw strings
env_dict = get_env_variables(cluster_config)
if env_dict.get("PYTHONPATH"):
    sandbox_env_updates["PYTHONPATH"] = f'{env_dict["PYTHONPATH"]}:/app'
```

🤖 Prompt for AI Agents
In nemo_skills/pipeline/utils/exp.py around lines 529-533, the override parsing
should validate entries, reject malformed KEY=VALUE pairs with a warning, and
prevent overrides of LISTEN_PORT and NGINX_PORT (these must remain owned by the
scheduler); for valid entries assign sandbox_env_updates[key] = value (do not
use setdefault) so overrides take precedence. Implement: for each override,
check it contains exactly one '=' and a non-empty key; if invalid, log a warning
and continue; if key in {"LISTEN_PORT","NGINX_PORT"} log a debug/warning and
skip; otherwise set sandbox_env_updates[key] = value. Optionally (outside these
lines) normalize env vars via get_env_variables(cluster_config) and, if
PYTHONPATH exists, set sandbox_env_updates["PYTHONPATH"] =
f'{env_dict["PYTHONPATH"]}:/app'.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- dockerfiles/sandbox/start-with-nginx.sh (1 hunks)
- nemo_skills/dataset/ioi25/__init__.py (1 hunks)
- nemo_skills/pipeline/utils/eval.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- nemo_skills/dataset/ioi25/__init__.py
- dockerfiles/sandbox/start-with-nginx.sh
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/pipeline/utils/eval.py
192-192: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
536-540: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (1)
nemo_skills/pipeline/utils/eval.py (1)
528-556: LGTM: per-job aggregation with conflict detection. The approach of storing sandbox environment overrides per-benchmark and aggregating them at the job level with first-wins conflict detection properly addresses the past review concern about leaking env vars between benchmarks. The ValueError on conflicting values aligns with the maintainer's preference to fail explicitly rather than silently override.
Based on past review comments.
```python
EVAL_ARGS = "++eval_type=ioi"

# environment variables required by this benchmark
ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]
```
```diff
-ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]
+SANDBOX_ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/pipeline/utils/eval.py (1)
536-540: Consider defining a custom exception class. The static analysis tool suggests avoiding long error messages outside the exception class. While the current implementation is clear and functional, you could optionally refactor this into a custom exception class if you prefer to follow this style guideline.
Example refactor:

```python
class ConflictingSandboxEnvOverridesError(ValueError):
    def __init__(self, key: str, existing_value: str, existing_source: str, new_value: str, new_source: str):
        super().__init__(
            f"Conflicting sandbox environment overrides for key '{key}': "
            f"'{existing_value}' (from {existing_source}) vs '{new_value}' (from {new_source}). "
            "Please submit the benchmarks as separate jobs or increase num_jobs so they do not share a job."
        )
```

Then use:

```python
raise ConflictingSandboxEnvOverridesError(key, env_map[key], env_source[key], value, b)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/ioi24/__init__.py (1 hunks)
- nemo_skills/pipeline/utils/eval.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/ioi24/__init__.py
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/pipeline/utils/eval.py
536-540: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (5)

nemo_skills/pipeline/utils/eval.py (5)

23-23: LGTM! The Dict import is necessary for the type hints used in the sandbox environment aggregation logic (lines 530-531).

106-107: LGTM! The sandbox_env_overrides field is well-documented and correctly uses default_factory=list for the mutable default, following dataclass best practices.

191-193: LGTM! The use of getattr with a default value of [] correctly prevents AttributeError when SANDBOX_ENV_VARS is not defined in the benchmark module. The conditional list conversion handles all cases appropriately.

241-241: LGTM! The sandbox_env_overrides parameter is correctly passed to the BenchmarkArgs constructor.

528-556: LGTM! The per-job sandbox environment aggregation logic is well-implemented:
- Correctly aggregates overrides from participating benchmarks in deterministic order
- Uses a first-wins strategy with conflict detection to prevent silent errors
- Properly handles values containing = using maxsplit=1
- Creates the job_sandbox_env_overrides list for the job batch

All past review concerns have been addressed:
- Env vars are no longer leaked between benchmarks (per-job aggregation)
- The error message formatting is correct (line 537 has the f-prefix)
… into ioi_interleaved
Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
- New Features
- Improvements
- Dataset
- Refactor