
Update IOI to support interleaving + small improvements#919

Merged
Kipok merged 13 commits into main from ioi_interleaved on Oct 13, 2025

Conversation


@SeanNaren (Collaborator) commented Oct 9, 2025

  • Adds interleaving support for IOI
  • Expose UWSGI parameters that can be set in the dataset init.py file
  • Update dataset metadata to reduce bloat in the main JSONL file
  • Added additional summarization logic (per problem/per subtask breakdown)
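To illustrate the second bullet, a benchmark's `__init__.py` can expose UWSGI settings as `KEY=VALUE` strings that the pipeline later applies to the sandbox container. This is a hypothetical sketch (the `as_env_dict` helper is not part of the repo; the values mirror the ones discussed in this PR):

```python
# Hypothetical dataset module config: UWSGI parameters exposed as
# KEY=VALUE strings, picked up by the eval pipeline at job-prep time.
SANDBOX_ENV_VARS = [
    "UWSGI_PROCESSES=1024",
    "UWSGI_CPU_AFFINITY=8",
    "UWSGI_CHEAPER=1023",
    "STATEFUL_SANDBOX=0",
]


def as_env_dict(overrides):
    """Parse KEY=VALUE strings into a dict, splitting on the first '=' only
    so values may themselves contain '='."""
    env = {}
    for item in overrides:
        key, value = item.split("=", 1)
        env[key] = value
    return env
```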

Summary by CodeRabbit

  • New Features

    • IOI25 dataset configuration and class-based IOI evaluator added; benchmarks can provide per-benchmark sandbox environment overrides.
  • Improvements

    • Safer, environment-aware sandbox defaults with a stateful sandbox option and echoed effective UWSGI settings; sandbox-related vars now propagate into run jobs.
    • Evaluation now tracks, prints, and returns per-problem and per-subtask scores.
  • Dataset

    • IOI24 metadata: "score" renamed to "subtask_score" and workflow fields moved to subtask level.
  • Refactor

    • IOI evaluation reworked into a sandboxed, parallel, class-based flow.

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

coderabbitai bot commented Oct 9, 2025

Walkthrough

Refactors IOI evaluation into an async, class-based, sandbox-backed evaluator with precompilation and multiprocessing; records per-problem/subtask metrics; propagates per-benchmark sandbox env var overrides through the pipeline into docker; updates dataset prepare schema (score → subtask_score) and makes UWSGI startup vars environment-aware with a STATEFUL_SANDBOX knob.
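The async, class-based shape described above can be sketched roughly as follows. This is a hypothetical skeleton, not the actual `IOIEvaluator` API: the class name, methods, and the `asyncio.sleep` stand-ins for sandbox calls are all illustrative.

```python
import asyncio


class SketchEvaluator:
    """Illustrative skeleton: async per-entry evaluation with a
    precompile cache, loosely mirroring the flow this PR describes."""

    def __init__(self):
        self._grader_cache = {}

    async def _precompile_grader(self, problem):
        # Compile once per problem; later entries reuse the cached artifact.
        if problem not in self._grader_cache:
            await asyncio.sleep(0)  # stand-in for a sandbox compile call
            self._grader_cache[problem] = f"{problem}.bin"
        return self._grader_cache[problem]

    async def _evaluate_entry(self, entry):
        grader = await self._precompile_grader(entry["name"])
        # stand-in for running test batches and aggregating subtask scores
        return {"name": entry["name"], "grader": grader, "eval_status": "completed"}

    async def eval_full(self, entries):
        return await asyncio.gather(*(self._evaluate_entry(e) for e in entries))
```

Two entries for the same problem would hit the grader cache once, which is the point of the precompilation step.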

Changes

Cohort / File(s) Summary
Sandbox startup & local run
dockerfiles/sandbox/start-with-nginx.sh, nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh
Add STATEFUL_SANDBOX knob; conditionally set/export UWSGI_PROCESSES and UWSGI_CHEAPER based on it; echo effective UWSGI settings; conditionally pass UWSGI_CPU_AFFINITY and UWSGI_PROCESSES into docker run.
Dataset configs (IOI)
nemo_skills/dataset/ioi24/__init__.py, nemo_skills/dataset/ioi24/prepare.py, nemo_skills/dataset/ioi25/__init__.py
Add SANDBOX_ENV_VARS (now includes STATEFUL_SANDBOX=0) to ioi24/ioi25 modules. In ioi24 prepare.py: rename per-test score → subtask_score, remove top-level run/compile, and move run/compile/grader_files into subtask entries.
Evaluator registry
nemo_skills/evaluation/evaluator/__init__.py
Swap function-based ioi import for class: eval_ioi → IOIEvaluator; register "ioi": IOIEvaluator in class map (removed former function mapping).
IOI evaluator implementation
nemo_skills/evaluation/evaluator/ioi.py
Replace procedural evaluator with IOIEvaluator (inherits BaseEvaluator): async runtime init, _evaluate_entry, eval_full/eval_single, sandbox exec helpers, precompile/grader cache, per-thread event-loop handling, multiprocessing batch execution, file/runtime checks, and expanded config fields (test_file, num_workers, test_batch_size, overwrite).
Metrics (IOI)
nemo_skills/evaluation/metrics/ioi_metrics.py
Build per-subtask scores incrementally, introduce self.problem_scores and print_problem_scores, have get_metrics populate and return metrics, and reset clears problem_scores.
Pipeline env propagation
nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/exp.py, nemo_skills/pipeline/eval.py
Add BenchmarkArgs.sandbox_env_overrides; read SANDBOX_ENV_VARS from benchmark modules; aggregate per-job overrides with first‑wins merging and conflict detection (raise on conflict); pass job_sandbox_env_overrides through job creation and have add_task apply KEY=VALUE overrides when composing sandbox env.
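The first-wins merge with conflict detection described in the last row can be sketched as a standalone function. This is an illustrative reconstruction of the behavior, not the repo's actual code:

```python
def merge_job_overrides(benchmark_overrides):
    """Merge per-benchmark KEY=VALUE overrides into one job-level env map.

    The first benchmark to set a key wins; a later benchmark setting the
    same key to a *different* value is a conflict and raises, rather than
    silently overriding.
    """
    env_map, env_source = {}, {}
    for benchmark, overrides in benchmark_overrides.items():
        for override in overrides:
            # maxsplit=1 so values may themselves contain '='
            key, value = override.split("=", 1)
            if key not in env_map:
                env_map[key] = value
                env_source[key] = benchmark
            elif env_map[key] != value:
                raise ValueError(
                    f"Conflicting sandbox env override for '{key}': "
                    f"'{env_map[key]}' (from {env_source[key]}) vs "
                    f"'{value}' (from {benchmark})"
                )
    return [f"{k}={v}" for k, v in env_map.items()]
```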

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Pipeline as Pipeline (eval/exp)
  participant Dataset as Dataset Module (ioi24/ioi25)
  participant Job as Job Prep
  participant Sandbox as Sandbox Runner
  participant Docker as Docker

  User->>Pipeline: start evaluation
  Pipeline->>Dataset: import module / read SANDBOX_ENV_VARS
  Dataset-->>Pipeline: SANDBOX_ENV_VARS (e.g., UWSGI_*, STATEFUL_SANDBOX)
  Pipeline->>Job: incorporate sandbox_env_overrides into job args
  Job->>Sandbox: add_task(..., sandbox_env_overrides)
  Sandbox->>Docker: docker run -e KEY=VAL ... (conditionally include UWSGI_*)
  Docker-->>Sandbox: container starts with provided env
  Sandbox-->>Pipeline: service available / runs evaluator jobs
sequenceDiagram
  autonumber
  actor Orchestrator as IOIEvaluator
  participant FS as Filesystem
  participant SB as Sandbox
  participant Pool as Multiprocessing Pool

  Orchestrator->>Orchestrator: _initialize_runtime()
  Orchestrator->>FS: load metadata/tests
  Orchestrator->>SB: _precompile_grader() (cached)
  loop each entry
    Orchestrator->>SB: _evaluate_entry(entry) -> compile/run
    SB-->>Orchestrator: compile/run outputs
    Orchestrator->>Pool: run test batches
    Pool-->>Orchestrator: batch results
    Orchestrator->>Orchestrator: aggregate subtask scores
  end
  Orchestrator-->>Orchestrator: eval_full() returns structured results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I twitch my whiskers at the build,
env knobs set and sandboxes filled.
Async hops and cached compile,
subtasks scored in tidy file.
I nibble logs and drum my paw—pipeline hums, hoorah! 🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title succinctly references the primary addition of interleaving support for IOI and acknowledges the miscellaneous smaller improvements, aligning with the pull request objectives and providing clear context for reviewers.
  • Docstring Coverage — Passed: no functions found in the changes, so the docstring coverage check was skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50821b7 and 344dca9.

📒 Files selected for processing (3)
  • dockerfiles/sandbox/start-with-nginx.sh (1 hunks)
  • nemo_skills/dataset/ioi24/__init__.py (1 hunks)
  • nemo_skills/dataset/ioi25/__init__.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • dockerfiles/sandbox/start-with-nginx.sh
  • nemo_skills/dataset/ioi24/__init__.py
  • nemo_skills/dataset/ioi25/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e5877c and 75268e6.

📒 Files selected for processing (10)
  • dockerfiles/sandbox/start-with-nginx.sh (1 hunks)
  • nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh (1 hunks)
  • nemo_skills/dataset/ioi24/__init__.py (1 hunks)
  • nemo_skills/dataset/ioi24/prepare.py (2 hunks)
  • nemo_skills/dataset/ioi25/__init__.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/ioi.py (5 hunks)
  • nemo_skills/evaluation/metrics/ioi_metrics.py (3 hunks)
  • nemo_skills/pipeline/utils/eval.py (1 hunks)
  • nemo_skills/pipeline/utils/exp.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/evaluation/evaluator/ioi.py (3)
nemo_skills/code_execution/sandbox.py (4)
  • LocalSandbox (337-411)
  • unroll_files (39-42)
  • execute_code (140-254)
  • close (77-79)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluator (26-88)
nemo_skills/file_utils.py (1)
  • jdump (68-100)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/ioi.py (1)
  • IOIEvaluator (247-396)
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/ioi.py

67-68: try-except-pass detected, consider logging the exception

(S110)


67-67: Do not catch blind exception: Exception

(BLE001)


70-70: Avoid specifying long messages outside the exception class

(TRY003)


91-91: Probable insecure usage of temporary file or directory: "/tmp/ioi_pre_"

(S108)


121-121: Probable insecure usage of temporary file or directory: "/tmp/ioi_run_"

(S108)


149-149: Abstract raise to an inner function

(TRY301)


149-149: Create your own exception

(TRY002)


149-149: Avoid specifying long messages outside the exception class

(TRY003)


263-263: Comment contains ambiguous `‐` (HYPHEN). Did you mean `-` (HYPHEN-MINUS)?

(RUF003)


271-275: Avoid specifying long messages outside the exception class

(TRY003)


350-350: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit

Comment on lines +385 to 388
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]


⚠️ Potential issue | 🔴 Critical

Fix missing eval_status propagation

_evaluate_entry never sets eval_status, so o["eval_status"] raises a KeyError the first time this block runs, aborting the evaluation. Include an eval_status field in the per-entry result (or guard this assignment) so the merge succeeds.

         return {
             "name": entry["name"],
             "subtask": entry["subtask"],
             "test_case_results": test_case_results,
+            "eval_status": "completed",
         }
📝 Committable suggestion

Suggested change
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
return {
    "name": entry["name"],
    "subtask": entry["subtask"],
    "test_case_results": test_case_results,
    "eval_status": "completed",
}
🧰 Tools
🪛 Ruff (0.13.3)

385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 385 to 388, the merge
loop assumes each per-entry result has an "eval_status" key but _evaluate_entry
never sets it, causing a KeyError; update _evaluate_entry to include an
"eval_status" field (e.g., "success"/"failed"/"skipped" or similar) in its
returned dict for every code path, or alternatively change the merge to use a
safe getter (e.g., s["eval_status"] = o.get("eval_status", "<default>")) so the
assignment never raises; ensure the chosen default value is consistent with
downstream consumers.

Comment on lines +103 to +107
    max_subtasks[sub["subtask"]] = sub["score"]
max_total = sum(max_subtasks.values())
print(f"# {name}: {achieved_total}/{max_total}")
for subtask, achieved in achieved_subtasks.items():
    print(f"    {subtask}: {achieved}/{max_subtasks[subtask]}")

⚠️ Potential issue | 🟠 Major

Fix max score denominator in print_problem_scores.

Line 103 sets max_subtasks[sub["subtask"]] = sub["score"], but sub["score"] is the achieved score for that submission. We end up printing achieved/achieved even when a subtask wasn’t solved. Please derive the denominator from the per-subtask maximum (e.g., result["max_score"] within test_case_results, falling back to a stored subtask_score) so the report reflects the real target.

🤖 Prompt for AI Agents
In nemo_skills/evaluation/metrics/ioi_metrics.py around lines 103 to 107, the
code currently sets max_subtasks[sub["subtask"]] = sub["score"] which uses the
achieved score as the denominator; change this so the denominator is the
per-subtask maximum instead: when iterating test_case_results pull the max value
from result.get("max_score") (falling back to any stored subtask_score constant
if missing) and assign max_subtasks[subtask] to that maximum (or the max of
existing and this value if multiple results map to the same subtask); then
recompute max_total as the sum of these true maxima so the printed lines show
achieved/max for each subtask and overall.
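The suggested aggregation can be sketched as a small helper. This is illustrative only: the field names `subtask`, `max_score`, and `subtask_score` follow the review comment's assumptions, not verified repo code:

```python
def max_subtask_scores(test_case_results):
    """Derive each subtask's maximum from per-test results rather than
    from the achieved score, so unsolved subtasks still show the real
    target in achieved/max reports."""
    maxima = {}
    for result in test_case_results:
        subtask = result["subtask"]
        # Prefer an explicit max_score; fall back to a stored subtask_score.
        max_score = result.get("max_score", result.get("subtask_score", 0))
        # Multiple results may map to the same subtask; keep the largest.
        maxima[subtask] = max(maxima.get(subtask, 0), max_score)
    return maxima
```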

# defaults that work for most cases.

# shell parameter expansion with default value
: "${UWSGI_PROCESSES:=2}" # default to 2 processes per worker if unset
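The `${parameter:=word}` idiom assigns the default only when the variable is unset or empty, so anything exported into the container always wins over the script's fallback. A minimal demonstration:

```shell
# ${VAR:=default} assigns the default only when VAR is unset or empty.
unset UWSGI_PROCESSES
: "${UWSGI_PROCESSES:=2}"
echo "processes=${UWSGI_PROCESSES}"   # prints "processes=2"

UWSGI_CHEAPER=1023
: "${UWSGI_CHEAPER:=1}"
echo "cheaper=${UWSGI_CHEAPER}"       # prints "cheaper=1023" (existing value kept)
```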

@gwarmstrong any problem for ipython session affinity with this default?


Can we keep default to 1 if you're anyway overriding this in your benchmarks?


yeah changing this would cause problems for session affinity, agree with @Kipok

}
for override in cluster_config.get("module_env_vars", []):
    key, value = override.split("=", 1)
    sandbox_env_updates.setdefault(key, value)

if these are just for the sandbox, let's rename ENV_VARS to SANDBOX_ENV_VARS ?


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/pipeline/utils/eval.py (1)

190-193: Normalize ENV_VARS inputs and support SANDBOX_ENV_VARS alias.

Handle dict/list/str forms, validate entries, and prefer a clearer alias SANDBOX_ENV_VARS while keeping ENV_VARS/env_vars for backward compatibility.

Apply this diff:

-    env_vars_from_module = getattr(benchmark_module, "ENV_VARS", getattr(benchmark_module, "env_vars", []))
-    sandbox_env_overrides = list(env_vars_from_module) if env_vars_from_module else []
+    # Prefer explicit SANDBOX_ENV_VARS; fallback to legacy names
+    env_vars_from_module = (
+        getattr(benchmark_module, "SANDBOX_ENV_VARS", None)
+        or getattr(benchmark_module, "ENV_VARS", None)
+        or getattr(benchmark_module, "env_vars", None)
+        or []
+    )
+    sandbox_env_overrides: list[str] = []
+    if isinstance(env_vars_from_module, dict):
+        sandbox_env_overrides = [f"{k}={v}" for k, v in env_vars_from_module.items()]
+    elif isinstance(env_vars_from_module, (list, tuple, set)):
+        for item in env_vars_from_module:
+            if isinstance(item, str) and "=" in item:
+                sandbox_env_overrides.append(item)
+            else:
+                LOG.warning("Invalid ENV_VARS item %r; expected 'KEY=VALUE'. Skipping.", item)
+    elif isinstance(env_vars_from_module, str):
+        if "=" in env_vars_from_module:
+            sandbox_env_overrides = [env_vars_from_module]
+        else:
+            LOG.warning("Invalid ENV_VARS string %r; expected 'KEY=VALUE'. Skipping.", env_vars_from_module)
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/exp.py (1)

369-369: Document new parameter behavior (precedence and reserved keys).

Please extend the add_task docstring to describe sandbox_env_overrides (format KEY=VALUE, precedence vs defaults, reserved keys like LISTEN_PORT/NGINX_PORT not overridable).

nemo_skills/pipeline/utils/eval.py (1)

528-536: Validate per-job overrides; guard malformed entries.

Keep “first wins” order but skip bad entries with a warning to avoid ValueError on split.

Apply this diff:

-                    for b in ordered_benchmarks:
-                        for override in benchmarks_dict[b].sandbox_env_overrides:
-                            key, value = override.split("=", 1)
-                            if key not in env_map:
-                                env_map[key] = value
+                    for b in ordered_benchmarks:
+                        for override in benchmarks_dict[b].sandbox_env_overrides:
+                            if "=" not in override:
+                                LOG.warning("Invalid sandbox override %r in benchmark %s; skipping.", override, b)
+                                continue
+                            key, value = override.split("=", 1)
+                            if key not in env_map:
+                                env_map[key] = value
nemo_skills/pipeline/eval.py (1)

365-365: LGTM: forward overrides to task creation.

Optional: pass None when sandbox is disabled to make intent explicit.

sandbox_env_overrides = job_sandbox_env_overrides if (job_needs_sandbox or with_sandbox) else None
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 75268e6 and 8c6b14d.

📒 Files selected for processing (3)
  • nemo_skills/pipeline/eval.py (2 hunks)
  • nemo_skills/pipeline/utils/eval.py (5 hunks)
  • nemo_skills/pipeline/utils/exp.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/pipeline/utils/eval.py (3)

105-107: LGTM: per-benchmark sandbox overrides field added.

Good use of default_factory for an immutable default.


240-241: LGTM: propagate overrides into BenchmarkArgs.

Correctly threads sandbox_env_overrides.


549-549: LGTM: job-level overrides carried in job tuple.

Keeps scheduling surface clean and localized per job.

nemo_skills/pipeline/eval.py (1)

345-346: LGTM: unpack job_sandbox_env_overrides.

Tuple shape stays consistent with prepare_eval_commands.

Comment on lines +529 to 533
if sandbox_env_overrides:
    for override in sandbox_env_overrides:
        key, value = override.split("=", 1)
        sandbox_env_updates.setdefault(key, value)
current_env_vars = cluster_config.get("env_vars", []).copy()

⚠️ Potential issue | 🟡 Minor

Harden override parsing and make precedence explicit (skip malformed, protect ports).

  • Validate KEY=VALUE format; skip and warn on bad entries.
  • Keep LISTEN_PORT/NGINX_PORT owned by scheduler; don’t allow overrides.
  • Assign overrides for other keys (not setdefault).

Apply this diff:

-        if sandbox_env_overrides:
-            for override in sandbox_env_overrides:
-                key, value = override.split("=", 1)
-                sandbox_env_updates.setdefault(key, value)
+        if sandbox_env_overrides:
+            for override in sandbox_env_overrides:
+                if "=" not in override:
+                    LOG.warning("Invalid sandbox env override %r; expected KEY=VALUE. Skipping.", override)
+                    continue
+                key, value = override.split("=", 1)
+                if key in ("LISTEN_PORT", "NGINX_PORT"):
+                    LOG.debug("Ignoring override of reserved key %s", key)
+                    continue
+                sandbox_env_updates[key] = value

Optional: simplify PYTHONPATH handling using normalized env vars (outside the selected lines):

# Prefer normalized env vars over raw strings
env_dict = get_env_variables(cluster_config)
if env_dict.get("PYTHONPATH"):
    sandbox_env_updates["PYTHONPATH"] = f'{env_dict["PYTHONPATH"]}:/app'
🤖 Prompt for AI Agents
In nemo_skills/pipeline/utils/exp.py around lines 529-533, the override parsing
should validate entries, reject malformed KEY=VALUE pairs with a warning, and
prevent overrides of LISTEN_PORT and NGINX_PORT (these must remain owned by the
scheduler); for valid entries assign sandbox_env_updates[key] = value (do not
use setdefault) so overrides take precedence. Implement: for each override,
check it contains exactly one '=' and a non-empty key; if invalid, log a warning
and continue; if key in {"LISTEN_PORT","NGINX_PORT"} log a debug/warning and
skip; otherwise set sandbox_env_updates[key] = value. Optionally (outside these
lines) normalize env vars via get_env_variables(cluster_config) and, if
PYTHONPATH exists, set sandbox_env_updates["PYTHONPATH"] =
f'{env_dict["PYTHONPATH"]}:/app'.


Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c6b14d and 8f79870.

📒 Files selected for processing (3)
  • dockerfiles/sandbox/start-with-nginx.sh (1 hunks)
  • nemo_skills/dataset/ioi25/__init__.py (1 hunks)
  • nemo_skills/pipeline/utils/eval.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • nemo_skills/dataset/ioi25/__init__.py
  • dockerfiles/sandbox/start-with-nginx.sh
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/pipeline/utils/eval.py

192-192: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)


536-540: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (1)
nemo_skills/pipeline/utils/eval.py (1)

528-556: LGTM: Per-job aggregation with conflict detection.

The approach of storing sandbox environment overrides per-benchmark and aggregating them at the job level with first-wins conflict detection properly addresses the past review concern about leaking env vars between benchmarks. The ValueError on conflicting values aligns with the maintainer's preference to fail explicitly rather than silently override.

Based on past review comments.

EVAL_ARGS = "++eval_type=ioi"

# environment variables required by this benchmark
ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]

Suggested change
ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]
SANDBOX_ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/pipeline/utils/eval.py (1)

536-540: Consider defining a custom exception class.

The static analysis tool suggests avoiding long error messages outside the exception class. While the current implementation is clear and functional, you could optionally refactor this into a custom exception class if you prefer to follow this style guideline.

Example refactor:

class ConflictingSandboxEnvOverridesError(ValueError):
    def __init__(self, key: str, existing_value: str, existing_source: str, new_value: str, new_source: str):
        super().__init__(
            f"Conflicting sandbox environment overrides for key '{key}': "
            f"'{existing_value}' (from {existing_source}) vs '{new_value}' (from {new_source}). "
            "Please submit the benchmarks as separate jobs or increase num_jobs so they do not share a job."
        )

Then use:

raise ConflictingSandboxEnvOverridesError(key, env_map[key], env_source[key], value, b)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0229b22 and 74c1267.

📒 Files selected for processing (2)
  • nemo_skills/dataset/ioi24/__init__.py (1 hunks)
  • nemo_skills/pipeline/utils/eval.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/ioi24/__init__.py
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/pipeline/utils/eval.py

536-540: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (5)
nemo_skills/pipeline/utils/eval.py (5)

23-23: LGTM!

The Dict import is necessary for the type hints used in the sandbox environment aggregation logic (lines 530-531).


106-107: LGTM!

The sandbox_env_overrides field is well-documented and correctly uses default_factory=list for the mutable default, following dataclass best practices.


191-193: LGTM!

The use of getattr with a default value of [] correctly prevents AttributeError when SANDBOX_ENV_VARS is not defined in the benchmark module. The conditional list conversion handles all cases appropriately.


241-241: LGTM!

The sandbox_env_overrides parameter is correctly passed to the BenchmarkArgs constructor.


528-556: LGTM!

The per-job sandbox environment aggregation logic is well-implemented:

  • Correctly aggregates overrides from participating benchmarks in deterministic order
  • Uses first-wins strategy with conflict detection to prevent silent errors
  • Properly handles values containing = using maxsplit=1
  • Creates job_sandbox_env_overrides list for the job batch

All past review concerns have been addressed:

  • Env vars are no longer leaked between benchmarks (per-job aggregation)
  • The error message formatting is correct (line 537 has the f-prefix)

@Kipok Kipok merged commit 3046e17 into main Oct 13, 2025
7 checks passed
@Kipok Kipok deleted the ioi_interleaved branch October 13, 2025 21:05
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Dec 4, 2025
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>