Update IOI to support interleaving + small improvements (#919)
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Walkthrough

Refactors IOI evaluation into an async, class-based, sandbox-backed evaluator with precompilation and multiprocessing; records per-problem/subtask metrics; propagates per-benchmark sandbox env var overrides through the pipeline into docker; updates the dataset prepare schema (score → subtask_score); and makes UWSGI startup vars environment-aware with a STATEFUL_SANDBOX knob.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Pipeline as Pipeline (eval/exp)
    participant Dataset as Dataset Module (ioi24/ioi25)
    participant Job as Job Prep
    participant Sandbox as Sandbox Runner
    participant Docker as Docker
    User->>Pipeline: start evaluation
    Pipeline->>Dataset: import module / read SANDBOX_ENV_VARS
    Dataset-->>Pipeline: SANDBOX_ENV_VARS (e.g., UWSGI_*, STATEFUL_SANDBOX)
    Pipeline->>Job: incorporate sandbox_env_overrides into job args
    Job->>Sandbox: add_task(..., sandbox_env_overrides)
    Sandbox->>Docker: docker run -e KEY=VAL ... (conditionally include UWSGI_*)
    Docker-->>Sandbox: container starts with provided env
    Sandbox-->>Pipeline: service available / runs evaluator jobs
```

```mermaid
sequenceDiagram
    autonumber
    actor Orchestrator as IOIEvaluator
    participant FS as Filesystem
    participant SB as Sandbox
    participant Pool as Multiprocessing Pool
    Orchestrator->>Orchestrator: _initialize_runtime()
    Orchestrator->>FS: load metadata/tests
    Orchestrator->>SB: _precompile_grader() (cached)
    loop each entry
        Orchestrator->>SB: _evaluate_entry(entry) -> compile/run
        SB-->>Orchestrator: compile/run outputs
        Orchestrator->>Pool: run test batches
        Pool-->>Orchestrator: batch results
        Orchestrator->>Orchestrator: aggregate subtask scores
    end
    Orchestrator-->>Orchestrator: eval_full() returns structured results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ 3 checks passed.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- dockerfiles/sandbox/start-with-nginx.sh (1 hunks)
- nemo_skills/code_execution/local_sandbox/start_local_sandbox.sh (1 hunks)
- nemo_skills/dataset/ioi24/__init__.py (1 hunks)
- nemo_skills/dataset/ioi24/prepare.py (2 hunks)
- nemo_skills/dataset/ioi25/__init__.py (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/ioi.py (5 hunks)
- nemo_skills/evaluation/metrics/ioi_metrics.py (3 hunks)
- nemo_skills/pipeline/utils/eval.py (1 hunks)
- nemo_skills/pipeline/utils/exp.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/evaluation/evaluator/ioi.py (3)
- nemo_skills/code_execution/sandbox.py (4): LocalSandbox (337-411), unroll_files (39-42), execute_code (140-254), close (77-79)
- nemo_skills/evaluation/evaluator/base.py (1): BaseEvaluator (26-88)
- nemo_skills/file_utils.py (1): jdump (68-100)

nemo_skills/evaluation/evaluator/__init__.py (1)
- nemo_skills/evaluation/evaluator/ioi.py (1): IOIEvaluator (247-396)
🪛 Ruff (0.13.3)
nemo_skills/evaluation/evaluator/ioi.py
67-68: try-except-pass detected, consider logging the exception
(S110)
67-67: Do not catch blind exception: Exception
(BLE001)
70-70: Avoid specifying long messages outside the exception class
(TRY003)
91-91: Probable insecure usage of temporary file or directory: "/tmp/ioi_pre_"
(S108)
121-121: Probable insecure usage of temporary file or directory: "/tmp/ioi_run_"
(S108)
149-149: Abstract raise to an inner function
(TRY301)
149-149: Create your own exception
(TRY002)
149-149: Avoid specifying long messages outside the exception class
(TRY003)
263-263: Comment contains ambiguous ‐ (HYPHEN). Did you mean - (HYPHEN-MINUS)?
(RUF003)
271-275: Avoid specifying long messages outside the exception class
(TRY003)
350-350: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
```python
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
```
Fix missing eval_status propagation
_evaluate_entry never sets eval_status, so o["eval_status"] raises a KeyError the first time this block runs, aborting the evaluation. Include an eval_status field in the per-entry result (or guard this assignment) so the merge succeeds.
```diff
 return {
     "name": entry["name"],
     "subtask": entry["subtask"],
     "test_case_results": test_case_results,
+    "eval_status": "completed",
 }
```

📝 Committable suggestion
```python
for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["eval_status"] = o["eval_status"]
```

```python
return {
    "name": entry["name"],
    "subtask": entry["subtask"],
    "test_case_results": test_case_results,
    "eval_status": "completed",
}
```
🧰 Tools
🪛 Ruff (0.13.3)
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 385 to 388, the merge
loop assumes each per-entry result has an "eval_status" key but _evaluate_entry
never sets it, causing a KeyError; update _evaluate_entry to include an
"eval_status" field (e.g., "success"/"failed"/"skipped" or similar) in its
returned dict for every code path, or alternatively change the merge to use a
safe getter (e.g., s["eval_status"] = o.get("eval_status", "<default>")) so the
assignment never raises; ensure the chosen default value is consistent with
downstream consumers.
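As a runnable illustration of the safe-getter alternative (the function name and the "completed" default are assumptions for this sketch, not the repository's actual code), the merge can be guarded like this:

```python
# Hypothetical sketch of the guarded merge: read "eval_status" with a
# default so per-entry results produced before the fix cannot raise a
# KeyError and abort the evaluation.
def merge_outputs(all_samples, outputs, default_status="completed"):
    for sample, out in zip(all_samples, outputs):
        sample["test_case_results"] = out["test_case_results"]
        # dict.get() never raises; missing keys fall back to the default
        sample["eval_status"] = out.get("eval_status", default_status)
    return all_samples
```

Setting eval_status inside _evaluate_entry itself remains the cleaner fix, since downstream consumers then see an explicit status for every code path.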
```python
max_subtasks[sub["subtask"]] = sub["score"]
max_total = sum(max_subtasks.values())
print(f"# {name}: {achieved_total}/{max_total}")
for subtask, achieved in achieved_subtasks.items():
    print(f"  {subtask}: {achieved}/{max_subtasks[subtask]}")
```
Fix max score denominator in print_problem_scores.
Line 103 sets max_subtasks[sub["subtask"]] = sub["score"], but sub["score"] is the achieved score for that submission. We end up printing achieved/achieved even when a subtask wasn’t solved. Please derive the denominator from the per-subtask maximum (e.g., result["max_score"] within test_case_results, falling back to a stored subtask_score) so the report reflects the real target.
🤖 Prompt for AI Agents
In nemo_skills/evaluation/metrics/ioi_metrics.py around lines 103 to 107, the
code currently sets max_subtasks[sub["subtask"]] = sub["score"] which uses the
achieved score as the denominator; change this so the denominator is the
per-subtask maximum instead: when iterating test_case_results pull the max value
from result.get("max_score") (falling back to any stored subtask_score constant
if missing) and assign max_subtasks[subtask] to that maximum (or the max of
existing and this value if multiple results map to the same subtask); then
recompute max_total as the sum of these true maxima so the printed lines show
achieved/max for each subtask and overall.
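A small sketch of the suggested fix (the max_score field and the fallback constant are assumptions taken from the comment above, not verified against the repository):

```python
# Derive per-subtask maxima from each result's "max_score" (falling back
# to a stored subtask_score) instead of the achieved "score".
def collect_max_subtasks(test_case_results, fallback_subtask_score=100):
    max_subtasks = {}
    for sub in test_case_results:
        name = sub["subtask"]
        max_score = sub.get("max_score", fallback_subtask_score)
        # if several results map to the same subtask, keep the largest maximum
        max_subtasks[name] = max(max_subtasks.get(name, 0), max_score)
    return max_subtasks
```

With this, max_total = sum(max_subtasks.values()) reflects the real target even when a subtask scored zero.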
```shell
# defaults that work for most cases.

# shell parameter expansion with default value
: "${UWSGI_PROCESSES:=2}"  # default to 2 processes per worker if unset
```
@gwarmstrong any problem for ipython session affinity with this default?
Can we keep default to 1 if you're anyway overriding this in your benchmarks?
yeah changing this would cause problems for session affinity, agree with @Kipok
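For context on the discussion, the `:=` form being debated both expands and assigns: the script-level default applies only when the benchmark has not already exported the variable (values here are illustrative):

```shell
#!/bin/sh
# ":=" assigns the default only when the variable is unset or empty;
# ":" is a no-op command that merely forces the expansion to happen.
unset UWSGI_PROCESSES
: "${UWSGI_PROCESSES:=1}"
echo "unset case -> $UWSGI_PROCESSES"

UWSGI_PROCESSES=1024
: "${UWSGI_PROCESSES:=1}"
echo "exported case -> $UWSGI_PROCESSES"
```

So keeping the default at 1 is safe for session affinity while benchmarks like IOI can still override it through their own env vars.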
nemo_skills/pipeline/utils/exp.py (outdated)
```python
}
for override in cluster_config.get("module_env_vars", []):
    key, value = override.split("=", 1)
    sandbox_env_updates.setdefault(key, value)
```
if these are just for the sandbox, let's rename ENV_VARS to SANDBOX_ENV_VARS ?
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/pipeline/utils/eval.py (1)
190-193: Normalize ENV_VARS inputs and support a SANDBOX_ENV_VARS alias. Handle dict/list/str forms, validate entries, and prefer the clearer alias SANDBOX_ENV_VARS while keeping ENV_VARS/env_vars for backward compatibility.
Apply this diff:
```diff
-    env_vars_from_module = getattr(benchmark_module, "ENV_VARS", getattr(benchmark_module, "env_vars", []))
-    sandbox_env_overrides = list(env_vars_from_module) if env_vars_from_module else []
+    # Prefer explicit SANDBOX_ENV_VARS; fallback to legacy names
+    env_vars_from_module = (
+        getattr(benchmark_module, "SANDBOX_ENV_VARS", None)
+        or getattr(benchmark_module, "ENV_VARS", None)
+        or getattr(benchmark_module, "env_vars", None)
+        or []
+    )
+    sandbox_env_overrides: list[str] = []
+    if isinstance(env_vars_from_module, dict):
+        sandbox_env_overrides = [f"{k}={v}" for k, v in env_vars_from_module.items()]
+    elif isinstance(env_vars_from_module, (list, tuple, set)):
+        for item in env_vars_from_module:
+            if isinstance(item, str) and "=" in item:
+                sandbox_env_overrides.append(item)
+            else:
+                LOG.warning("Invalid ENV_VARS item %r; expected 'KEY=VALUE'. Skipping.", item)
+    elif isinstance(env_vars_from_module, str):
+        if "=" in env_vars_from_module:
+            sandbox_env_overrides = [env_vars_from_module]
+        else:
+            LOG.warning("Invalid ENV_VARS string %r; expected 'KEY=VALUE'. Skipping.", env_vars_from_module)
```
🧹 Nitpick comments (3)
nemo_skills/pipeline/utils/exp.py (1)
369-369: Document new parameter behavior (precedence and reserved keys). Please extend the add_task docstring to describe sandbox_env_overrides (format KEY=VALUE, precedence vs defaults, reserved keys like LISTEN_PORT/NGINX_PORT not overridable).
nemo_skills/pipeline/utils/eval.py (1)
528-536: Validate per-job overrides; guard malformed entries. Keep the "first wins" order but skip bad entries with a warning to avoid a ValueError on split.
Apply this diff:
```diff
-    for b in ordered_benchmarks:
-        for override in benchmarks_dict[b].sandbox_env_overrides:
-            key, value = override.split("=", 1)
-            if key not in env_map:
-                env_map[key] = value
+    for b in ordered_benchmarks:
+        for override in benchmarks_dict[b].sandbox_env_overrides:
+            if "=" not in override:
+                LOG.warning("Invalid sandbox override %r in benchmark %s; skipping.", override, b)
+                continue
+            key, value = override.split("=", 1)
+            if key not in env_map:
+                env_map[key] = value
```

nemo_skills/pipeline/eval.py (1)
365-365: LGTM: forward overrides to task creation. Optional: pass None when sandbox is disabled to make intent explicit.
```python
sandbox_env_overrides = job_sandbox_env_overrides if (job_needs_sandbox or with_sandbox) else None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- nemo_skills/pipeline/eval.py (2 hunks)
- nemo_skills/pipeline/utils/eval.py (5 hunks)
- nemo_skills/pipeline/utils/exp.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (4)

nemo_skills/pipeline/utils/eval.py (3)

105-107: LGTM: per-benchmark sandbox overrides field added. Good use of default_factory for an immutable default.

240-241: LGTM: propagate overrides into BenchmarkArgs. Correctly threads sandbox_env_overrides.

549-549: LGTM: job-level overrides carried in job tuple. Keeps the scheduling surface clean and localized per job.

nemo_skills/pipeline/eval.py (1)

345-346: LGTM: unpack job_sandbox_env_overrides. Tuple shape stays consistent with prepare_eval_commands.
```python
if sandbox_env_overrides:
    for override in sandbox_env_overrides:
        key, value = override.split("=", 1)
        sandbox_env_updates.setdefault(key, value)
current_env_vars = cluster_config.get("env_vars", []).copy()
```
Harden override parsing and make precedence explicit (skip malformed, protect ports).
- Validate KEY=VALUE format; skip and warn on bad entries.
- Keep LISTEN_PORT/NGINX_PORT owned by scheduler; don’t allow overrides.
- Assign overrides for other keys (not setdefault).
Apply this diff:
```diff
-    if sandbox_env_overrides:
-        for override in sandbox_env_overrides:
-            key, value = override.split("=", 1)
-            sandbox_env_updates.setdefault(key, value)
+    if sandbox_env_overrides:
+        for override in sandbox_env_overrides:
+            if "=" not in override:
+                LOG.warning("Invalid sandbox env override %r; expected KEY=VALUE. Skipping.", override)
+                continue
+            key, value = override.split("=", 1)
+            if key in ("LISTEN_PORT", "NGINX_PORT"):
+                LOG.debug("Ignoring override of reserved key %s", key)
+                continue
+            sandbox_env_updates[key] = value
```

Optional: simplify PYTHONPATH handling using normalized env vars (outside the selected lines):

```python
# Prefer normalized env vars over raw strings
env_dict = get_env_variables(cluster_config)
if env_dict.get("PYTHONPATH"):
    sandbox_env_updates["PYTHONPATH"] = f'{env_dict["PYTHONPATH"]}:/app'
```

🤖 Prompt for AI Agents
In nemo_skills/pipeline/utils/exp.py around lines 529-533, the override parsing
should validate entries, reject malformed KEY=VALUE pairs with a warning, and
prevent overrides of LISTEN_PORT and NGINX_PORT (these must remain owned by the
scheduler); for valid entries assign sandbox_env_updates[key] = value (do not
use setdefault) so overrides take precedence. Implement: for each override,
check it contains exactly one '=' and a non-empty key; if invalid, log a warning
and continue; if key in {"LISTEN_PORT","NGINX_PORT"} log a debug/warning and
skip; otherwise set sandbox_env_updates[key] = value. Optionally (outside these
lines) normalize env vars via get_env_variables(cluster_config) and, if
PYTHONPATH exists, set sandbox_env_updates["PYTHONPATH"] =
f'{env_dict["PYTHONPATH"]}:/app'.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- dockerfiles/sandbox/start-with-nginx.sh (1 hunks)
- nemo_skills/dataset/ioi25/__init__.py (1 hunks)
- nemo_skills/pipeline/utils/eval.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- nemo_skills/dataset/ioi25/__init__.py
- dockerfiles/sandbox/start-with-nginx.sh
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/pipeline/utils/eval.py
192-192: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
536-540: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (1)
nemo_skills/pipeline/utils/eval.py (1)
528-556: LGTM: per-job aggregation with conflict detection. The approach of storing sandbox environment overrides per-benchmark and aggregating them at the job level with first-wins conflict detection properly addresses the past review concern about leaking env vars between benchmarks. The ValueError on conflicting values aligns with the maintainer's preference to fail explicitly rather than silently override.
Based on past review comments.
```python
EVAL_ARGS = "++eval_type=ioi"

# environment variables required by this benchmark
ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]
```
```diff
-ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]
+SANDBOX_ENV_VARS = ["UWSGI_PROCESSES=1024", "UWSGI_CPU_AFFINITY=8", "UWSGI_CHEAPER=1023", "NUM_WORKERS=1"]
```
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/pipeline/utils/eval.py (1)
536-540: Consider defining a custom exception class. The static analysis tool suggests avoiding long error messages outside the exception class. While the current implementation is clear and functional, you could optionally refactor this into a custom exception class if you prefer to follow this style guideline.
Example refactor:

```python
class ConflictingSandboxEnvOverridesError(ValueError):
    def __init__(self, key: str, existing_value: str, existing_source: str, new_value: str, new_source: str):
        super().__init__(
            f"Conflicting sandbox environment overrides for key '{key}': "
            f"'{existing_value}' (from {existing_source}) vs '{new_value}' (from {new_source}). "
            "Please submit the benchmarks as separate jobs or increase num_jobs so they do not share a job."
        )
```

Then use:

```python
raise ConflictingSandboxEnvOverridesError(key, env_map[key], env_source[key], value, b)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/ioi24/__init__.py (1 hunks)
- nemo_skills/pipeline/utils/eval.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/ioi24/__init__.py
🧰 Additional context used
🪛 Ruff (0.13.3)
nemo_skills/pipeline/utils/eval.py
536-540: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (5)

nemo_skills/pipeline/utils/eval.py (5)

23-23: LGTM! The Dict import is necessary for the type hints used in the sandbox environment aggregation logic (lines 530-531).

106-107: LGTM! The sandbox_env_overrides field is well-documented and correctly uses default_factory=list for the mutable default, following dataclass best practices.

191-193: LGTM! The use of getattr with a default value of [] correctly prevents AttributeError when SANDBOX_ENV_VARS is not defined in the benchmark module. The conditional list conversion handles all cases appropriately.

241-241: LGTM! The sandbox_env_overrides parameter is correctly passed to the BenchmarkArgs constructor.

528-556: LGTM! The per-job sandbox environment aggregation logic is well-implemented:
- Correctly aggregates overrides from participating benchmarks in deterministic order
- Uses a first-wins strategy with conflict detection to prevent silent errors
- Properly handles values containing = using maxsplit=1
- Creates the job_sandbox_env_overrides list for the job batch

All past review concerns have been addressed:
- Env vars are no longer leaked between benchmarks (per-job aggregation)
- The error message formatting is correct (line 537 has the f-prefix)
… into ioi_interleaved
Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
- New Features
- Improvements
- Dataset
- Refactor