Revert "Port ICPC changes to IOI (#1046)" #1130
Conversation
This reverts commit 9985b2e.

Signed-off-by: George Armstrong <georgea@nvidia.com>
Force-pushed from ccbfdf1 to 6cfe8f2
📝 Walkthrough

This pull request refactors the IOI evaluation pipeline by consolidating sandbox setup into batch shell scripts.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Evaluator as IOI Evaluator
    participant Setup as _initialize_runtime
    participant Sandbox as Sandbox
    participant Metrics as IOI Metrics
    Evaluator->>Setup: Initialize runtime (lazy)
    Setup->>Sandbox: Start container
    Setup->>Sandbox: Execute setup_script<br/>(batch file creation<br/>& precompilation)
    Sandbox-->>Setup: Return paths & metadata
    Setup-->>Evaluator: Return (sandbox, metadata, pool)
    Evaluator->>Evaluator: Load samples from<br/>eval_cfg.input_file
    loop For each sample entry
        Evaluator->>Sandbox: run_test_case<br/>(batch setup via shell)<br/>+ compile solution<br/>+ run tests
        Sandbox-->>Evaluator: test_case_results
        Evaluator->>Metrics: Collect subtask scores
    end
    Evaluator->>Metrics: get_metrics
    Metrics->>Metrics: simulate_round_robin_score<br/>(select top 50 submissions)
    Metrics->>Metrics: get_problem_score<br/>(sum per-problem)
    Metrics-->>Evaluator: metrics_dict<br/>(total_score,<br/>round_robin_score)
    Evaluator->>Evaluator: Write eval_status<br/>to output
    Evaluator->>Sandbox: Close pool
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 4
🧹 Nitpick comments (1)
docs/evaluation/code.md (1)
226-229: Add language specifiers to code blocks.

The fenced code blocks showing evaluation results should specify a language for proper rendering.
🔎 Apply this diff to add language specifiers:
````diff
-```
+```text
 ------------------------------------------------------ ioi24 ------------------------------------------------------
 evaluation_mode   | num_entries | avg_tokens | gen_seconds | correct       | total_score | round_robin_score
 pass@1[avg-of-50] | 39          | 40387      | 7410        | 0.51% ± 1.04% | 303.47      | 261.01
 pass@50           | 39          | 40387      | 7410        | 2.56%         | 303.47      | 261.01
````

📜 Review details

**Configuration used**: Path: .coderabbit.yaml
**Review profile**: CHILL
**Plan**: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eed4f7e452b57f055bf82aa41d800356bde73369 and 6cfe8f2b696df0d8816b92f04362201a7ea553c0.

📒 Files selected for processing (6)

* `.github/workflows/tests.yml` (1 hunks)
* `docs/evaluation/code.md` (2 hunks)
* `nemo_skills/dataset/ioi24/prepare.py` (2 hunks)
* `nemo_skills/dataset/ioi25/__init__.py` (1 hunks)
* `nemo_skills/evaluation/evaluator/ioi.py` (7 hunks)
* `nemo_skills/evaluation/metrics/ioi_metrics.py` (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

* `nemo_skills/evaluation/evaluator/ioi.py` (2)
  * `nemo_skills/code_execution/sandbox.py` (1): `execute_code` (131-245)
  * `nemo_skills/evaluation/evaluator/base.py` (2): `BaseEvaluator` (34-91), `BaseEvaluatorConfig` (27-31)
* `nemo_skills/evaluation/metrics/ioi_metrics.py` (1)
  * `nemo_skills/evaluation/metrics/map_metrics.py` (1): `get_metrics` (74-103)

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

* 190-190: Fenced code blocks should have a language specified (MD040, fenced-code-language)
* 225-225: Fenced code blocks should have a language specified (MD040, fenced-code-language)

🪛 Ruff (0.14.8)

nemo_skills/evaluation/evaluator/ioi.py

* 91-91: Probable insecure usage of temporary file or directory: "/tmp/ioi_pre_" (S108)
* 121-121: Probable insecure usage of temporary file or directory: "/tmp/ioi_run_" (S108)
* 149-149: Abstract `raise` to an inner function (TRY301)
* 149-149: Create your own exception (TRY002)
* 149-149: Avoid specifying long messages outside the exception class (TRY003)
* 385-385: `zip()` without an explicit `strict=` parameter. Add explicit value for parameter `strict=` (B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

* GitHub Check: gpu-tests-qwen
* GitHub Check: unit-tests
* GitHub Check: pre-commit

🔇 Additional comments (11)

.github/workflows/tests.yml (1)

* `88-88`: **LGTM - Simplified sandbox container setup.** The removal of the host volume mount aligns with the updated sandbox initialization flow in the IOI evaluator, which now handles file setup internally via batch shell scripts.

nemo_skills/dataset/ioi24/prepare.py (2)

* `53-54`: **LGTM - Simplified filename convention.** Using `--split` directly for output filenames is more consistent with other benchmarks and reduces redundancy.
* `85-86`: **LGTM - Metadata filename follows the same pattern.** Consistent with the data file naming change above.

nemo_skills/dataset/ioi25/__init__.py (2)

* `15-17`: **Note the TODO comment.** The module indicates that IOI25 data files are still in progress. Ensure this is tracked appropriately.
* `19-31`: **LGTM - Well-structured configuration constants.** The constants provide a clean interface for IOI25 dataset configuration, including evaluation settings and required sandbox environment variables.

docs/evaluation/code.md (1)

* `188-192`: **LGTM - Documentation updated to match new naming convention.** The documentation correctly reflects the change from `ioi24.jsonl` to `test.jsonl` and the use of the `--split` parameter.

nemo_skills/evaluation/metrics/ioi_metrics.py (3)

* `48-76`: **LGTM - Round-robin scoring implementation looks correct.** The method correctly:
  1. Computes aggregate scores for each submission
  2. Selects the top 50 by aggregate score
  3. Computes max score per subtask among selected submissions
  4. Returns the sum of subtask maxima
* `78-90`: **LGTM - Metrics aggregation updated to include round-robin score.** The metrics now include both `total_score` and `round_robin_score`, aligning with the updated documentation.
* `20-22`: **The IOIMetrics constructor no longer accepts kwargs, but the metrics factory always passes them.** The constructor was recently changed from `__init__(self, **kwargs)` back to `__init__(self)` via a revert (commit 6cfe8f2). This intentional revert removed kwargs support that was temporarily added by a port from ICPC metrics. While the factory function at `get_metrics` always calls `metrics_cls(**kwargs)`, currently no metrics_kwargs are configured for the "ioi" metric type in dataset configs, so this incompatibility poses no runtime risk in the current state. However, if kwargs support is needed in the future, the constructor signature should be updated to handle them (either by accepting `**kwargs` like ICPC metrics does, or by extracting specific parameters).

nemo_skills/evaluation/evaluator/ioi.py (2)

* `91-116`: **LGTM - Consolidated precompilation setup.** The refactor to use batch shell scripts for file creation and compilation is cleaner and more efficient than individual file writes.
* `119-212`: **LGTM - Improved test execution with batch file operations.** The consolidated approach using heredoc syntax (`cat <<'_EOT_'`) for file creation is more efficient and reduces sandbox round-trips.

<!-- This is an auto-generated comment by CodeRabbit for review status -->
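The round-robin selection the review describes (aggregate per submission, keep the top 50, then take per-subtask maxima) can be sketched as below. The real `simulate_round_robin_score` in nemo_skills is not shown in this thread, so the data shapes (`subtask -> {"score": ...}`) and the function body are an illustrative reimplementation, not the actual code:

```python
def simulate_round_robin_score(submissions: list[dict], top_k: int = 50) -> float:
    """Score a problem from at most top_k submissions, chosen by aggregate score."""

    # 1. Aggregate score for each submission across all of its subtasks.
    def aggregate(sub: dict) -> float:
        return sum(r["score"] for r in sub["test_case_results"].values())

    # 2. Keep only the top_k submissions by aggregate score.
    selected = sorted(submissions, key=aggregate, reverse=True)[:top_k]

    # 3. Per-subtask maximum among the selected submissions.
    best: dict[str, float] = {}
    for sub in selected:
        for subtask, result in sub["test_case_results"].items():
            best[subtask] = max(best.get(subtask, 0.0), result["score"])

    # 4. The problem score is the sum of the subtask maxima.
    return sum(best.values())
```

Note that capping selection at `top_k` can lose points relative to an unrestricted maximum: a submission that is weak overall but best on one subtask may fall outside the top 50.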
```diff
         return {
             "name": entry["name"],
             "subtask": entry["subtask"],
             "test_case_results": test_case_results,
-            "input_case_results": input_outputs,
         }

-    async def eval_full(self, input_files):  # type: ignore[override]
-        for jsonl_file in unroll_files(input_files):
-            with open(jsonl_file, "r", encoding="utf-8") as f:
-                all_samples = [json.loads(line) for line in f]
+    async def eval_full(self):  # type: ignore[override]
+        jsonl_file = self.eval_cfg.input_file
+        with open(jsonl_file, "r", encoding="utf-8") as f:
+            all_samples = [json.loads(line) for line in f]

         tasks = [self._evaluate_entry(s) for s in all_samples]
         outputs = await asyncio.gather(*tasks)

         for s, o in zip(all_samples, outputs):
             s["test_case_results"] = o["test_case_results"]
-            s["input_case_results"] = o["input_case_results"]
+            s["eval_status"] = o["eval_status"]

         jdump(all_samples, jsonl_file, mode="wt")
```
Critical bug: eval_status is never set.
Line 387 assigns o["eval_status"] to s["eval_status"], but _evaluate_entry (lines 371-375) never returns an eval_status field. This will cause a KeyError at runtime.
🔎 Apply this diff to add eval_status to the return value:
```diff
 return {
     "name": entry["name"],
     "subtask": entry["subtask"],
     "test_case_results": test_case_results,
+    "eval_status": "completed",  # or determine based on test results
 }
```

🧰 Tools
🪛 Ruff (0.14.8)
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 371-390, _evaluate_entry
currently returns a dict without an eval_status key but eval_full expects
o["eval_status"], causing a KeyError; update _evaluate_entry to compute an
eval_status (for example, set to "passed" if all test_case_results indicate
success, otherwise "failed" or "partial") and include that value in the returned
dict so eval_full can assign s["eval_status"] safely.
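Following the prompt above, one way to collapse per-subtask results into a status is sketched below. The schema (`subtask -> {"score": ...}`) and the status strings are assumptions for illustration, not the repository's actual convention:

```python
def derive_eval_status(test_case_results: dict) -> str:
    """Collapse per-subtask results into a single status string (assumed schema)."""
    if not test_case_results:
        return "failed"  # nothing ran, treat as failure
    scores = [r["score"] for r in test_case_results.values()]
    if all(s > 0 for s in scores):
        return "passed"
    if any(s > 0 for s in scores):
        return "partial"
    return "failed"
```

`_evaluate_entry` could then include `"eval_status": derive_eval_status(test_case_results)` in its return dict, so `eval_full` never hits the `KeyError`.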
```python
    async def eval_full(self):  # type: ignore[override]
        jsonl_file = self.eval_cfg.input_file
        with open(jsonl_file, "r", encoding="utf-8") as f:
            all_samples = [json.loads(line) for line in f]
```
🧩 Analysis chain
🏁 Script executed:
# Find the ioi.py file and examine the eval_full method
fd -n "ioi.py" --type=f | head -5Repository: NVIDIA-NeMo/Skills
Length of output: 292
🏁 Script executed:
# Find the ioi.py file
fd "ioi.py" --type fRepository: NVIDIA-NeMo/Skills
Length of output: 101
🏁 Script executed:
# Examine the ioi.py file around the eval_full method
cat -n nemo_skills/evaluation/evaluator/ioi.py | sed -n '370,390p'Repository: NVIDIA-NeMo/Skills
Length of output: 905
🏁 Script executed:
# Find BaseEvaluatorConfig definition
fd "evaluator" --type d | grep -v __pycache__ | head -10Repository: NVIDIA-NeMo/Skills
Length of output: 122
🏁 Script executed:
# Search for BaseEvaluatorConfig definition
rg "class BaseEvaluatorConfig" --type py -A 20Repository: NVIDIA-NeMo/Skills
Length of output: 1706
🏁 Script executed:
# Check how IOIEvaluator is instantiated and used
rg "IOIEvaluator" --type py -B 2 -A 5Repository: NVIDIA-NeMo/Skills
Length of output: 3881
🏁 Script executed:
# Check where eval_full is called on IOIEvaluator
rg "eval_full" --type py -B 3 -A 3 | grep -A 5 -B 5 "ioi\|IOI"Repository: NVIDIA-NeMo/Skills
Length of output: 1244
🏁 Script executed:
# Search for how the evaluator is used in the main evaluation pipeline
rg "eval_full\(\)" --type py -B 5 -A 2Repository: NVIDIA-NeMo/Skills
Length of output: 1302
🏁 Script executed:
# Check the test to see how IOIEvaluator is instantiated
cat -n tests/test_code_execution.py | sed -n '1,50p'Repository: NVIDIA-NeMo/Skills
Length of output: 1938
🏁 Script executed:
# Find the IOIEvaluator test and see the full context
rg "IOIEvaluator" tests/ -B 5 -A 15Repository: NVIDIA-NeMo/Skills
Length of output: 1804
🏁 Script executed:
# Verify the entire eval_full method implementation
cat -n nemo_skills/evaluation/evaluator/ioi.py | sed -n '377,400p'Repository: NVIDIA-NeMo/Skills
Length of output: 940
Add None check for input_file in eval_full method.
The eval_full method loads from self.eval_cfg.input_file without verifying it's not None. Per BaseEvaluatorConfig, input_file can be None for interleaved evals, which would cause a runtime error when attempting to open the file.
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 377 to 380, add a None
check for self.eval_cfg.input_file before opening it: if input_file is None,
raise a clear ValueError (or handle the interleaved-eval path) with a
descriptive message indicating that eval_full requires an input_file and that
interleaved evaluations should use the appropriate method; otherwise proceed to
open and load the JSONL file as before. Ensure the check prevents attempting to
open None and provides actionable guidance in the error text.
```diff
-        for s, o in zip(all_samples, outputs):
-            s["test_case_results"] = o["test_case_results"]
-            s["input_case_results"] = o["input_case_results"]
+        for s, o in zip(all_samples, outputs):
```
🛠️ Refactor suggestion | 🟠 Major
Add strict=True to zip for safety.
Using zip() without strict=True can silently truncate if the iterables have different lengths, leading to subtle bugs.
🔎 Apply this diff:
```diff
-        for s, o in zip(all_samples, outputs):
+        for s, o in zip(all_samples, outputs, strict=True):
             s["test_case_results"] = o["test_case_results"]
             s["eval_status"] = o["eval_status"]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
        for s, o in zip(all_samples, outputs, strict=True):
            s["test_case_results"] = o["test_case_results"]
            s["eval_status"] = o["eval_status"]
```
🧰 Tools
🪛 Ruff (0.14.8)
385-385: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around line 385, the for-loop uses
zip(all_samples, outputs) which can silently truncate if the iterables differ in
length; update the call to zip(all_samples, outputs, strict=True) to enforce
equal lengths at runtime (or, if supporting Python <3.10, add an explicit length
check like assert len(all_samples) == len(outputs) before the loop) so
mismatched lengths raise an error instead of silently dropping items.
```diff
     def get_problem_score(self, submissions) -> float:
         """
         For a given problem (list of submissions), compute the score as follows:
         - For each subtask, take the maximum score over all submissions.
         - Sum these maximum scores to get the problem score.
         """
         if not submissions:
-            return 0.0, {}
+            return 0.0
         subtask_scores = {}

         for submission in submissions:
             for subtask, result in submission["test_case_results"].items():
                 subtask_scores[subtask] = max(subtask_scores.get(subtask, 0), result["score"])
         return sum(subtask_scores.values()), subtask_scores
```
Fix inconsistent return type.
Line 40 returns 0.0 for empty submissions, but the function signature indicates it should return a tuple (float, dict). This type inconsistency will cause runtime errors when callers try to unpack the result.
🔎 Apply this diff to fix the return type:
```diff
 def get_problem_score(self, submissions) -> float:
     """
     For a given problem (list of submissions), compute the score as follows:
     - For each subtask, take the maximum score over all submissions.
     - Sum these maximum scores to get the problem score.
     """
     if not submissions:
-        return 0.0
+        return 0.0, {}
     subtask_scores = {}
```

🤖 Prompt for AI Agents
nemo_skills/evaluation/metrics/ioi_metrics.py around lines 33 to 46: the
function currently returns 0.0 for empty submissions but normally returns
(float, dict), so make the return types consistent by returning a tuple for the
empty-case (0.0, {}) instead of 0.0, update the function annotation from ->
float to -> Tuple[float, Dict[str, float]] and add the necessary typing imports
(Tuple, Dict) at the top of the file.
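Putting the fix together, a self-contained sketch of the corrected function as a plain helper; the submission schema is inferred from the snippet under review and may differ from the real class:

```python
def get_problem_score(submissions: list[dict]) -> tuple[float, dict[str, float]]:
    """Sum of per-subtask maxima; always returns (score, subtask_scores)."""
    if not submissions:
        # Tuple in the empty case too, so callers can always unpack the result.
        return 0.0, {}
    subtask_scores: dict[str, float] = {}
    for submission in submissions:
        for subtask, result in submission["test_case_results"].items():
            subtask_scores[subtask] = max(subtask_scores.get(subtask, 0), result["score"])
    return sum(subtask_scores.values()), subtask_scores
```

With the consistent tuple return, `score, per_subtask = get_problem_score(subs)` works uniformly instead of raising on the empty-submissions path.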
Verified: the issue is that the excluded datasets weren't updated when the dataset name was changed in #1046.
Testing if GPU test failures were introduced in #1046
Summary by CodeRabbit

* New Features
* Documentation
* Refactor
* Chores