
Revert "Port ICPC changes to IOI (#1046)"#1130

Closed
gwarmstrong wants to merge 1 commit into main from georgea/revert-ioi-update

Conversation

@gwarmstrong
Collaborator

@gwarmstrong gwarmstrong commented Dec 18, 2025

Testing if GPU test failures were introduced in #1046

Summary by CodeRabbit

  • New Features

    • Added round_robin_score metric for comprehensive evaluation result reporting.
  • Documentation

    • Updated evaluation documentation with revised data paths, result locations, and expanded metrics display.
  • Refactor

    • Simplified data preparation workflow and streamlined the evaluation pipeline for improved efficiency.
  • Chores

    • Optimized CI/CD Docker container configuration.


This reverts commit 9985b2e.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 18, 2025

📝 Walkthrough

This pull request refactors the IOI evaluation pipeline by consolidating sandbox setup into batch shell scripts, removing the input_file configuration from the evaluator, simplifying dataset naming conventions to use split parameter instead of suffix, introducing IOI25 module constants, and overhauling metrics calculation to include round-robin scoring while removing clustering functionality.

Changes

  • **Test Workflow Simplification** (`.github/workflows/tests.yml`): Removed setup of the shared runtime directory and host volume mounting; simplified the Docker run command to use the generic nemo-skills-sandbox-image without local directory mounting.
  • **Documentation Updates** (`docs/evaluation/code.md`): Updated data preparation examples and result paths from ioi24-specific naming to the generic test split; revised metadata references and evaluation results paths; updated the metrics table with new headers including round_robin_score.
  • **Dataset Preparation** (`nemo_skills/dataset/ioi24/prepare.py`): Removed the --suffix CLI argument; changed output filenames from ioi{suffix}.jsonl to {split}.jsonl and metadata from ioi{suffix}_metadata.json to {split}_metadata.json for consistency with split-based naming.
  • **IOI25 Configuration Module** (`nemo_skills/dataset/ioi25/__init__.py`): Introduced a new module with evaluation configuration constants: GENERATION_ARGS, DATASET_GROUP, METRICS_TYPE, and SANDBOX_ENV_VARS.
  • **IOI Evaluator Refactoring** (`nemo_skills/evaluation/evaluator/ioi.py`): Removed the input_file configuration; consolidated sandbox setup and file creation into batch shell scripts executed in-sandbox; replaced per-file local writes with batch precompilation; simplified the precompiled cache structure; eliminated the run_input_case method; updated eval_full to load input via eval_cfg.input_file and use lazy runtime initialization.
  • **IOI Metrics Overhaul** (`nemo_skills/evaluation/metrics/ioi_metrics.py`): Removed clustering logic and the extract_final_cpp_block, extract_info, and get_clusters methods; simplified the constructor; added a simulate_round_robin_score method; modified get_problem_score to return tuples; updated get_metrics to compute and return total_score and round_robin_score; restructured internal state management.

Sequence Diagram(s)

sequenceDiagram
    participant Evaluator as IOI Evaluator
    participant Setup as _initialize_runtime
    participant Sandbox as Sandbox
    participant Metrics as IOI Metrics
    
    Evaluator->>Setup: Initialize runtime (lazy)
    Setup->>Sandbox: Start container
    Setup->>Sandbox: Execute setup_script<br/>(batch file creation<br/>& precompilation)
    Sandbox-->>Setup: Return paths & metadata
    Setup-->>Evaluator: Return (sandbox, metadata, pool)
    
    Evaluator->>Evaluator: Load samples from<br/>eval_cfg.input_file
    
    loop For each sample entry
        Evaluator->>Sandbox: run_test_case<br/>(batch setup via shell)<br/>+ compile solution<br/>+ run tests
        Sandbox-->>Evaluator: test_case_results
        Evaluator->>Metrics: Collect subtask scores
    end
    
    Evaluator->>Metrics: get_metrics
    Metrics->>Metrics: simulate_round_robin_score<br/>(select top 50 submissions)
    Metrics->>Metrics: get_problem_score<br/>(sum per-problem)
    Metrics-->>Evaluator: metrics_dict<br/>(total_score,<br/>round_robin_score)
    
    Evaluator->>Evaluator: Write eval_status<br/>to output
    Evaluator->>Sandbox: Close pool

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • nemo_skills/evaluation/evaluator/ioi.py: Substantial architectural refactoring with consolidated sandbox setup, removal of input_file parameter flow, and simplified precompiled cache handling—requires careful verification of sandbox interaction changes and lazy runtime initialization logic.
  • nemo_skills/evaluation/metrics/ioi_metrics.py: Significant metrics calculation overhaul including removal of clustering logic and introduction of round-robin scoring algorithm—verify correctness of score aggregation and ensure backward compatibility considerations.
  • Integration: Verify interaction between evaluator's simplified setup flow, metrics' new scoring, and test workflow changes, particularly around data loading and result reporting paths.

Suggested reviewers

  • mehrzads

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.25%, below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: reverting a previous commit (PR #1046) that ported ICPC changes to IOI. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
docs/evaluation/code.md (1)

226-229: Add language specifiers to code blocks.

The fenced code blocks showing evaluation results should specify a language for proper rendering.

🔎 Apply this diff to add language specifiers:
````diff
-```
+```text
 ------------------------------------------------------ ioi24 ------------------------------------------------------
 evaluation_mode   | num_entries | avg_tokens | gen_seconds | correct       | total_score        | round_robin_score
 pass@1[avg-of-50] | 39          | 40387      | 7410        | 0.51% ± 1.04% | 303.47             | 261.01
 pass@50           | 39          | 40387      | 7410        | 2.56%         | 303.47             | 261.01
````


<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between eed4f7e452b57f055bf82aa41d800356bde73369 and 6cfe8f2b696df0d8816b92f04362201a7ea553c0.

</details>

<details>
<summary>📒 Files selected for processing (6)</summary>

* `.github/workflows/tests.yml` (1 hunks)
* `docs/evaluation/code.md` (2 hunks)
* `nemo_skills/dataset/ioi24/prepare.py` (2 hunks)
* `nemo_skills/dataset/ioi25/__init__.py` (1 hunks)
* `nemo_skills/evaluation/evaluator/ioi.py` (7 hunks)
* `nemo_skills/evaluation/metrics/ioi_metrics.py` (2 hunks)

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧬 Code graph analysis (2)</summary>

<details>
<summary>nemo_skills/evaluation/evaluator/ioi.py (2)</summary><blockquote>

<details>
<summary>nemo_skills/code_execution/sandbox.py (1)</summary>

* `execute_code` (131-245)

</details>
<details>
<summary>nemo_skills/evaluation/evaluator/base.py (2)</summary>

* `BaseEvaluator` (34-91)
* `BaseEvaluatorConfig` (27-31)

</details>

</blockquote></details>
<details>
<summary>nemo_skills/evaluation/metrics/ioi_metrics.py (1)</summary><blockquote>

<details>
<summary>nemo_skills/evaluation/metrics/map_metrics.py (1)</summary>

* `get_metrics` (74-103)

</details>

</blockquote></details>

</details><details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>docs/evaluation/code.md</summary>

190-190: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

225-225: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>
<details>
<summary>🪛 Ruff (0.14.8)</summary>

<details>
<summary>nemo_skills/evaluation/evaluator/ioi.py</summary>

91-91: Probable insecure usage of temporary file or directory: "/tmp/ioi_pre_"

(S108)

---

121-121: Probable insecure usage of temporary file or directory: "/tmp/ioi_run_"

(S108)

---

149-149: Abstract `raise` to an inner function

(TRY301)

---

149-149: Create your own exception

(TRY002)

---

149-149: Avoid specifying long messages outside the exception class

(TRY003)

---

385-385: `zip()` without an explicit `strict=` parameter

Add explicit value for parameter `strict=`

(B905)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)</summary>

* GitHub Check: gpu-tests-qwen
* GitHub Check: unit-tests
* GitHub Check: pre-commit

</details>

<details>
<summary>🔇 Additional comments (11)</summary><blockquote>

<details>
<summary>.github/workflows/tests.yml (1)</summary><blockquote>

`88-88`: **LGTM - Simplified sandbox container setup.**

The removal of the host volume mount aligns with the updated sandbox initialization flow in the IOI evaluator, which now handles file setup internally via batch shell scripts.

</blockquote></details>
<details>
<summary>nemo_skills/dataset/ioi24/prepare.py (2)</summary><blockquote>

`53-54`: **LGTM - Simplified filename convention.**

Using `--split` directly for output filenames is more consistent with other benchmarks and reduces redundancy.

---

`85-86`: **LGTM - Metadata filename follows the same pattern.**

Consistent with the data file naming change above.

</blockquote></details>
<details>
<summary>nemo_skills/dataset/ioi25/__init__.py (2)</summary><blockquote>

`15-17`: **Note the TODO comment.**

The module indicates that IOI25 data files are still in progress. Ensure this is tracked appropriately.

---

`19-31`: **LGTM - Well-structured configuration constants.**

The constants provide a clean interface for IOI25 dataset configuration, including evaluation settings and required sandbox environment variables.

</blockquote></details>
<details>
<summary>docs/evaluation/code.md (1)</summary><blockquote>

`188-192`: **LGTM - Documentation updated to match new naming convention.**

The documentation correctly reflects the change from `ioi24.jsonl` to `test.jsonl` and the use of the `--split` parameter.

</blockquote></details>
<details>
<summary>nemo_skills/evaluation/metrics/ioi_metrics.py (3)</summary><blockquote>

`48-76`: **LGTM - Round-robin scoring implementation looks correct.**

The method correctly:
1. Computes aggregate scores for each submission
2. Selects the top 50 by aggregate score
3. Computes max score per subtask among selected submissions
4. Returns the sum of subtask maxima
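The four steps above can be sketched as follows; the function name matches the PR's `simulate_round_robin_score`, but the field names and exact implementation here are an assumption based on the snippets in this review:

```python
def simulate_round_robin_score(submissions, limit=50):
    """Sketch of round-robin scoring: keep the top `limit` submissions by
    aggregate score, then sum the per-subtask maxima over that subset."""

    def aggregate(sub):
        # Aggregate score of one submission: sum of its subtask scores.
        return sum(r["score"] for r in sub["test_case_results"].values())

    selected = sorted(submissions, key=aggregate, reverse=True)[:limit]
    best_per_subtask = {}
    for sub in selected:
        for subtask, result in sub["test_case_results"].items():
            if result["score"] > best_per_subtask.get(subtask, 0):
                best_per_subtask[subtask] = result["score"]
    return sum(best_per_subtask.values())
```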

---

`78-90`: **LGTM - Metrics aggregation updated to include round-robin score.**

The metrics now include both `total_score` and `round_robin_score`, aligning with the updated documentation.

---

`20-22`: **The IOIMetrics constructor no longer accepts kwargs, but the metrics factory always passes them.**

The constructor was recently changed from `__init__(self, **kwargs)` back to `__init__(self)` via a revert (commit 6cfe8f2). This intentional revert removed kwargs support that was temporarily added by a port from ICPC metrics. While the factory function at `get_metrics` always calls `metrics_cls(**kwargs)`, currently no metrics_kwargs are configured for the "ioi" metric type in dataset configs, so this incompatibility poses no runtime risk in the current state. However, if kwargs support is needed in the future, the constructor signature should be updated to handle them (either by accepting `**kwargs` like ICPC metrics does, or by extracting specific parameters).

</blockquote></details>
<details>
<summary>nemo_skills/evaluation/evaluator/ioi.py (2)</summary><blockquote>

`91-116`: **LGTM - Consolidated precompilation setup.**

The refactor to use batch shell scripts for file creation and compilation is cleaner and more efficient than individual file writes.

---

`119-212`: **LGTM - Improved test execution with batch file operations.**

The consolidated approach using heredoc syntax (`cat <<'_EOT_'`) for file creation is more efficient and reduces sandbox round-trips.
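A minimal illustration of the heredoc pattern (not the evaluator's actual script; the paths are made up): the quoted delimiter keeps `$` and backticks in the solution source from being expanded, so a file can be written verbatim in a single batched shell call.

```shell
workdir=$(mktemp -d)
# Quoted delimiter ('_EOT_') disables parameter expansion, so the
# solution source passes through untouched.
cat <<'_EOT_' > "$workdir/solution.cpp"
#include <cstdio>
int main() { printf("$HOME stays literal\n"); return 0; }
_EOT_
wc -l < "$workdir/solution.cpp"
```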

</blockquote></details>

</blockquote></details>

</details>


Comment on lines 371 to 390
```diff
         return {
             "name": entry["name"],
             "subtask": entry["subtask"],
             "test_case_results": test_case_results,
-            "input_case_results": input_outputs,
         }

-    async def eval_full(self, input_files):  # type: ignore[override]
-        for jsonl_file in unroll_files(input_files):
-            with open(jsonl_file, "r", encoding="utf-8") as f:
-                all_samples = [json.loads(line) for line in f]
+    async def eval_full(self):  # type: ignore[override]
+        jsonl_file = self.eval_cfg.input_file
+        with open(jsonl_file, "r", encoding="utf-8") as f:
+            all_samples = [json.loads(line) for line in f]

-            tasks = [self._evaluate_entry(s) for s in all_samples]
-            outputs = await asyncio.gather(*tasks)
+        tasks = [self._evaluate_entry(s) for s in all_samples]
+        outputs = await asyncio.gather(*tasks)

-            for s, o in zip(all_samples, outputs):
-                s["test_case_results"] = o["test_case_results"]
-                s["input_case_results"] = o["input_case_results"]
+        for s, o in zip(all_samples, outputs):
+            s["test_case_results"] = o["test_case_results"]
+            s["eval_status"] = o["eval_status"]

-            jdump(all_samples, jsonl_file, mode="wt")
+        jdump(all_samples, jsonl_file, mode="wt")
```


⚠️ Potential issue | 🔴 Critical

Critical bug: eval_status is never set.

Line 387 assigns o["eval_status"] to s["eval_status"], but _evaluate_entry (lines 371-375) never returns an eval_status field. This will cause a KeyError at runtime.

🔎 Apply this diff to add eval_status to the return value:
```diff
     return {
         "name": entry["name"],
         "subtask": entry["subtask"],
         "test_case_results": test_case_results,
+        "eval_status": "completed",  # or determine based on test results
     }
```
🧰 Tools
🪛 Ruff (0.14.8)

385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 371-390, _evaluate_entry
currently returns a dict without an eval_status key but eval_full expects
o["eval_status"], causing a KeyError; update _evaluate_entry to compute an
eval_status (for example, set to "passed" if all test_case_results indicate
success, otherwise "failed" or "partial") and include that value in the returned
dict so eval_full can assign s["eval_status"] safely.
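One hedged way to derive the missing field, in the spirit of the suggestion above (the status labels and the "score > 0" threshold are illustrative choices, not taken from the PR):

```python
def derive_eval_status(test_case_results):
    # Illustrative mapping from per-subtask results to a status string.
    # Assumes each value is a dict with a numeric "score", as in the
    # snippets shown in this review.
    if not test_case_results:
        return "failed"
    scores = [r["score"] for r in test_case_results.values()]
    if all(s > 0 for s in scores):
        return "passed"
    if any(s > 0 for s in scores):
        return "partial"
    return "failed"
```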

Comment on lines +377 to +380
```python
    async def eval_full(self):  # type: ignore[override]
        jsonl_file = self.eval_cfg.input_file
        with open(jsonl_file, "r", encoding="utf-8") as f:
            all_samples = [json.loads(line) for line in f]
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```shell
# Find the ioi.py file and examine the eval_full method
fd -n "ioi.py" --type=f | head -5
```

Repository: NVIDIA-NeMo/Skills
Length of output: 292

🏁 Script executed:

```shell
# Find the ioi.py file
fd "ioi.py" --type f
```

Length of output: 101

🏁 Script executed:

```shell
# Examine the ioi.py file around the eval_full method
cat -n nemo_skills/evaluation/evaluator/ioi.py | sed -n '370,390p'
```

Length of output: 905

🏁 Script executed:

```shell
# Find BaseEvaluatorConfig definition
fd "evaluator" --type d | grep -v __pycache__ | head -10
```

Length of output: 122

🏁 Script executed:

```shell
# Search for BaseEvaluatorConfig definition
rg "class BaseEvaluatorConfig" --type py -A 20
```

Length of output: 1706

🏁 Script executed:

```shell
# Check how IOIEvaluator is instantiated and used
rg "IOIEvaluator" --type py -B 2 -A 5
```

Length of output: 3881

🏁 Script executed:

```shell
# Check where eval_full is called on IOIEvaluator
rg "eval_full" --type py -B 3 -A 3 | grep -A 5 -B 5 "ioi\|IOI"
```

Length of output: 1244

🏁 Script executed:

```shell
# Search for how the evaluator is used in the main evaluation pipeline
rg "eval_full\(\)" --type py -B 5 -A 2
```

Length of output: 1302

🏁 Script executed:

```shell
# Check the test to see how IOIEvaluator is instantiated
cat -n tests/test_code_execution.py | sed -n '1,50p'
```

Length of output: 1938

🏁 Script executed:

```shell
# Find the IOIEvaluator test and see the full context
rg "IOIEvaluator" tests/ -B 5 -A 15
```

Length of output: 1804

🏁 Script executed:

```shell
# Verify the entire eval_full method implementation
cat -n nemo_skills/evaluation/evaluator/ioi.py | sed -n '377,400p'
```

Length of output: 940


Add None check for input_file in eval_full method.

The eval_full method loads from self.eval_cfg.input_file without verifying it's not None. Per BaseEvaluatorConfig, input_file can be None for interleaved evals, which would cause a runtime error when attempting to open the file.

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around lines 377 to 380, add a None
check for self.eval_cfg.input_file before opening it: if input_file is None,
raise a clear ValueError (or handle the interleaved-eval path) with a
descriptive message indicating that eval_full requires an input_file and that
interleaved evaluations should use the appropriate method; otherwise proceed to
open and load the JSONL file as before. Ensure the check prevents attempting to
open None and provides actionable guidance in the error text.

```diff
-            for s, o in zip(all_samples, outputs):
-                s["test_case_results"] = o["test_case_results"]
-                s["input_case_results"] = o["input_case_results"]
+        for s, o in zip(all_samples, outputs):
```

🛠️ Refactor suggestion | 🟠 Major

Add strict=True to zip for safety.

Using zip() without strict=True can silently truncate if the iterables have different lengths, leading to subtle bugs.

🔎 Apply this diff:

```diff
-        for s, o in zip(all_samples, outputs):
+        for s, o in zip(all_samples, outputs, strict=True):
             s["test_case_results"] = o["test_case_results"]
             s["eval_status"] = o["eval_status"]
```
🧰 Tools
🪛 Ruff (0.14.8)

385-385: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In nemo_skills/evaluation/evaluator/ioi.py around line 385, the for-loop uses
zip(all_samples, outputs) which can silently truncate if the iterables differ in
length; update the call to zip(all_samples, outputs, strict=True) to enforce
equal lengths at runtime (or, if supporting Python <3.10, add an explicit length
check like assert len(all_samples) == len(outputs) before the loop) so
mismatched lengths raise an error instead of silently dropping items.

Comment on lines 33 to 46
```diff
     def get_problem_score(self, submissions) -> float:
         """
         For a given problem (list of submissions), compute the score as follows:
           - For each subtask, take the maximum score over all submissions.
           - Sum these maximum scores to get the problem score.
         """
         if not submissions:
-            return 0.0, {}
+            return 0.0
         subtask_scores = {}

         for submission in submissions:
             for subtask, result in submission["test_case_results"].items():
                 subtask_scores[subtask] = max(subtask_scores.get(subtask, 0), result["score"])
         return sum(subtask_scores.values()), subtask_scores
```

⚠️ Potential issue | 🔴 Critical

Fix inconsistent return type.

Line 40 returns 0.0 for empty submissions, but the function signature indicates it should return a tuple (float, dict). This type inconsistency will cause runtime errors when callers try to unpack the result.

🔎 Apply this diff to fix the return type:

```diff
     def get_problem_score(self, submissions) -> float:
         """
         For a given problem (list of submissions), compute the score as follows:
           - For each subtask, take the maximum score over all submissions.
           - Sum these maximum scores to get the problem score.
         """
         if not submissions:
-            return 0.0
+            return 0.0, {}
         subtask_scores = {}
```
🤖 Prompt for AI Agents
nemo_skills/evaluation/metrics/ioi_metrics.py around lines 33 to 46: the
function currently returns 0.0 for empty submissions but normally returns
(float, dict), so make the return types consistent by returning a tuple for the
empty-case (0.0, {}) instead of 0.0, update the function annotation from ->
float to -> Tuple[float, Dict[str, float]] and add the necessary typing imports
(Tuple, Dict) at the top of the file.
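Why the mismatch bites at the call site: unpacking a bare float raises a TypeError, while the tuple form unpacks cleanly. A standalone illustration (not the PR's code; the score values are made up):

```python
def get_problem_score_buggy(submissions):
    # Mirrors the inconsistency flagged above: a bare float on the empty
    # path, a (total, per_subtask) tuple otherwise.
    if not submissions:
        return 0.0
    return 42.0, {"subtask1": 42.0}


total, per_subtask = get_problem_score_buggy([{"dummy": True}])  # unpacks fine

try:
    # "cannot unpack non-iterable float object"
    empty_total, empty_per_subtask = get_problem_score_buggy([])
    unpack_failed = False
except TypeError:
    unpack_failed = True
```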

@gwarmstrong
Collaborator Author

Verified: the issue is that the excluded datasets weren't updated when the dataset name was changed in #1046.
