
Port ICPC changes to IOI #1046

Merged
SeanNaren merged 30 commits into main from feat/update_ioi
Dec 17, 2025

Conversation

@SeanNaren
Collaborator

@SeanNaren SeanNaren commented Nov 18, 2025

Updates IOI with fixes from ICPC.

Summary by CodeRabbit

  • New Features

    • IOI evaluation now supports per-input test cases with dedicated input files
    • Enhanced IOI metrics with submission clustering and detailed per-subtask scoring
  • Documentation

    • Updated IOI evaluation workflow with new command syntax and output formats
  • Bug Fixes

    • Improved error handling and diagnostics for IOI evaluation
  • Chores

    • CI workflow improvements for test infrastructure


Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@SeanNaren SeanNaren requested a review from mehrzads November 18, 2025 16:45
@coderabbitai
Contributor

coderabbitai bot commented Dec 4, 2025

📝 Walkthrough

Walkthrough

This pull request refactors the IOI evaluation system to support input-file-driven testing and local-based precompilation. Changes include: enabling shared runtime directory mounting in CI, updating IOI dataset preparation with suffix parameters, removing IOI25 dataset configuration, modifying evaluator workflows to use local file-based compilation, enhancing metrics with clustering and per-subtask scoring, and updating documentation accordingly.

Changes

Cohort / File(s) Summary
CI Configuration
.github/workflows/tests.yml
Adds preparation and mounting of shared runtime directory /nemo_run into the nemo-skills-sandbox-image container with permissions set to 777.
Documentation
docs/evaluation/code.md
Updates IOI evaluation documentation: dataset filenames changed from test.jsonl/test_metadata.json to ioi24.jsonl/ioi24_metadata.json; split flag updated to --split=ioi24; added --eval_subfolder parameter; results path structure changed from eval-results/ioi24/metrics.json to eval-results/ioi24/ioi/metrics.json; metrics table structure reorganized with per-subtask columns.
Dataset Preparation
nemo_skills/dataset/ioi/prepare.py
Adds --suffix CLI argument (default "24"); output filenames now use ioi{suffix}.jsonl and ioi{suffix}_metadata.json pattern instead of {split}.jsonl.
Dataset Configuration
nemo_skills/dataset/ioi25/__init__.py
Deletes file; removes exported configuration constants (GENERATION_ARGS, DATASET_GROUP, METRICS_TYPE, SANDBOX_ENV_VARS).
ICPC Evaluator
nemo_skills/evaluation/evaluator/icpc.py
Removes sha256_hex hashing of stdout in run_input_case; now preserves raw stdout output.
IOI Evaluator
nemo_skills/evaluation/evaluator/ioi.py
Major refactor introducing: input file support via new input_file config field; local file-based precompilation workflow replacing in-sandbox script assembly; shared filesystem directory (/nemo_run/) for precompiled artifacts; new run_input_case() function for per-input test execution; updated eval_full() signature to accept input_files; enhanced input validation and error logging; added imports (hashlib, shutil, unroll_files).
IOI Metrics
nemo_skills/evaluation/metrics/ioi_metrics.py
Adds clustering functionality: new extract_final_cpp_block() helper, extract_info(), get_clusters() methods; updates __init__() to accept kwargs and store cluster_folder; changes get_problem_score() return type to tuple; adds evaluations_to_print() method; introduces per_problem_subtask_scores data structure; updates get_metrics(), reset(), and print_problem_scores() to support clustering and per-subtask scoring.

Sequence Diagram

sequenceDiagram
    participant User
    participant IOIEvaluator
    participant LocalFS as Local FS<br/>(/nemo_run)
    participant Sandbox
    participant IOIMetrics

    User->>IOIEvaluator: eval_full(input_files)
    IOIEvaluator->>IOIEvaluator: Load input_data from file
    IOIEvaluator->>IOIEvaluator: Initialize runtime

    loop For each input_file
        IOIEvaluator->>IOIEvaluator: run_input_case()
        IOIEvaluator->>LocalFS: Create unique run directory
        IOIEvaluator->>LocalFS: Copy precompiled grader/artifacts
        IOIEvaluator->>LocalFS: Write contestant solution
        IOIEvaluator->>LocalFS: Write test inputs
        
        IOIEvaluator->>Sandbox: Execute compile.sh
        Sandbox-->>IOIEvaluator: Compilation result (compile_output, errors)
        
        IOIEvaluator->>Sandbox: Execute run.sh with inputs
        Sandbox-->>IOIEvaluator: Run result (stdout, stderr, exit code)
        
        IOIEvaluator->>IOIEvaluator: Aggregate into input_case_results
    end

    IOIEvaluator->>IOIMetrics: Pass submissions with results
    IOIMetrics->>IOIMetrics: extract_info() per submission
    IOIMetrics->>IOIMetrics: get_clusters() group by stdout
    IOIMetrics->>LocalFS: Write cluster JSONL
    IOIMetrics->>IOIMetrics: get_problem_score() per-subtask
    IOIMetrics-->>User: Return metrics & per-subtask scoring

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Key areas requiring focused attention:

  • IOI evaluator refactor (ioi.py): Dense logic in the new input-file handling, the local precompilation/execution workflow, and the integration between eval_full(), run_input_case(), and filesystem operations; verify correct artifact isolation and sandbox invocation patterns
  • IOI metrics clustering (ioi_metrics.py): New get_clusters() logic and per-subtask score aggregation; ensure correctness of grouping and max-score computation per subtask
  • Shared filesystem integration: Verify /nemo_run/ mount in CI workflow and its usage across evaluator for artifact persistence and sandbox execution
  • Input file validation and error handling: Check FileNotFoundError messages and input data loading logic
  • Backwards compatibility: Ensure signature changes to eval_full() and get_problem_score() don't break existing callers

Suggested reviewers

  • Kipok
  • activatedgeek

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Docstring Coverage ⚠️ Warning: Docstring coverage is 18.18%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
Description Check ✅ Passed: Check skipped - CodeRabbit's high-level summary is enabled.
Title Check ✅ Passed: The title accurately summarizes the main change: porting ICPC changes to IOI evaluation, which is reflected across the multiple IOI-related file modifications throughout the changeset.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/evaluation/code.md (1)

202-220: Fix IOI eval command formatting and add fenced languages

Two small issues here:

  1. The line with \ # set the folder... isn’t a valid shell line continuation if copy‑pasted.
  2. Both this command block and the metrics example block are missing language annotations (MD040).

Suggested adjustment:

-```
-ns eval \
+```bash
+ns eval \
@@
-    --output_dir=<OUTPUT_DIR> \
-    --eval_subfolder=eval-results/ioi24/ \ # set the folder if you want to differentiate subsets.
+    --output_dir=<OUTPUT_DIR> \
+    --eval_subfolder=eval-results/ioi24/ \
+    # set the folder if you want to differentiate subsets.
     --extra_eval_args="++eval_config.test_file=<PATH_TO_METADATA_TEST_DIR>/ioi24_metadata.json" \
@@
-```
+```

And for the metrics table:

-```
+```text
 ---------------------------------- ioi -----------------------------------
@@
-Sphinx's Riddle | 235               | 48          | 235.00
-```
+Sphinx's Riddle | 235               | 48          | 235.00
+```

This keeps the examples copy‑pasteable and satisfies the linter.

Also applies to: 224-236

nemo_skills/evaluation/evaluator/icpc.py (1)

456-460: eval_full assumes eval_status that _evaluate_entry doesn’t provide

Here you assign:

for s, o in zip(all_samples, outputs):
    s["test_case_results"] = o["test_case_results"]
    s["input_case_results"] = o["input_case_results"]
    s["eval_status"] = o["eval_status"]

But _evaluate_entry currently returns only name, test_case_results, and input_case_results, so o["eval_status"] will raise a KeyError as soon as eval_full is used.

If you don’t rely on eval_status downstream, a minimal fix is:

-            for s, o in zip(all_samples, outputs):
-                s["test_case_results"] = o["test_case_results"]
-                s["input_case_results"] = o["input_case_results"]
-                s["eval_status"] = o["eval_status"]
+            for s, o in zip(all_samples, outputs):
+                s["test_case_results"] = o["test_case_results"]
+                s["input_case_results"] = o["input_case_results"]
+                if "eval_status" in o:
+                    s["eval_status"] = o["eval_status"]

Alternatively, you could add an eval_status field to _evaluate_entry’s return and keep this assignment. As written, this is a runtime error path.

🧹 Nitpick comments (6)
nemo_skills/dataset/ioi/prepare.py (1)

30-31: Suffix-based IOI filenames are consistent; consider factoring the base name

The new --suffix argument and ioi{args.suffix}*.{jsonl,json} filenames line up with the IOI24 docs and downstream evaluator expectations.

If you expect to vary the suffix (e.g., IOI25), you might slightly simplify by factoring the base once:

-    args = parser.parse_args()
+    args = parser.parse_args()
+    split_name = f"ioi{args.suffix}"
...
-    with open(os.path.join(data_dir, f"ioi{args.suffix}.jsonl"), "w") as f:
+    with open(os.path.join(data_dir, f"{split_name}.jsonl"), "w") as f:
...
-    with open(os.path.join(data_dir, f"ioi{args.suffix}_metadata.json"), "w") as f:
+    with open(os.path.join(data_dir, f"{split_name}_metadata.json"), "w") as f:

Purely optional; current code is otherwise fine.

Also applies to: 54-87
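To make the naming pattern concrete, here is a minimal standalone sketch of the suffix-based filenames described above (the argument name --suffix matches the summary; everything else, including the data directory, is illustrative and not the actual prepare.py code):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Default "24" mirrors the new --suffix argument, so IOI 2024 is the default split.
parser.add_argument("--suffix", default="24")
args = parser.parse_args([])  # parse defaults only, for illustration

# Factor the base name once, as the suggestion above proposes.
split_name = f"ioi{args.suffix}"
data_file = os.path.join("data", f"{split_name}.jsonl")
metadata_file = os.path.join("data", f"{split_name}_metadata.json")
print(data_file, metadata_file)
```

Passing --suffix=25 would then yield ioi25.jsonl / ioi25_metadata.json without touching any other code.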

docs/evaluation/code.md (1)

188-192: Add a language to the IOI prepare_data code block

To satisfy markdownlint (MD040) and improve syntax highlighting, it’s better to tag this block as shell:

-```
-ns prepare_data ioi
-```
+```bash
+ns prepare_data ioi
+```

The surrounding text about generating ioi24.jsonl / ioi24_metadata.json is consistent with the new --suffix default in prepare.py.

nemo_skills/evaluation/metrics/ioi_metrics.py (3)

29-34: Preserve BaseMetrics kwargs and avoid unconditional printing in IOIMetrics.init

__init__(self, **kwargs) currently drops any arguments meant for BaseMetrics (e.g., max_k) because super().__init__() is called without them, and it always prints the cluster folder.

Consider something like:

-    def __init__(self, **kwargs):
-        super().__init__()
-        self.reset()
-        self.cluster_folder = kwargs.get("cluster_folder", None)
-        print(f"Cluster folder: {self.cluster_folder}")
+    def __init__(self, **kwargs):
+        self.cluster_folder = kwargs.pop("cluster_folder", None)
+        super().__init__(**kwargs)
+        self.reset()

and, if you still want visibility, log cluster_folder via your logging setup instead of print. This keeps the constructor compatible with any existing BaseMetrics options.


53-90: Clustering logic looks good; consider using a clearer identifier and extension for cluster files

The grouping by run_stdout plus per‑subtask max_score/max_score_solutions updates looks sound.

Two minor nits around the output:

  • id is taken from submission.get("id", id) on each iteration, so the final value is effectively the last non‑missing id. If problems can have multiple ids or some submissions lack it, the filename "{id}_cluster.jsonl" may be surprising or collide across problems.
  • You’re writing a single JSON object with json.dump, but the extension is .jsonl.

If useful, you could instead do:

-        clusters, id = self.get_clusters(submissions)
+        clusters, last_id = self.get_clusters(submissions)
@@
-                output_file = os.path.join(self.cluster_folder, f"{id}_cluster.jsonl")
+                safe_name = name.replace("/", "_")
+                output_file = os.path.join(self.cluster_folder, f"{safe_name}_{last_id}_clusters.json")

This is optional polish; current behavior is functionally fine.

Also applies to: 107-130
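As a rough illustration of the stdout-based grouping discussed here, a self-contained sketch (the run_stdout field name is taken from the walkthrough; the helper itself is hypothetical, not the actual get_clusters()):

```python
from collections import defaultdict

def group_by_stdout(submissions):
    # Submissions whose runs produced byte-identical stdout land in the same cluster.
    clusters = defaultdict(list)
    for sub in submissions:
        key = sub.get("run_stdout", "")
        clusters[key].append(sub)
    return dict(clusters)

subs = [
    {"id": 1, "run_stdout": "42\n"},
    {"id": 2, "run_stdout": "42\n"},
    {"id": 3, "run_stdout": "7\n"},
]
clusters = group_by_stdout(subs)
print(len(clusters))  # two distinct outputs, so two clusters
```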


92-106: Align get_problem_score signature/docs with its tuple return and consider total_score casting

get_problem_score now returns (score, subtasks) but is annotated and documented as returning just a float:

def get_problem_score(self, submissions) -> float:
    ...
    return sum(subtask_scores.values()), subtask_scores

Since callers rely on unpacking (score, subtasks), it’d be clearer to update both the return annotation and docstring accordingly, e.g.:

-    def get_problem_score(self, submissions) -> float:
+    def get_problem_score(self, submissions) -> tuple[float, dict]:
@@
-        For a given problem (list of submissions), compute the score as follows:
+        For a given problem (list of submissions), return:
+          - total score: sum of per‑subtask maxima across submissions
+          - per‑subtask score dict

Also, m["total_score"] = int(total_score) will truncate if any subtask scores are fractional. If IOI scoring guarantees integer totals this is fine; otherwise, you may want to preserve the float.

The new per_problem_subtask_scores and print_problem_scores wiring looks consistent with the evaluator output shape.

Also applies to: 135-156, 164-173
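The max-then-sum per-subtask scoring described above can be sketched as follows (subtask_scores is an assumed field name for illustration; this is not the actual get_problem_score()):

```python
def problem_score(submissions):
    # Total score = sum over subtasks of the best score any submission achieved.
    subtask_best = {}
    for sub in submissions:
        for subtask, score in sub.get("subtask_scores", {}).items():
            subtask_best[subtask] = max(subtask_best.get(subtask, 0), score)
    return sum(subtask_best.values()), subtask_best

subs = [
    {"subtask_scores": {"s1": 10, "s2": 0}},
    {"subtask_scores": {"s1": 5, "s2": 30}},
]
total, per_subtask = problem_score(subs)
print(total)  # best s1 (10) + best s2 (30) = 40
```

Note how neither submission alone scores 40; the tuple return exposes both the total and the per-subtask breakdown, which is why the annotation should say tuple.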

nemo_skills/evaluation/evaluator/ioi.py (1)

202-269: IOI input_runtime flow is solid; verify a few assumptions (unroll_files, input_file format, worker count)

Overall the input‑driven path looks good: run_input_case uses run_files to reconstruct the toolchain, you load input_file once in _initialize_runtime, and eval_full leverages unroll_files to handle globs.

A few details worth double‑checking / possibly tightening:

  • run_files expectations (lines 202‑247): run_input_case assumes that run_files contains scripts whose basenames are compile and run, and then hard‑codes ./compile and ./run < input.txt. If future metadata ships scripts named compile.sh / run.sh or in subdirectories, this will break. If there’s variability here, consider:

    • using the exact filenames from run_files instead of hard‑coded compile/run, or
    • normalizing them at data‑prep time.
  • Exception handling in run_input_case (lines 262‑269): catching Exception and returning only strings is convenient but can make debugging harder. Even a brief print or structured log in the except block would help trace failures without changing the external schema.

  • Pool sizing vs. config (lines 347‑350): multiprocessing.Pool uses processes=self.eval_cfg.test_batch_size; the num_workers config field is currently unused. If you intend num_workers to control process count independently of batch size, you may want:

    processes = self.eval_cfg.num_workers or self.eval_cfg.test_batch_size
    pool_local = multiprocessing.Pool(processes=processes, initializer=init_worker)
  • unroll_files import (lines 27‑28, 476‑489): here unroll_files is imported from nemo_skills.utils, while the provided snippet shows its definition in nemo_skills/file_utils.py. Please confirm that nemo_skills.utils re‑exports unroll_files; otherwise this import will fail at runtime and should be switched to from nemo_skills.file_utils import jdump, unroll_files.

  • input_file structure (lines 339‑347, 445‑467): you index self.inputdata with str(entry["id"]). That’s fine as long as the input JSON is keyed by stringified dataset ids; worth ensuring the generator for that file matches this convention.

None of these are blockers, but they’re good to validate before relying on this path heavily.

Also applies to: 318-355, 476-489, 27-28

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df9fbd9 and 90f6044.

📒 Files selected for processing (7)
  • .github/workflows/tests.yml (1 hunks)
  • docs/evaluation/code.md (2 hunks)
  • nemo_skills/dataset/ioi/prepare.py (3 hunks)
  • nemo_skills/dataset/ioi25/__init__.py (0 hunks)
  • nemo_skills/evaluation/evaluator/icpc.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ioi.py (8 hunks)
  • nemo_skills/evaluation/metrics/ioi_metrics.py (2 hunks)
💤 Files with no reviewable changes (1)
  • nemo_skills/dataset/ioi25/__init__.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/ioi.py (2)
nemo_skills/file_utils.py (1)
  • unroll_files (21-32)
nemo_skills/utils.py (1)
  • nested_dataclass (69-102)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

190-190: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


226-226: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.7)
nemo_skills/evaluation/evaluator/ioi.py

114-114: os.chmod setting a permissive mask 0o755 on file or directory

(S103)


119-119: os.chmod setting a permissive mask 0o755 on file or directory

(S103)


198-199: try-except-pass detected, consider logging the exception

(S110)


198-198: Do not catch blind exception: Exception

(BLE001)


216-216: os.chmod setting a permissive mask 0o755 on file or directory

(S103)


260-260: Consider moving this statement to an else block

(TRY300)


262-262: Do not catch blind exception: Exception

(BLE001)


341-344: Avoid specifying long messages outside the exception class

(TRY003)


462-462: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


484-484: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (3)
.github/workflows/tests.yml (1)

88-91: CI /nemo_run setup and mount look appropriate

Creating /nemo_run and mounting it into nemo-skills-sandbox-image matches the new IOI/ICPC evaluators’ use of that shared filesystem. In the ephemeral GitHub runner context, chmod 777 here is acceptable and there’s no additional cleanup needed.

nemo_skills/evaluation/evaluator/icpc.py (1)

239-251: Using raw run_stdout for input cases is reasonable

Switching run_stdout from a hash to the raw stdout improves debuggability and supports clustering/grouping on actual outputs. With max_output_characters=1_000_000 the per‑run size is bounded; just be aware that JSONL files for large input sets will grow accordingly, which seems acceptable given the use case.
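A small standalone illustration of the trade-off (sha256 is used here only for comparison; this is not the removed icpc.py code):

```python
import hashlib

outputs = ["42\n", "42\n", "7\n"]

# Hashing still lets you group identical outputs, but hides their content:
hashed = {hashlib.sha256(o.encode()).hexdigest() for o in outputs}

# Raw stdout groups the same way while remaining inspectable for debugging:
raw = set(outputs)

print(len(hashed), len(raw))  # both group the three runs into 2 clusters
```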

nemo_skills/evaluation/evaluator/ioi.py (1)

87-124: /nemo_run‑based precompile and run directories for IOI look consistent

The new _precompile_grader and run_test_case paths that write under /nemo_run/ioi_pre_* and /nemo_run/ioi_run_* align with the CI changes that mount /nemo_run into the sandbox container. Using os.getpid() and time.time_ns() in the directory names should avoid collisions across workers, and copying precompiled assets per run keeps the checker/grader reuse clear.

No blocking issues here; the structure mirrors the ICPC workflow in a reasonable way.

Also applies to: 127-135, 129-135
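The collision-avoidance naming scheme can be sketched as follows (the /tmp prefix and helper name are illustrative; the actual evaluator writes under /nemo_run/ioi_run_*):

```python
import os
import time

def unique_run_dir(prefix="/tmp/ioi_run"):
    # PID plus a nanosecond timestamp keeps directory names unique across
    # concurrent worker processes and across repeated calls in one process.
    return f"{prefix}_{os.getpid()}_{time.time_ns()}"

a = unique_run_dir()
time.sleep(0.001)  # make sure the timestamp advances between calls
b = unique_run_dir()
print(a != b)
```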

@SeanNaren SeanNaren enabled auto-merge (squash) December 17, 2025 10:45
@SeanNaren SeanNaren merged commit 9985b2e into main Dec 17, 2025
5 checks passed
@SeanNaren SeanNaren deleted the feat/update_ioi branch December 17, 2025 11:00
gwarmstrong added a commit that referenced this pull request Dec 18, 2025
gwarmstrong added a commit that referenced this pull request Dec 18, 2025
This reverts commit 9985b2e.

Signed-off-by: George Armstrong <georgea@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
wasiahmad pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Signed-off-by: dlord <dlord@nvidia.com>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Feb 4, 2026
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
