
Evaluation on OJBench #848

Merged
Kipok merged 60 commits into main from feat/ojbench on Oct 1, 2025

Conversation

@wasiahmad (Collaborator) commented Sep 25, 2025

In this PR, we are adding evaluation support for OJBench through nemo-skills. Primary changes are:

  • Dataset download through nemo-skills/dataset/ojbench
  • Evaluation logic implemented at nemo_skills/evaluation/evaluator/code.py

Summary by CodeRabbit

  • New Features

    • OJBench support: evaluator with async sandboxed runner, new OJBench metrics, and registration.
    • Dataset preparation tool to fetch/transform prompts into language-specific test sets.
    • Global and per-job CLI options to preserve sandbox mounts, propagated across pipeline and task creation.
  • Bug Fixes / Defaults

    • Default evaluation split changed to use the Python test split.
  • Chores

    • Sandbox base image updated with PyPy tooling and a preinstalled judge server.
  • Documentation

    • Added OJBench docs with setup and sample run guidance.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
coderabbitai bot (Contributor) commented Sep 25, 2025

Walkthrough

Adds OJBench dataset support (prep, evaluator, metrics, docs), updates default eval split, installs PyPy3 and judge-server in the sandbox Dockerfile, and threads a new keep_mounts_for_sandbox flag through pipeline CLI, task creation, and data-prep tooling.
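The evaluator and metrics registration mentioned in this walkthrough can be sketched as follows. This is a minimal illustration of the registry pattern only, not the actual nemo-skills code; the stub bodies and signatures of `eval_ojbench` and `OJBenchMetrics` are assumptions based on the summary.

```python
# Minimal sketch of the registry pattern used to expose a new benchmark.
# The map names (EVALUATOR_MAP, METRICS_MAP) and the "ojbench" key follow
# the PR description; the stub implementations are illustrative.

def eval_ojbench(cfg):
    """Placeholder evaluator entry point (the real one runs in a sandbox)."""
    return {"benchmark": "ojbench", "config": cfg}

class OJBenchMetrics:
    """Placeholder metrics class keyed on prediction['is_passed']."""

EVALUATOR_MAP = {"ojbench": eval_ojbench}
METRICS_MAP = {"ojbench": OJBenchMetrics}

# Dispatch by benchmark name, as the pipeline presumably does:
evaluator = EVALUATOR_MAP["ojbench"]
result = evaluator({"split": "test_python"})
```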

Changes

Cohort / File(s) Summary
Sandbox image updates
dockerfiles/Dockerfile.sandbox
Adds build-essential and libseccomp-dev, moves pypy3 binary path to /usr/bin/pypy3, runs ensurepip for pypy3, and installs the DMOJ judge-server via pip from a specific commit; retains Lean 4 toolchain steps.
OJBench dataset config & prep
nemo_skills/dataset/ojbench/__init__.py, nemo_skills/dataset/ojbench/prepare.py
Changes default EVAL_SPLIT to "test_python"; adds prepare.py to clone/update a Hugging Face repo (requires HF_TOKEN), transform prompts/full.jsonl (rename promptquestion, add subset_for_metrics), and emit per-language test_python.jsonl/test_cpp.jsonl.
Evaluator registration
nemo_skills/evaluation/evaluator/__init__.py
Registers eval_ojbench in the EVALUATOR_MAP under key "ojbench".
OJBench evaluator
nemo_skills/evaluation/evaluator/ojbench.py
New evaluator module providing OJBenchConfig, async sandbox_context, install_packages, eval_ojbench_async, and sync wrapper eval_ojbench; handles sandbox setup, package install, preprocessing, running OJBench inside sandbox, and merging results into samples.
Metrics addition & mapping
nemo_skills/evaluation/metrics/code_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
Adds OJBenchMetrics (uses prediction["is_passed"] for accuracy) and registers it in METRICS_MAP as "ojbench".
Pipeline: keep mounts plumbing
nemo_skills/pipeline/eval.py, nemo_skills/pipeline/generate.py, nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/exp.py, nemo_skills/pipeline/prepare_data.py, nemo_skills/pipeline/run_cmd.py, nemo_skills/pipeline/start_server.py, nemo_skills/pipeline/train.py
Introduces keep_mounts_for_sandbox CLI flag and BenchmarkArgs field; threads the flag through benchmark args extraction, prepare_eval_commands, task creation (add_task), and related call sites; computes per-job job_needs_sandbox_to_keep_mounts and conditionally preserves sandbox mounts.
Docs
docs/evaluation/code.md
Adds OJBench documentation under supported benchmarks with data preparation steps, sample run commands, and example metrics.
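The prep transform summarized above (rename prompt → question, add subset_for_metrics, emit per-language test sets) might look roughly like this. The `language` key and the source of the subset tag are assumptions for illustration; the real prepare.py may structure this differently.

```python
import json

def transform(lines, language):
    """Rename 'prompt' -> 'question', tag each sample with a metrics subset,
    and keep only samples for the requested language (assumed 'language' key)."""
    out = []
    for line in lines:
        sample = json.loads(line)
        if sample.get("language") != language:
            continue
        sample["question"] = sample.pop("prompt")
        sample["subset_for_metrics"] = sample.get("dataset", "ojbench")
        out.append(sample)
    return out

# Toy stand-in for prompts/full.jsonl:
raw = [
    json.dumps({"prompt": "Sum two ints", "language": "python", "dataset": "NOI"}),
    json.dumps({"prompt": "Graph walk", "language": "cpp", "dataset": "ICPC"}),
]
python_split = transform(raw, "python")  # would be written to test_python.jsonl
```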

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant CLI as Pipeline CLI
  participant Prep as prepare_eval_commands
  participant Manager as Task Manager (add_task)
  participant Sandbox as Sandbox Manager
  participant Eval as eval_ojbench
  participant OJ as OJBench (in-sandbox)
  participant FS as Filesystem

  User->>CLI: run eval --benchmark ojbench [--keep-mounts-for-sandbox]
  CLI->>Prep: prepare_eval_commands(keep_mounts_for_sandbox)
  Prep-->>CLI: job_batches (needs_sandbox, keep_mounts, mount_paths)

  CLI->>Manager: add_task(..., keep_mounts_for_sandbox, sandbox_mount_paths)
  Manager->>Sandbox: get_sandbox(config, keep_mounts, mount_paths)
  Sandbox-->>Manager: sandbox handle

  Manager->>Eval: eval_ojbench(cfg)
  Eval->>Sandbox: install_packages() (clone & pip install inside sandbox)
  Eval->>FS: preprocess JSONL inputs
  Eval->>OJ: run ojbench evaluator inside sandbox
  OJ-->>Eval: per-file _eval_results.json
  Eval->>FS: merge results into samples
  Eval-->>Manager: evaluation complete

  Manager->>Sandbox: release (respect keep_mounts_for_sandbox)
  Sandbox-->>Manager: released
  Manager-->>CLI: results with is_passed/metrics
```
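Per the module summary, eval_ojbench is a synchronous wrapper over eval_ojbench_async. A minimal sketch of that pattern, with the sandbox and evaluation work stubbed out (the real signatures and body differ):

```python
import asyncio

async def eval_ojbench_async(cfg):
    """Stub for the real async evaluator: set up sandbox, install packages,
    run OJBench, merge results. Here it just echoes the config."""
    await asyncio.sleep(0)  # placeholder for real async sandbox calls
    return {"evaluated": True, "cfg": cfg}

def eval_ojbench(cfg):
    """Synchronous entry point of the kind registered in EVALUATOR_MAP."""
    return asyncio.run(eval_ojbench_async(cfg))

result = eval_ojbench({"benchmark": "ojbench"})
```

Wrapping with `asyncio.run` keeps the registry's call sites synchronous while the internals stay async; forgetting the wrapper is exactly what produced the "'install_packages' was never awaited" warning fixed in a later commit.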

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hop through code and sandbox gates,
PyPy tuned and judges wait,
OJBench hums and verdicts spring,
Metrics count each passing thing.
A carrot for CI — tests sing! 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed. The title “Evaluation on OJBench” directly captures the primary focus of the pull request—adding evaluation support for the OJBench benchmark—and concisely reflects the main change. It is specific, avoids generic phrasing, and gives a teammate scanning history an immediate understanding of the PR’s intent.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7709ed4 and cc803a9.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/eval.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
nemo_skills/pipeline/eval.py

339-339: Unpacked variable job_server_address is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/pipeline/eval.py (2)

174-177: LGTM! Warning message addresses safety concerns.

The new parameter includes an appropriate warning about the risks of keeping mounts for sandbox execution.


315-315: LGTM! Parameter correctly threaded through the pipeline.

The keep_mounts_for_sandbox parameter is properly propagated to prepare_eval_commands (line 315), combined with the job-level flag using OR logic when creating tasks (line 357), and passed to the judge generation pipeline (line 424).

Also applies to: 333-341, 357-357, 424-424



@wasiahmad marked this pull request as draft September 25, 2025 21:48

coderabbitai bot (Contributor) left a comment
Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

26-63: Fix the OJBench installer URL before exposing the evaluator

Registering eval_ojbench here will always surface an ImportError, because the implementation calls install_from_git("git+https://github.com/He-Ren/OJBench/tree/main"), and pip does not understand the /tree/main suffix. The install attempt fails every time, so the evaluator never runs. Please update the git spec to the canonical .git@branch form before wiring this evaluator in.

-        install_from_git("git+https://github.com/He-Ren/OJBench/tree/main")
+        install_from_git("git+https://github.com/He-Ren/OJBench.git@main")
🧹 Nitpick comments (2)
nemo_skills/evaluation/metrics/code_metrics.py (1)

118-120: Preserve the original prediction payload in the fallback path.

Returning a bare {"is_passed": False} drops any auxiliary fields (token counts, timing, etc.) that downstream logic might still expect when length filtering swaps in the incorrect sample, and it also triggers the Ruff ARG002 warning. Copy the incoming prediction before overriding is_passed so the structure stays intact.

-    def get_incorrect_sample(self, prediction: dict) -> dict:
-        return {"is_passed": False}
+    def get_incorrect_sample(self, prediction: dict) -> dict:
+        prediction = prediction.copy()
+        prediction["is_passed"] = False
+        return prediction
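As a runnable illustration of why the copy matters, using the method shape from the suggestion above:

```python
class OJBenchMetrics:
    """Illustration of the suggested fallback that preserves auxiliary fields."""

    def get_incorrect_sample(self, prediction: dict) -> dict:
        prediction = prediction.copy()  # shallow copy keeps token counts, timing, etc.
        prediction["is_passed"] = False
        return prediction

metrics = OJBenchMetrics()
original = {"is_passed": True, "num_tokens": 512, "gen_seconds": 3.2}
fallback = metrics.get_incorrect_sample(original)
```

The fallback flips only `is_passed`; the auxiliary fields survive and the caller's dict is left untouched.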
dockerfiles/Dockerfile.sandbox (1)

79-92: Remove duplicate LISTEN_PORT env

LISTEN_PORT is set twice to 6000; drop the duplicate.

Apply this diff:

-ENV LISTEN_PORT=6000
 RUN echo "uwsgi_read_timeout 14400s;" > /etc/nginx/conf.d/custom_timeout.conf
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4784668 and a3c890d.

📒 Files selected for processing (8)
  • dockerfiles/Dockerfile.sandbox (1 hunks)
  • nemo_skills/dataset/livecodebench-pro/__init__.py (1 hunks)
  • nemo_skills/dataset/ojbench/__init__.py (1 hunks)
  • nemo_skills/dataset/ojbench/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/code.py (2 hunks)
  • nemo_skills/evaluation/metrics/code_metrics.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/code.py (1)
  • eval_ojbench (350-397)
nemo_skills/evaluation/metrics/code_metrics.py (3)
nemo_skills/evaluation/metrics/base.py (5)
  • BaseMetrics (23-434)
  • _get_score_dict (124-143)
  • get_incorrect_sample (200-206)
  • update (145-189)
  • _compute_pass_at_k (352-423)
nemo_skills/evaluation/metrics/lean4_metrics.py (3)
  • _get_score_dict (23-24)
  • get_incorrect_sample (26-29)
  • update (46-48)
nemo_skills/evaluation/metrics/ruler_metrics.py (3)
  • _get_score_dict (19-20)
  • get_incorrect_sample (26-29)
  • update (22-24)
nemo_skills/evaluation/evaluator/code.py (1)
nemo_skills/file_utils.py (1)
  • unroll_files (21-32)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/code_metrics.py (1)
  • OJBenchMetrics (114-123)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/metrics/code_metrics.py

118-118: Unused method argument: prediction

(ARG002)

nemo_skills/evaluation/evaluator/code.py

389-389: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/dataset/ojbench/prepare.py

41-41: subprocess call: check for execution of untrusted input

(S603)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/evaluation/metrics/map_metrics.py (1)

24-58: Mapping addition looks solid.

Importing OJBenchMetrics and wiring it into METRICS_MAP keeps the registry consistent with existing entries.

nemo_skills/dataset/ojbench/__init__.py (1)

16-20: Defaults look good

These defaults align the dataset with the new OJBench evaluator and metrics wiring.

dockerfiles/Dockerfile.sandbox (1)

27-29: Pin judge-server revision and optimize installation

  • Replace hardcoded install with:
- RUN /usr/bin/pypy3 -m pip install git+https://github.com/DMOJ/judge-server.git@f098cd3a49a60186d1fadde5132329ec5f4f2213
+ ARG JUDGE_SERVER_REF=f098cd3a49a60186d1fadde5132329ec5f4f2213
+ RUN /usr/bin/pypy3 -m pip install --no-cache-dir "git+https://github.com/DMOJ/judge-server.git@${JUDGE_SERVER_REF}"
  • Verify the evaluator actually invokes judge-server via the PyPy install (e.g. ensure PATH includes the PyPy bin or it’s called with /usr/bin/pypy3 -m judge_server).

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
…_packages' was never awaited
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/evaluation/code.md (2)

208-210: Use descriptive link text.

Markdownlint flags “here” as non-descriptive. Rename the link so it tells readers what they’ll open.

-- Original benchmark source is [here](https://github.com/He-Ren/OJBench/tree/main).
+- Original benchmark source is [the OJBench repository](https://github.com/He-Ren/OJBench/tree/main).

216-217: Add languages to fenced code blocks.

The new ns command snippets and the JSON metrics block need language identifiers to satisfy MD040 and improve readability.

-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench

@@
-```
+```bash
ns eval
@@
++inference.tokens_to_generate=32768

@@
-```
+```json
{
  "ojbench": {

Also applies to: 229-244, 249-283

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d527518 and 730efbf.

📒 Files selected for processing (1)
  • docs/evaluation/code.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

204-204: Link text should be descriptive

(MD059, descriptive-link-text)


209-209: Link text should be descriptive

(MD059, descriptive-link-text)


215-215: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


229-229: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


248-248: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/prepare_data.py (1)

93-96: Quote data_dir in shell command to avoid injection/spacing issues.

data_dir is interpolated into a shell string without quoting; paths with spaces or malicious characters can break or be exploited.

Apply this minimal fix:

+import shlex
@@
-    if data_dir:
-        command += f" && mkdir -p {data_dir} && cp -r /nemo_run/code/nemo_skills/dataset/* {data_dir}"
+    if data_dir:
+        dd = shlex.quote(data_dir)
+        command += f" && mkdir -p {dd} && cp -r /nemo_run/code/nemo_skills/dataset/* {dd}"
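A quick runnable illustration of what shlex.quote does to a hostile path (the `/src/*` source path here is invented for the example):

```python
import shlex

data_dir = "/data/my results; rm -rf /"  # path with spaces and shell metacharacters
quoted = shlex.quote(data_dir)
# The metacharacters end up inside single quotes, so the shell treats the
# whole thing as one literal argument:
command = f"mkdir -p {quoted} && cp -r /src/* {quoted}"
```

Without the quoting, the `;` would terminate the `mkdir` command and the remainder would run as a separate shell command.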
♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)

355-360: sandbox_mount_paths likely expects a list; mount_paths here is a CLI string.

Passing the raw string risks downstream type errors. Convert to a list (e.g., split and strip) or pass the resolved list returned/recorded by resolve_mount_paths.

-                    keep_mounts_for_sandbox=job_needs_sandbox_to_keep_mounts or keep_mounts_for_sandbox,
-                    sandbox_mount_paths=mount_paths,
+                    keep_mounts_for_sandbox=job_needs_sandbox_to_keep_mounts or keep_mounts_for_sandbox,
+                    sandbox_mount_paths=[p.strip() for p in mount_paths.split(",")] if isinstance(mount_paths, str) and mount_paths else mount_paths,

Run to confirm the expected type for sandbox_mount_paths and how resolve_mount_paths handles mount_paths:

#!/bin/bash
# Inspect add_task signature and sandbox_mount_paths type
rg -nP 'def\s+add_task\(' -n nemo_skills -S -g '!**/venv/**' -g '!**/site-packages/**'
rg -n 'sandbox_mount_paths' nemo_skills -n -C3

# Inspect resolve_mount_paths behavior
rg -n 'def resolve_mount_paths' nemo_skills -n -C5
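As a standalone sketch of the suggested string-to-list normalization (the helper name is hypothetical; lists pass through unchanged):

```python
def normalize_mount_paths(mount_paths):
    """Split a comma-separated CLI string into a clean list; pass lists through."""
    if isinstance(mount_paths, str) and mount_paths:
        return [p.strip() for p in mount_paths.split(",")]
    return mount_paths

paths = normalize_mount_paths(" /data , /results,/cache ")
unchanged = normalize_mount_paths(["/already", "/a-list"])
```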
🧹 Nitpick comments (7)
docs/evaluation/code.md (1)

215-217: Consider adding language specifiers to code blocks.

The code blocks would benefit from explicit language markers for proper syntax highlighting:

  • Line 215: Should be ```bash for the shell command
  • Line 229: Should be ```bash for the ns eval command
  • Line 248: Should be ```json for the metrics output

Apply these changes:

At line 215:

-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench

At line 229:
-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \

At line 248:

-```
+```json
 {
   "ojbench": {

Also applies to: 229-244, 248-283

nemo_skills/evaluation/evaluator/ojbench.py (1)

79-91: Consider: Original file format lost before evaluation completes.

The code overwrites the original JSONL file at line 90-91 with preprocessed samples before evaluation runs (lines 107-112). While samples remain in memory, if the process crashes or is interrupted between preprocessing and the final write (line 128-130), the original field names (question, completion) are permanently replaced with OJBench format (prompt, content).

Since past review flagged this as important ("very important to only replace files at the very end") and the preprocessing is only needed for ojbench.judge_jsonl, consider writing preprocessed samples to a temporary file and leaving the original unchanged until final results are merged.
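One way to realize this suggestion — write the judge-format JSONL to a temporary file in the same directory and replace the target only after everything has succeeded — is sketched below. The function name and the field names are illustrative, not the evaluator's actual API.

```python
import json
import os
import tempfile

def write_judge_jsonl(samples, final_path):
    """Write preprocessed samples to a temp file next to the target, then
    atomically swap it in; on any failure the original file is untouched."""
    target_dir = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as f:
            for sample in samples:
                f.write(json.dumps(sample) + "\n")
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # leave no temp debris behind
        raise

samples = [{"prompt": "p", "content": "c"}]
write_judge_jsonl(samples, "judge.jsonl")
with open("judge.jsonl") as f:
    reloaded = [json.loads(line) for line in f]
```

Because `os.replace` is the last step, a crash mid-evaluation leaves the original file with its original field names intact.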

nemo_skills/pipeline/start_server.py (1)

65-68: Guard against invalid/unsafe flag combo

If keep_mounts_for_sandbox is True while with_sandbox is False, the flag is meaningless and could confuse users. Add an explicit check to fail fast.

@@
     try:
         server_type = server_type.value
     except AttributeError:
         pass
+
+    # keep_mounts_for_sandbox only applies when a sandbox is launched
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
 
     log_dir = check_mounts(cluster_config, log_dir, check_mounted_paths=check_mounted_paths)
nemo_skills/pipeline/generate.py (1)

133-136: Fail fast when keep_mounts_for_sandbox is set without sandbox

Avoid silent no-ops or confusion by rejecting keep_mounts_for_sandbox unless with_sandbox is enabled.

@@
     try:
         server_type = server_type.value
     except AttributeError:
         pass
+
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
nemo_skills/pipeline/train.py (1)

250-253: Add validation for keep_mounts_for_sandbox usage

Same rationale as other CLIs: reject keep_mounts_for_sandbox when with_sandbox is False.

@@
     try:
         training_algo = training_algo.value
     except AttributeError:
         pass
+
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
nemo_skills/pipeline/run_cmd.py (1)

61-63: Optional: fix typing for server_gpus.

Default is None but annotated as int. Consider Optional[int] for clarity and static tooling.

-    server_gpus: int = typer.Option(None, help="Number of GPUs to use if hosting the model"),
+    server_gpus: int | None = typer.Option(None, help="Number of GPUs to use if hosting the model"),
nemo_skills/pipeline/utils/eval.py (1)

371-376: Fix condition/message: only warn when the benchmark requires keeping mounts.

Current check uses requires_sandbox, but the message is about “requires sandbox to keep mounts.” Use the keep_mounts_for_sandbox flag instead; also the text says “enabling it” though no state is changed here.

-            if benchmark_args.requires_sandbox and not keep_mounts_for_sandbox:
-                LOG.warning("Found benchmark (%s) which requires sandbox to keep mounts, enabling it.", benchmark)
+            if benchmark_args.keep_mounts_for_sandbox and not keep_mounts_for_sandbox:
+                LOG.warning(
+                    "Benchmark (%s) requires sandbox to keep mounts; mounts will be kept for its jobs.",
+                    benchmark,
+                )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 730efbf and bff6d70.

📒 Files selected for processing (10)
  • docs/evaluation/code.md (1 hunks)
  • nemo_skills/dataset/ojbench/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ojbench.py (1 hunks)
  • nemo_skills/pipeline/eval.py (5 hunks)
  • nemo_skills/pipeline/generate.py (2 hunks)
  • nemo_skills/pipeline/prepare_data.py (2 hunks)
  • nemo_skills/pipeline/run_cmd.py (2 hunks)
  • nemo_skills/pipeline/start_server.py (2 hunks)
  • nemo_skills/pipeline/train.py (2 hunks)
  • nemo_skills/pipeline/utils/eval.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/ojbench.py (3)
nemo_skills/code_execution/sandbox.py (2)
  • get_sandbox (419-422)
  • close (77-79)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (35-91)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • nested_dataclass (49-82)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

204-204: Link text should be descriptive

(MD059, descriptive-link-text)


209-209: Link text should be descriptive

(MD059, descriptive-link-text)


215-215: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


229-229: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


248-248: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.1)
nemo_skills/evaluation/evaluator/ojbench.py

58-58: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


98-98: Use explicit conversion flag

Replace with conversion flag

(RUF010)


100-100: Use explicit conversion flag

Replace with conversion flag

(RUF010)


101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


115-115: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/pipeline/eval.py

339-339: Unpacked variable job_server_address is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

nemo_skills/dataset/ojbench/prepare.py

46-46: subprocess call: check for execution of untrusted input

(S603)


46-46: Starting a process with a partial executable path

(S607)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (13)
nemo_skills/dataset/ojbench/prepare.py (3)

22-28: LGTM! Token validation properly addresses past feedback.

The HF_TOKEN validation with clear error messages addresses the previous review concerns effectively.


31-57: LGTM! Token sanitization properly addresses security concerns.

The error handling correctly sanitizes the HF_TOKEN from error messages (lines 52-56), addressing previous security feedback. The static analysis warnings about subprocess calls are false positives in this context—the git executable is validated and the command construction is deterministic.


60-91: LGTM! Subdirectory approach addresses past critical issue.

The clone destination correctly uses a subdirectory OJBench_testdata (line 63), which properly addresses the previous critical feedback about not deleting the package directory. The data processing logic is clean and well-structured.

docs/evaluation/code.md (1)

285-285: LGTM! Correctly references ojbench for repeated runs.

The documentation properly uses --benchmarks=ojbench:N for repeats, which addresses previous feedback.

nemo_skills/evaluation/evaluator/ojbench.py (3)

48-67: LGTM! Error handling properly uses exceptions.

The function correctly raises RuntimeError on failures, which is better than the warning approach mentioned in past reviews. The error messages are clear and helpful for debugging.


114-122: LGTM! Error handling and validation address past feedback.

The code properly:

  • Raises RuntimeError on evaluation failure (line 115) instead of warning
  • Validates result count matches sample count (lines 120-122)
  • Uses strict=True in zip (line 124)

These changes address previous review feedback effectively.


48-67: Sandbox contexts share the same long-lived server, so the editable install persists. Closing a LocalSandbox client only tears down the HTTP session, not the server or its working directory—both contexts connect to the same process, and pip install -e remains available.

Likely an incorrect or invalid review comment.

nemo_skills/pipeline/generate.py (1)

316-316: Good: flag is threaded to task creation

The new keep_mounts_for_sandbox is correctly propagated to add_task alongside with_sandbox. No further issues here.

nemo_skills/pipeline/train.py (1)

385-385: Propagation acknowledged

keep_mounts_for_sandbox is passed through to add_task for training jobs; consistent with the pipeline.

nemo_skills/pipeline/prepare_data.py (1)

58-61: Flag threaded correctly; help text is clear.

The new keep_mounts_for_sandbox option is wired into the CLI with a clear risk warning. Good addition.

nemo_skills/pipeline/run_cmd.py (1)

91-94: Propagation of keep_mounts_for_sandbox looks good.

Option is exposed in CLI and forwarded to add_task; matches other pipelines for consistency.

Also applies to: 180-193

nemo_skills/pipeline/utils/eval.py (2)

184-187: Good: benchmark‑level keep_mounts_for_sandbox is plumbed through.

Reading KEEP_MOUNTS_FOR_SANDBOX from the module and carrying it in BenchmarkArgs looks correct.

Also applies to: 220-234


516-531: Job flag for keep_mounts_for_sandbox is correct.

Aggregating job_needs_sandbox_to_keep_mounts from per-benchmark args cleanly informs task creation.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/ojbench.py (1)

48-67: Potential issue: Editable install may not persist to evaluation sandbox.

The install_packages function creates its own sandbox context (line 51), performs an editable install of OJBench (line 60), then closes that sandbox (line 45). Later, eval_ojbench_async creates a separate sandbox context (line 78) for evaluation.

If these are different sandbox instances, the editable install (pip install -e OJBench) won't be available in the evaluation sandbox because:

  1. The cloned OJBench directory is in the first sandbox's filesystem
  2. The pip installation metadata points to that now-closed sandbox's paths

Consider one of these approaches:

  1. Perform installation once in the evaluation sandbox (recommended):
 async def eval_ojbench_async(cfg):
     eval_config = OJBenchConfig(**cfg.eval_config)
     problem_dirs = [
         Path(cfg.data_dir, "ojbench/OJBench_testdata/NOI"),
         Path(cfg.data_dir, "ojbench/OJBench_testdata/ICPC"),
     ]
 
-    await install_packages(eval_config)
-
     async with sandbox_context(eval_config.sandbox) as sandbox:
+        # Install packages in the same sandbox used for evaluation
+        LOG.info("Installing required packages for ojbench evaluation...")
+        clone_cmd = "git clone https://github.com/He-Ren/OJBench.git"
+        result, _ = await sandbox.execute_code(clone_cmd, language="shell", timeout=300)
+        if result["process_status"] != "completed":
+            stderr = result.get("stderr", "Unknown error")
+            raise RuntimeError(f"Failed to clone OJBench repo: {stderr}")
+        
+        install_cmd = "pip install -e OJBench"
+        result, _ = await sandbox.execute_code(install_cmd, language="shell", timeout=300)
+        if result["process_status"] != "completed":
+            stderr = result.get("stderr", "Unknown error")
+            raise RuntimeError(f"Failed to install ojbench. Stderr: {stderr}")
+        LOG.info("Successfully installed ojbench.")
+        
         for jsonl_file_str in unroll_files(cfg.input_files):
  2. Use a non-editable install: Change line 60 to pip install OJBench (if the package structure supports this)

  3. Ensure sandbox persistence: If the sandbox implementation supports persistent volumes via keep_mounts_for_sandbox, ensure the OJBench clone directory is mounted persistently.

🧹 Nitpick comments (5)
docs/evaluation/code.md (3)

215-217: Add language specifiers to code blocks.

The code blocks are missing language specifiers, which affects syntax highlighting and accessibility.

Apply this diff:

-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench

And for the sample run command:

-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \

Based on static analysis hints.

Also applies to: 227-242


245-264: Consider adding language specifier to results output block.

The results output block would benefit from a language specifier (e.g., text or yaml) for better rendering.

Apply this diff:

-```
+```text
 ----------------------------- ojbench -----------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy

Based on static analysis hints.


204-209: Consider more descriptive link text.

While "here" is a common pattern in this documentation, more descriptive link text improves accessibility and scannability.

Example alternatives:

Based on static analysis hints.

nemo_skills/evaluation/evaluator/ojbench.py (2)

105-105: Verify num_workers=16 is appropriate.

The code hardcodes num_workers=16 for parallel evaluation. This may be too high for resource-constrained environments or too low for high-core-count systems. Consider making this configurable through OJBenchConfig.

Example:

 @nested_dataclass(kw_only=True)
 class OJBenchConfig:
     sandbox: dict = field(default_factory=lambda: {"sandbox_type": "local"})
     timeout: int = 6
+    num_workers: int = 16

Then use eval_config.num_workers in the eval code construction.


123-125: Consider raising exception instead of continue.

When there's a length mismatch between results and samples, the code logs an error and continues to the next file. This means the job appears to succeed but silently skips files with mismatches. Consider raising an exception instead to make failures explicit and prevent incomplete results from appearing successful.

             if len(results) != len(samples):
-                LOG.error(f"Result count mismatch for {jsonl_file}: {len(results)} results vs {len(samples)} samples")
-                continue
+                raise RuntimeError(
+                    f"Result count mismatch for {jsonl_file}: "
+                    f"{len(results)} results vs {len(samples)} samples"
+                )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d310d2 and 7709ed4.

📒 Files selected for processing (4)
  • docs/evaluation/code.md (1 hunks)
  • nemo_skills/evaluation/evaluator/ojbench.py (1 hunks)
  • nemo_skills/pipeline/prepare_data.py (3 hunks)
  • nemo_skills/pipeline/utils/exp.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/utils/exp.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/ojbench.py (3)
nemo_skills/code_execution/sandbox.py (2)
  • get_sandbox (419-422)
  • close (77-79)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (35-91)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • nested_dataclass (49-82)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

204-204: Link text should be descriptive

(MD059, descriptive-link-text)


209-209: Link text should be descriptive

(MD059, descriptive-link-text)


215-215: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


227-227: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


245-245: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.1)
nemo_skills/evaluation/evaluator/ojbench.py

58-58: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


103-103: Use explicit conversion flag

Replace with conversion flag

(RUF010)


104-104: Use explicit conversion flag

Replace with conversion flag

(RUF010)


118-118: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/pipeline/prepare_data.py (3)

30-30: LGTM: OJBench dataset added to data directory requirement list.

The addition of "ojbench" to DATASETS_REQUIRE_DATA_DIR aligns with the PR objectives and ensures that users provide a data_dir when preparing OJBench data, consistent with other large datasets like "ruler" and "ioi24".


58-61: LGTM: Security warning appropriately emphasizes the data loss risk.

The new keep_mounts_for_sandbox parameter includes a clear warning about the risks of persisting mounts when the sandbox executes LLM-generated commands. The conservative default of False is appropriate.


146-146: LGTM: Parameter correctly forwarded to _run_cmd.

The keep_mounts_for_sandbox parameter is properly forwarded to _run_cmd using explicit named argument passing, consistent with the other parameters in the call.

docs/evaluation/code.md (2)

213-219: LGTM! Clear data preparation instructions.

The documentation properly notes the required --data_dir flag, large download size (15GB), and HF_TOKEN requirement. This aligns with past review feedback.


244-244: Documentation path is accurate. The pipeline’s add_task call writes logs to <OUTPUT_DIR>/eval-results/<benchmark>/summarized-results, and the log files are prefixed main (or main_<idx>), matching main_*. No update needed.

nemo_skills/evaluation/evaluator/ojbench.py (5)

35-35: Verify timeout default is appropriate.

The default timeout of 6 seconds seems very short for code evaluation tasks. At line 113, this is multiplied by the number of samples and adds 60 seconds, but for a single sample this would only be 66 seconds total. Please confirm this is sufficient for OJBench problems, which may involve complex test cases.
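The timeout arithmetic described above (per-sample timeout scaled by batch size, plus a fixed buffer) can be sketched as:

```python
# Sketch of the dynamic-timeout calculation the review refers to:
# per-sample timeout multiplied by the number of samples, plus a
# fixed 60-second startup buffer.
def total_timeout(per_sample_timeout: int, num_samples: int, buffer: int = 60) -> int:
    return per_sample_timeout * num_samples + buffer


# With the default 6s per-sample timeout, a single sample gets 66s total.
print(total_timeout(6, 1))    # 66
print(total_timeout(6, 100))  # 660
```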


79-97: LGTM! Proper preprocessing with safe file handling.

The preprocessing correctly:

  • Writes to a separate eval-input- file instead of overwriting the original
  • Transforms fields appropriately for OJBench format
  • Applies code preprocessing with language-specific handling

This addresses previous concerns about data loss from overwriting input files.
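The safe-file-handling pattern praised here can be sketched as follows (function and field names are illustrative, not the evaluator's actual API):

```python
import json
from pathlib import Path


# Sketch: write transformed samples to a sibling "eval-input-" file so the
# original predictions file is never overwritten before evaluation succeeds.
def write_eval_input(jsonl_file: str, samples: list[dict]) -> Path:
    src = Path(jsonl_file)
    dst = src.with_name(f"eval-input-{src.name}")
    with dst.open("w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
    return dst
```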


99-115: Good: Clean environment and proper escaping.

The evaluation execution correctly:

  • Uses env -i to ensure a clean environment
  • Applies shlex.quote for safe shell command construction
  • Calculates timeout dynamically based on number of samples
  • Sets reasonable output limits (100k characters)

Note: Static analysis suggests using !r conversion flags instead of repr() (lines 101, 103, 104), but the current approach is equally valid and more explicit.
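The quoting pattern being praised can be sketched like this (the script and flag names are illustrative, not the evaluator's real command line):

```python
import shlex


# Sketch of the safe command construction noted above: every interpolated
# path goes through shlex.quote, and `env -i` strips the inherited
# environment before the judge process starts.
def build_cmd(script: str, input_file: str, timeout_s: int) -> str:
    return (
        "env -i "
        f"python {shlex.quote(script)} "
        f"--input {shlex.quote(input_file)} "
        f"--timeout {timeout_s}"
    )


# Paths with spaces or shell metacharacters come out safely single-quoted.
print(build_cmd("run_judge.py", "eval-input-output 1.jsonl", 66))
```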


127-133: LGTM! Safe result merging and file updates.

The result merging correctly:

  • Uses strict=True with zip to catch length mismatches at runtime
  • Merges verdict and pass/fail status into samples
  • Writes back to the original file only after successful evaluation

This addresses previous review concerns about data integrity.


136-138: LGTM! Clean synchronous wrapper.

The synchronous wrapper properly uses asyncio.run() to execute the async evaluation function, following standard Python async patterns.
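For reference, the wrapper pattern looks like this (function bodies are stand-ins, not the evaluator's actual logic):

```python
import asyncio


# Stand-in for the real async evaluation coroutine.
async def eval_ojbench_async(cfg: dict) -> str:
    await asyncio.sleep(0)  # placeholder for sandboxed evaluation work
    return f"evaluated with timeout={cfg['timeout']}"


# Thin synchronous entry point: asyncio.run creates an event loop,
# drives the coroutine to completion, and tears the loop down.
def eval_ojbench(cfg: dict) -> str:
    return asyncio.run(eval_ojbench_async(cfg))
```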

@wasiahmad wasiahmad requested a review from Kipok October 1, 2025 02:48
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@Kipok Kipok merged commit bfd7d93 into main Oct 1, 2025
6 checks passed
@Kipok Kipok deleted the feat/ojbench branch October 1, 2025 04:27
wasiahmad added a commit that referenced this pull request Oct 1, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Dec 18, 2025
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>