
Evaluation on OJBench #848

Merged
Kipok merged 60 commits into main from feat/ojbench on Oct 1, 2025

Conversation

@wasiahmad (Collaborator) commented Sep 25, 2025

In this PR, we are adding evaluation support for OJBench through nemo-skills. Primary changes are:

  • Dataset download through nemo-skills/dataset/ojbench
  • Evaluation logic implemented at nemo_skills/evaluation/evaluator/code.py

Summary by CodeRabbit

  • New Features

    • OJBench support: evaluator with async sandboxed runner, new OJBench metrics, and registration.
    • Dataset preparation tool to fetch/transform prompts into language-specific test sets.
    • Global and per-job CLI options to preserve sandbox mounts, propagated across pipeline and task creation.
  • Bug Fixes / Defaults

    • Default evaluation split changed to use the Python test split.
  • Chores

    • Sandbox base image updated with PyPy tooling and a preinstalled judge server.
  • Documentation

    • Added OJBench docs with setup and sample run guidance.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
coderabbitai bot (Contributor) commented Sep 25, 2025

Walkthrough

Adds OJBench dataset support (prep, evaluator, metrics, docs), updates default eval split, installs PyPy3 and judge-server in the sandbox Dockerfile, and threads a new keep_mounts_for_sandbox flag through pipeline CLI, task creation, and data-prep tooling.
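The evaluator and metrics registration mentioned in this walkthrough can be sketched as follows. This is a minimal illustration of the registry pattern only, not the actual nemo-skills code; the stub bodies and signatures of `eval_ojbench` and `OJBenchMetrics` are assumptions based on the summary.

```python
# Minimal sketch of the registry pattern used to expose a new benchmark.
# The map names (EVALUATOR_MAP, METRICS_MAP) and the "ojbench" key follow
# the PR description; the stub implementations are illustrative.

def eval_ojbench(cfg):
    """Placeholder evaluator entry point (the real one runs in a sandbox)."""
    return {"benchmark": "ojbench", "config": cfg}

class OJBenchMetrics:
    """Placeholder metrics class keyed on prediction['is_passed']."""

EVALUATOR_MAP = {"ojbench": eval_ojbench}
METRICS_MAP = {"ojbench": OJBenchMetrics}

# Dispatch by benchmark name, as the pipeline presumably does:
evaluator = EVALUATOR_MAP["ojbench"]
result = evaluator({"split": "test_python"})
```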

Changes

Cohort / File(s) Summary
Sandbox image updates
dockerfiles/Dockerfile.sandbox
Adds build-essential and libseccomp-dev, moves pypy3 binary path to /usr/bin/pypy3, runs ensurepip for pypy3, and installs the DMOJ judge-server via pip from a specific commit; retains Lean 4 toolchain steps.
OJBench dataset config & prep
nemo_skills/dataset/ojbench/__init__.py, nemo_skills/dataset/ojbench/prepare.py
Changes default EVAL_SPLIT to "test_python"; adds prepare.py to clone/update a Hugging Face repo (requires HF_TOKEN), transform prompts/full.jsonl (rename promptquestion, add subset_for_metrics), and emit per-language test_python.jsonl/test_cpp.jsonl.
Evaluator registration
nemo_skills/evaluation/evaluator/__init__.py
Registers eval_ojbench in the EVALUATOR_MAP under key "ojbench".
OJBench evaluator
nemo_skills/evaluation/evaluator/ojbench.py
New evaluator module providing OJBenchConfig, async sandbox_context, install_packages, eval_ojbench_async, and sync wrapper eval_ojbench; handles sandbox setup, package install, preprocessing, running OJBench inside sandbox, and merging results into samples.
Metrics addition & mapping
nemo_skills/evaluation/metrics/code_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
Adds OJBenchMetrics (uses prediction["is_passed"] for accuracy) and registers it in METRICS_MAP as "ojbench".
Pipeline: keep mounts plumbing
nemo_skills/pipeline/eval.py, nemo_skills/pipeline/generate.py, nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/exp.py, nemo_skills/pipeline/prepare_data.py, nemo_skills/pipeline/run_cmd.py, nemo_skills/pipeline/start_server.py, nemo_skills/pipeline/train.py
Introduces keep_mounts_for_sandbox CLI flag and BenchmarkArgs field; threads the flag through benchmark args extraction, prepare_eval_commands, task creation (add_task), and related call sites; computes per-job job_needs_sandbox_to_keep_mounts and conditionally preserves sandbox mounts.
Docs
docs/evaluation/code.md
Adds OJBench documentation under supported benchmarks with data preparation steps, sample run commands, and example metrics.
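The prep transform summarized above (rename prompt → question, add subset_for_metrics, emit per-language test sets) might look roughly like this. The `language` key and the source of the subset tag are assumptions for illustration; the real prepare.py may structure this differently.

```python
import json

def transform(lines, language):
    """Rename 'prompt' -> 'question', tag each sample with a metrics subset,
    and keep only samples for the requested language (assumed 'language' key)."""
    out = []
    for line in lines:
        sample = json.loads(line)
        if sample.get("language") != language:
            continue
        sample["question"] = sample.pop("prompt")
        sample["subset_for_metrics"] = sample.get("dataset", "ojbench")
        out.append(sample)
    return out

# Toy stand-in for prompts/full.jsonl:
raw = [
    json.dumps({"prompt": "Sum two ints", "language": "python", "dataset": "NOI"}),
    json.dumps({"prompt": "Graph walk", "language": "cpp", "dataset": "ICPC"}),
]
python_split = transform(raw, "python")  # would be written to test_python.jsonl
```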

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant CLI as Pipeline CLI
  participant Prep as prepare_eval_commands
  participant Manager as Task Manager (add_task)
  participant Sandbox as Sandbox Manager
  participant Eval as eval_ojbench
  participant OJ as OJBench (in-sandbox)
  participant FS as Filesystem

  User->>CLI: run eval --benchmark ojbench [--keep-mounts-for-sandbox]
  CLI->>Prep: prepare_eval_commands(keep_mounts_for_sandbox)
  Prep-->>CLI: job_batches (needs_sandbox, keep_mounts, mount_paths)

  CLI->>Manager: add_task(..., keep_mounts_for_sandbox, sandbox_mount_paths)
  Manager->>Sandbox: get_sandbox(config, keep_mounts, mount_paths)
  Sandbox-->>Manager: sandbox handle

  Manager->>Eval: eval_ojbench(cfg)
  Eval->>Sandbox: install_packages() (clone & pip install inside sandbox)
  Eval->>FS: preprocess JSONL inputs
  Eval->>OJ: run ojbench evaluator inside sandbox
  OJ-->>Eval: per-file _eval_results.json
  Eval->>FS: merge results into samples
  Eval-->>Manager: evaluation complete

  Manager->>Sandbox: release (respect keep_mounts_for_sandbox)
  Sandbox-->>Manager: released
  Manager-->>CLI: results with is_passed/metrics
```
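Per the module summary, eval_ojbench is a synchronous wrapper over eval_ojbench_async. A minimal sketch of that pattern, with the sandbox and evaluation work stubbed out (the real signatures and body differ):

```python
import asyncio

async def eval_ojbench_async(cfg):
    """Stub for the real async evaluator: set up sandbox, install packages,
    run OJBench, merge results. Here it just echoes the config."""
    await asyncio.sleep(0)  # placeholder for real async sandbox calls
    return {"evaluated": True, "cfg": cfg}

def eval_ojbench(cfg):
    """Synchronous entry point of the kind registered in EVALUATOR_MAP."""
    return asyncio.run(eval_ojbench_async(cfg))

result = eval_ojbench({"benchmark": "ojbench"})
```

Wrapping with `asyncio.run` keeps the registry's call sites synchronous while the internals stay async; forgetting the wrapper is exactly what produced the "'install_packages' was never awaited" warning fixed in a later commit.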

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I hop through code and sandbox gates,
PyPy tuned and judges wait,
OJBench hums and verdicts spring,
Metrics count each passing thing.
A carrot for CI — tests sing! 🥕🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed. The title “Evaluation on OJBench” directly captures the primary focus of the pull request—adding evaluation support for the OJBench benchmark—and concisely reflects the main change. It is specific, avoids generic phrasing, and gives a teammate scanning history an immediate understanding of the PR’s intent.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7709ed4 and cc803a9.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/eval.py (5 hunks)
🧰 Additional context used
🪛 Ruff (0.13.1)
nemo_skills/pipeline/eval.py

339-339: Unpacked variable job_server_address is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (2)
nemo_skills/pipeline/eval.py (2)

174-177: LGTM! Warning message addresses safety concerns.

The new parameter includes an appropriate warning about the risks of keeping mounts for sandbox execution.


315-315: LGTM! Parameter correctly threaded through the pipeline.

The keep_mounts_for_sandbox parameter is properly propagated to prepare_eval_commands (line 315), combined with the job-level flag using OR logic when creating tasks (line 357), and passed to the judge generation pipeline (line 424).

Also applies to: 333-341, 357-357, 424-424



@wasiahmad marked this pull request as draft September 25, 2025 21:48

coderabbitai bot (Contributor) left a comment
Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

26-63: Fix the OJBench installer URL before exposing the evaluator

Registering eval_ojbench here will always surface an ImportError, because the implementation calls install_from_git("git+https://github.com/He-Ren/OJBench/tree/main"), and pip does not understand the /tree/main suffix. The install attempt fails every time, so the evaluator never runs. Please update the git spec to the canonical .git@branch form before wiring this evaluator in.

-        install_from_git("git+https://github.com/He-Ren/OJBench/tree/main")
+        install_from_git("git+https://github.com/He-Ren/OJBench.git@main")
🧹 Nitpick comments (2)
nemo_skills/evaluation/metrics/code_metrics.py (1)

118-120: Preserve the original prediction payload in the fallback path.

Returning a bare {"is_passed": False} drops any auxiliary fields (token counts, timing, etc.) that downstream logic might still expect when length filtering swaps in the incorrect sample, and it also triggers the Ruff ARG002 warning. Copy the incoming prediction before overriding is_passed so the structure stays intact.

-    def get_incorrect_sample(self, prediction: dict) -> dict:
-        return {"is_passed": False}
+    def get_incorrect_sample(self, prediction: dict) -> dict:
+        prediction = prediction.copy()
+        prediction["is_passed"] = False
+        return prediction
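As a runnable illustration of why the copy matters, using the method shape from the suggestion above:

```python
class OJBenchMetrics:
    """Illustration of the suggested fallback that preserves auxiliary fields."""

    def get_incorrect_sample(self, prediction: dict) -> dict:
        prediction = prediction.copy()  # shallow copy keeps token counts, timing, etc.
        prediction["is_passed"] = False
        return prediction

metrics = OJBenchMetrics()
original = {"is_passed": True, "num_tokens": 512, "gen_seconds": 3.2}
fallback = metrics.get_incorrect_sample(original)
```

The fallback flips only `is_passed`; the auxiliary fields survive and the caller's dict is left untouched.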
dockerfiles/Dockerfile.sandbox (1)

79-92: Remove duplicate LISTEN_PORT env

LISTEN_PORT is set twice to 6000; drop the duplicate.

Apply this diff:

-ENV LISTEN_PORT=6000
 RUN echo "uwsgi_read_timeout 14400s;" > /etc/nginx/conf.d/custom_timeout.conf
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4784668 and a3c890d.

📒 Files selected for processing (8)
  • dockerfiles/Dockerfile.sandbox (1 hunks)
  • nemo_skills/dataset/livecodebench-pro/__init__.py (1 hunks)
  • nemo_skills/dataset/ojbench/__init__.py (1 hunks)
  • nemo_skills/dataset/ojbench/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/code.py (2 hunks)
  • nemo_skills/evaluation/metrics/code_metrics.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/code.py (1)
  • eval_ojbench (350-397)
nemo_skills/evaluation/metrics/code_metrics.py (3)
nemo_skills/evaluation/metrics/base.py (5)
  • BaseMetrics (23-434)
  • _get_score_dict (124-143)
  • get_incorrect_sample (200-206)
  • update (145-189)
  • _compute_pass_at_k (352-423)
nemo_skills/evaluation/metrics/lean4_metrics.py (3)
  • _get_score_dict (23-24)
  • get_incorrect_sample (26-29)
  • update (46-48)
nemo_skills/evaluation/metrics/ruler_metrics.py (3)
  • _get_score_dict (19-20)
  • get_incorrect_sample (26-29)
  • update (22-24)
nemo_skills/evaluation/evaluator/code.py (1)
nemo_skills/file_utils.py (1)
  • unroll_files (21-32)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/code_metrics.py (1)
  • OJBenchMetrics (114-123)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/metrics/code_metrics.py

118-118: Unused method argument: prediction

(ARG002)

nemo_skills/evaluation/evaluator/code.py

389-389: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/dataset/ojbench/prepare.py

41-41: subprocess call: check for execution of untrusted input

(S603)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/evaluation/metrics/map_metrics.py (1)

24-58: Mapping addition looks solid.

Importing OJBenchMetrics and wiring it into METRICS_MAP keeps the registry consistent with existing entries.

nemo_skills/dataset/ojbench/__init__.py (1)

16-20: Defaults look good

These defaults align the dataset with the new OJBench evaluator and metrics wiring.

dockerfiles/Dockerfile.sandbox (1)

27-29: Pin judge-server revision and optimize installation

  • Replace hardcoded install with:
- RUN /usr/bin/pypy3 -m pip install git+https://github.com/DMOJ/judge-server.git@f098cd3a49a60186d1fadde5132329ec5f4f2213
+ ARG JUDGE_SERVER_REF=f098cd3a49a60186d1fadde5132329ec5f4f2213
+ RUN /usr/bin/pypy3 -m pip install --no-cache-dir "git+https://github.com/DMOJ/judge-server.git@${JUDGE_SERVER_REF}"
  • Verify the evaluator actually invokes judge-server via the PyPy install (e.g. ensure PATH includes the PyPy bin or it’s called with /usr/bin/pypy3 -m judge_server).

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
…_packages' was never awaited
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
docs/evaluation/code.md (2)

208-210: Use descriptive link text.

Markdownlint flags “here” as non-descriptive. Rename the link so it tells readers what they’ll open.

-- Original benchmark source is [here](https://github.com/He-Ren/OJBench/tree/main).
+- Original benchmark source is [the OJBench repository](https://github.com/He-Ren/OJBench/tree/main).

216-217: Add languages to fenced code blocks.

The new ns command snippets and the JSON metrics block need language identifiers to satisfy MD040 and improve readability.

-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench

@@
-```
+```bash
ns eval
@@
++inference.tokens_to_generate=32768

@@
-```
+```json
{
  "ojbench": {

Also applies to: 229-244, 249-283

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d527518 and 730efbf.

📒 Files selected for processing (1)
  • docs/evaluation/code.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

204-204: Link text should be descriptive

(MD059, descriptive-link-text)


209-209: Link text should be descriptive

(MD059, descriptive-link-text)


215-215: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


229-229: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


248-248: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/prepare_data.py (1)

93-96: Quote data_dir in shell command to avoid injection/spacing issues.

data_dir is interpolated into a shell string without quoting; paths with spaces or malicious characters can break or be exploited.

Apply this minimal fix:

+import shlex
@@
-    if data_dir:
-        command += f" && mkdir -p {data_dir} && cp -r /nemo_run/code/nemo_skills/dataset/* {data_dir}"
+    if data_dir:
+        dd = shlex.quote(data_dir)
+        command += f" && mkdir -p {dd} && cp -r /nemo_run/code/nemo_skills/dataset/* {dd}"
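A quick runnable illustration of what shlex.quote does to a hostile path (the `/src/*` source path here is invented for the example):

```python
import shlex

data_dir = "/data/my results; rm -rf /"  # path with spaces and shell metacharacters
quoted = shlex.quote(data_dir)
# The metacharacters end up inside single quotes, so the shell treats the
# whole thing as one literal argument:
command = f"mkdir -p {quoted} && cp -r /src/* {quoted}"
```

Without the quoting, the `;` would terminate the `mkdir` command and the remainder would run as a separate shell command.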
♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)

355-360: sandbox_mount_paths likely expects a list; mount_paths here is a CLI string.

Passing the raw string risks downstream type errors. Convert to a list (e.g., split and strip) or pass the resolved list returned/recorded by resolve_mount_paths.

-                    keep_mounts_for_sandbox=job_needs_sandbox_to_keep_mounts or keep_mounts_for_sandbox,
-                    sandbox_mount_paths=mount_paths,
+                    keep_mounts_for_sandbox=job_needs_sandbox_to_keep_mounts or keep_mounts_for_sandbox,
+                    sandbox_mount_paths=[p.strip() for p in mount_paths.split(",")] if isinstance(mount_paths, str) and mount_paths else mount_paths,

Run to confirm the expected type for sandbox_mount_paths and how resolve_mount_paths handles mount_paths:

#!/bin/bash
# Inspect add_task signature and sandbox_mount_paths type
rg -nP 'def\s+add_task\(' -n nemo_skills -S -g '!**/venv/**' -g '!**/site-packages/**'
rg -n 'sandbox_mount_paths' nemo_skills -n -C3

# Inspect resolve_mount_paths behavior
rg -n 'def resolve_mount_paths' nemo_skills -n -C5
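As a standalone sketch of the suggested string-to-list normalization (the helper name is hypothetical; lists pass through unchanged):

```python
def normalize_mount_paths(mount_paths):
    """Split a comma-separated CLI string into a clean list; pass lists through."""
    if isinstance(mount_paths, str) and mount_paths:
        return [p.strip() for p in mount_paths.split(",")]
    return mount_paths

paths = normalize_mount_paths(" /data , /results,/cache ")
unchanged = normalize_mount_paths(["/already", "/a-list"])
```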
🧹 Nitpick comments (7)
docs/evaluation/code.md (1)

215-217: Consider adding language specifiers to code blocks.

The code blocks would benefit from explicit language markers for proper syntax highlighting:

  • Line 215: Should be ```bash for the shell command
  • Line 229: Should be ```bash for the ns eval command
  • Line 248: Should be ```json for the metrics output

Apply these changes:

At line 215:

-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench

At line 229:
-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \

At line 248:

-```
+```json
 {
   "ojbench": {

Also applies to: 229-244, 248-283

nemo_skills/evaluation/evaluator/ojbench.py (1)

79-91: Consider: Original file format lost before evaluation completes.

The code overwrites the original JSONL file at line 90-91 with preprocessed samples before evaluation runs (lines 107-112). While samples remain in memory, if the process crashes or is interrupted between preprocessing and the final write (line 128-130), the original field names (question, completion) are permanently replaced with OJBench format (prompt, content).

Since past review flagged this as important ("very important to only replace files at the very end") and the preprocessing is only needed for ojbench.judge_jsonl, consider writing preprocessed samples to a temporary file and leaving the original unchanged until final results are merged.
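One way to realize this suggestion — write the judge-format JSONL to a temporary file in the same directory and replace the target only after everything has succeeded — is sketched below. The function name and the field names are illustrative, not the evaluator's actual API.

```python
import json
import os
import tempfile

def write_judge_jsonl(samples, final_path):
    """Write preprocessed samples to a temp file next to the target, then
    atomically swap it in; on any failure the original file is untouched."""
    target_dir = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as f:
            for sample in samples:
                f.write(json.dumps(sample) + "\n")
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)  # leave no temp debris behind
        raise

samples = [{"prompt": "p", "content": "c"}]
write_judge_jsonl(samples, "judge.jsonl")
with open("judge.jsonl") as f:
    reloaded = [json.loads(line) for line in f]
```

Because `os.replace` is the last step, a crash mid-evaluation leaves the original file with its original field names intact.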

nemo_skills/pipeline/start_server.py (1)

65-68: Guard against invalid/unsafe flag combo

If keep_mounts_for_sandbox is True while with_sandbox is False, the flag is meaningless and could confuse users. Add an explicit check to fail fast.

@@
     try:
         server_type = server_type.value
     except AttributeError:
         pass
+
+    # keep_mounts_for_sandbox only applies when a sandbox is launched
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
 
     log_dir = check_mounts(cluster_config, log_dir, check_mounted_paths=check_mounted_paths)
nemo_skills/pipeline/generate.py (1)

133-136: Fail fast when keep_mounts_for_sandbox is set without sandbox

Avoid silent no-ops or confusion by rejecting keep_mounts_for_sandbox unless with_sandbox is enabled.

@@
     try:
         server_type = server_type.value
     except AttributeError:
         pass
+
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
nemo_skills/pipeline/train.py (1)

250-253: Add validation for keep_mounts_for_sandbox usage

Same rationale as other CLIs: reject keep_mounts_for_sandbox when with_sandbox is False.

@@
     try:
         training_algo = training_algo.value
     except AttributeError:
         pass
+
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
nemo_skills/pipeline/run_cmd.py (1)

61-63: Optional: fix typing for server_gpus.

Default is None but annotated as int. Consider Optional[int] for clarity and static tooling.

-    server_gpus: int = typer.Option(None, help="Number of GPUs to use if hosting the model"),
+    server_gpus: int | None = typer.Option(None, help="Number of GPUs to use if hosting the model"),
nemo_skills/pipeline/utils/eval.py (1)

371-376: Fix condition/message: only warn when the benchmark requires keeping mounts.

Current check uses requires_sandbox, but the message is about “requires sandbox to keep mounts.” Use the keep_mounts_for_sandbox flag instead; also the text says “enabling it” though no state is changed here.

-            if benchmark_args.requires_sandbox and not keep_mounts_for_sandbox:
-                LOG.warning("Found benchmark (%s) which requires sandbox to keep mounts, enabling it.", benchmark)
+            if benchmark_args.keep_mounts_for_sandbox and not keep_mounts_for_sandbox:
+                LOG.warning(
+                    "Benchmark (%s) requires sandbox to keep mounts; mounts will be kept for its jobs.",
+                    benchmark,
+                )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 730efbf and bff6d70.

📒 Files selected for processing (10)
  • docs/evaluation/code.md (1 hunks)
  • nemo_skills/dataset/ojbench/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/ojbench.py (1 hunks)
  • nemo_skills/pipeline/eval.py (5 hunks)
  • nemo_skills/pipeline/generate.py (2 hunks)
  • nemo_skills/pipeline/prepare_data.py (2 hunks)
  • nemo_skills/pipeline/run_cmd.py (2 hunks)
  • nemo_skills/pipeline/start_server.py (2 hunks)
  • nemo_skills/pipeline/train.py (2 hunks)
  • nemo_skills/pipeline/utils/eval.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/ojbench.py (3)
nemo_skills/code_execution/sandbox.py (2)
  • get_sandbox (419-422)
  • close (77-79)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (35-91)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • nested_dataclass (49-82)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

204-204: Link text should be descriptive

(MD059, descriptive-link-text)


209-209: Link text should be descriptive

(MD059, descriptive-link-text)


215-215: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


229-229: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


248-248: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.1)
nemo_skills/evaluation/evaluator/ojbench.py

58-58: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


98-98: Use explicit conversion flag

Replace with conversion flag

(RUF010)


100-100: Use explicit conversion flag

Replace with conversion flag

(RUF010)


101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


115-115: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/pipeline/eval.py

339-339: Unpacked variable job_server_address is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

nemo_skills/dataset/ojbench/prepare.py

46-46: subprocess call: check for execution of untrusted input

(S603)


46-46: Starting a process with a partial executable path

(S607)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (13)
nemo_skills/dataset/ojbench/prepare.py (3)

22-28: LGTM! Token validation properly addresses past feedback.

The HF_TOKEN validation with clear error messages addresses the previous review concerns effectively.


31-57: LGTM! Token sanitization properly addresses security concerns.

The error handling correctly sanitizes the HF_TOKEN from error messages (lines 52-56), addressing previous security feedback. The static analysis warnings about subprocess calls are false positives in this context—the git executable is validated and the command construction is deterministic.


60-91: LGTM! Subdirectory approach addresses past critical issue.

The clone destination correctly uses a subdirectory OJBench_testdata (line 63), which properly addresses the previous critical feedback about not deleting the package directory. The data processing logic is clean and well-structured.

docs/evaluation/code.md (1)

285-285: LGTM! Correctly references ojbench for repeated runs.

The documentation properly uses --benchmarks=ojbench:N for repeats, which addresses previous feedback.

nemo_skills/evaluation/evaluator/ojbench.py (3)

48-67: LGTM! Error handling properly uses exceptions.

The function correctly raises RuntimeError on failures, which is better than the warning approach mentioned in past reviews. The error messages are clear and helpful for debugging.


114-122: LGTM! Error handling and validation address past feedback.

The code properly:

  • Raises RuntimeError on evaluation failure (line 115) instead of warning
  • Validates result count matches sample count (lines 120-122)
  • Uses strict=True in zip (line 124)

These changes address previous review feedback effectively.


48-67: Sandbox contexts share the same long-lived server, so the editable install persists. Closing a LocalSandbox client only tears down the HTTP session, not the server or its working directory—both contexts connect to the same process, and pip install -e remains available.

Likely an incorrect or invalid review comment.

nemo_skills/pipeline/generate.py (1)

316-316: Good: flag is threaded to task creation

The new keep_mounts_for_sandbox is correctly propagated to add_task alongside with_sandbox. No further issues here.

nemo_skills/pipeline/train.py (1)

385-385: Propagation acknowledged

keep_mounts_for_sandbox is passed through to add_task for training jobs; consistent with the pipeline.

nemo_skills/pipeline/prepare_data.py (1)

58-61: Flag threaded correctly; help text is clear.

The new keep_mounts_for_sandbox option is wired into the CLI with a clear risk warning. Good addition.

nemo_skills/pipeline/run_cmd.py (1)

91-94: Propagation of keep_mounts_for_sandbox looks good.

Option is exposed in CLI and forwarded to add_task; matches other pipelines for consistency.

Also applies to: 180-193

nemo_skills/pipeline/utils/eval.py (2)

184-187: Good: benchmark‑level keep_mounts_for_sandbox is plumbed through.

Reading KEEP_MOUNTS_FOR_SANDBOX from the module and carrying it in BenchmarkArgs looks correct.

Also applies to: 220-234


516-531: Job flag for keep_mounts_for_sandbox is correct.

Aggregating job_needs_sandbox_to_keep_mounts from per-benchmark args cleanly informs task creation.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/ojbench.py (1)

48-67: Potential issue: Editable install may not persist to evaluation sandbox.

The install_packages function creates its own sandbox context (line 51), performs an editable install of OJBench (line 60), then closes that sandbox (line 45). Later, eval_ojbench_async creates a separate sandbox context (line 78) for evaluation.

If these are different sandbox instances, the editable install (pip install -e OJBench) won't be available in the evaluation sandbox because:

  1. The cloned OJBench directory is in the first sandbox's filesystem
  2. The pip installation metadata points to that now-closed sandbox's paths

Consider one of these approaches:

  1. Perform installation once in the evaluation sandbox (recommended):
 async def eval_ojbench_async(cfg):
     eval_config = OJBenchConfig(**cfg.eval_config)
     problem_dirs = [
         Path(cfg.data_dir, "ojbench/OJBench_testdata/NOI"),
         Path(cfg.data_dir, "ojbench/OJBench_testdata/ICPC"),
     ]
 
-    await install_packages(eval_config)
-
     async with sandbox_context(eval_config.sandbox) as sandbox:
+        # Install packages in the same sandbox used for evaluation
+        LOG.info("Installing required packages for ojbench evaluation...")
+        clone_cmd = "git clone https://github.com/He-Ren/OJBench.git"
+        result, _ = await sandbox.execute_code(clone_cmd, language="shell", timeout=300)
+        if result["process_status"] != "completed":
+            stderr = result.get("stderr", "Unknown error")
+            raise RuntimeError(f"Failed to clone OJBench repo: {stderr}")
+        
+        install_cmd = "pip install -e OJBench"
+        result, _ = await sandbox.execute_code(install_cmd, language="shell", timeout=300)
+        if result["process_status"] != "completed":
+            stderr = result.get("stderr", "Unknown error")
+            raise RuntimeError(f"Failed to install ojbench. Stderr: {stderr}")
+        LOG.info("Successfully installed ojbench.")
+        
         for jsonl_file_str in unroll_files(cfg.input_files):
  2. Use a non-editable install: Change line 60 to pip install OJBench (if the package structure supports this)

  3. Ensure sandbox persistence: If the sandbox implementation supports persistent volumes via keep_mounts_for_sandbox, ensure the OJBench clone directory is mounted persistently.

🧹 Nitpick comments (5)
docs/evaluation/code.md (3)

215-217: Add language specifiers to code blocks.

The code blocks are missing language specifiers, which affects syntax highlighting and accessibility.

Apply this diff:

-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench

And for the sample run command:

-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \

Based on static analysis hints.

Also applies to: 227-242


245-264: Consider adding language specifier to results output block.

The results output block would benefit from a language specifier (e.g., text or yaml) for better rendering.

Apply this diff:

-```
+```text
 ----------------------------- ojbench -----------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy

Based on static analysis hints.


204-209: Consider more descriptive link text.

While "here" is a common pattern in this documentation, more descriptive link text improves accessibility and scannability.

Example alternatives:

Based on static analysis hints.

nemo_skills/evaluation/evaluator/ojbench.py (2)

105-105: Verify num_workers=16 is appropriate.

The code hardcodes num_workers=16 for parallel evaluation. This may be too high for resource-constrained environments or too low for high-core-count systems. Consider making this configurable through OJBenchConfig.

Example:

 @nested_dataclass(kw_only=True)
 class OJBenchConfig:
     sandbox: dict = field(default_factory=lambda: {"sandbox_type": "local"})
     timeout: int = 6
+    num_workers: int = 16

Then use eval_config.num_workers in the eval code construction.


123-125: Consider raising exception instead of continue.

When there's a length mismatch between results and samples, the code logs an error and continues to the next file. This means the job appears to succeed but silently skips files with mismatches. Consider raising an exception instead to make failures explicit and prevent incomplete results from appearing successful.

             if len(results) != len(samples):
-                LOG.error(f"Result count mismatch for {jsonl_file}: {len(results)} results vs {len(samples)} samples")
-                continue
+                raise RuntimeError(
+                    f"Result count mismatch for {jsonl_file}: "
+                    f"{len(results)} results vs {len(samples)} samples"
+                )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d310d2 and 7709ed4.

📒 Files selected for processing (4)
  • docs/evaluation/code.md (1 hunks)
  • nemo_skills/evaluation/evaluator/ojbench.py (1 hunks)
  • nemo_skills/pipeline/prepare_data.py (3 hunks)
  • nemo_skills/pipeline/utils/exp.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/utils/exp.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/ojbench.py (3)
nemo_skills/code_execution/sandbox.py (2)
  • get_sandbox (419-422)
  • close (77-79)
nemo_skills/evaluation/evaluator/code.py (1)
  • preprocess_code (35-91)
nemo_skills/utils.py (2)
  • get_logger_name (130-131)
  • nested_dataclass (49-82)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

204-204: Link text should be descriptive

(MD059, descriptive-link-text)


209-209: Link text should be descriptive

(MD059, descriptive-link-text)


215-215: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


227-227: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


245-245: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.1)
nemo_skills/evaluation/evaluator/ojbench.py

58-58: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Use explicit conversion flag

Replace with conversion flag

(RUF010)


103-103: Use explicit conversion flag

Replace with conversion flag

(RUF010)


104-104: Use explicit conversion flag

Replace with conversion flag

(RUF010)


118-118: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/pipeline/prepare_data.py (3)

30-30: LGTM: OJBench dataset added to data directory requirement list.

The addition of "ojbench" to DATASETS_REQUIRE_DATA_DIR aligns with the PR objectives and ensures that users provide a data_dir when preparing OJBench data, consistent with other large datasets like "ruler" and "ioi24".


58-61: LGTM: Security warning appropriately emphasizes the data loss risk.

The new keep_mounts_for_sandbox parameter includes a clear warning about the risks of persisting mounts when the sandbox executes LLM-generated commands. The conservative default of False is appropriate.


146-146: LGTM: Parameter correctly forwarded to _run_cmd.

The keep_mounts_for_sandbox parameter is properly forwarded to _run_cmd using explicit named argument passing, consistent with the other parameters in the call.

docs/evaluation/code.md (2)

213-219: LGTM! Clear data preparation instructions.

The documentation properly notes the required --data_dir flag, large download size (15GB), and HF_TOKEN requirement. This aligns with past review feedback.


244-244: Documentation path is accurate. The pipeline’s add_task call writes logs to <OUTPUT_DIR>/eval-results/<benchmark>/summarized-results, and the log files are prefixed main (or main_<idx>), matching main_*. No update needed.

nemo_skills/evaluation/evaluator/ojbench.py (5)

35-35: Verify timeout default is appropriate.

The default timeout of 6 seconds seems very short for code evaluation tasks. At line 113, this is multiplied by the number of samples and adds 60 seconds, but for a single sample this would only be 66 seconds total. Please confirm this is sufficient for OJBench problems, which may involve complex test cases.
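The timeout arithmetic described above (per-sample timeout scaled by batch size, plus a fixed buffer) can be sketched as:

```python
# Sketch of the dynamic-timeout calculation the review refers to:
# per-sample timeout multiplied by the number of samples, plus a
# fixed 60-second startup buffer.
def total_timeout(per_sample_timeout: int, num_samples: int, buffer: int = 60) -> int:
    return per_sample_timeout * num_samples + buffer


# With the default 6s per-sample timeout, a single sample gets 66s total.
print(total_timeout(6, 1))    # 66
print(total_timeout(6, 100))  # 660
```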


79-97: LGTM! Proper preprocessing with safe file handling.

The preprocessing correctly:

  • Writes to a separate eval-input- file instead of overwriting the original
  • Transforms fields appropriately for OJBench format
  • Applies code preprocessing with language-specific handling

This addresses previous concerns about data loss from overwriting input files.
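The safe-file-handling pattern praised here can be sketched as follows (function and field names are illustrative, not the evaluator's actual API):

```python
import json
from pathlib import Path


# Sketch: write transformed samples to a sibling "eval-input-" file so the
# original predictions file is never overwritten before evaluation succeeds.
def write_eval_input(jsonl_file: str, samples: list[dict]) -> Path:
    src = Path(jsonl_file)
    dst = src.with_name(f"eval-input-{src.name}")
    with dst.open("w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")
    return dst
```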


99-115: Good: Clean environment and proper escaping.

The evaluation execution correctly:

  • Uses env -i to ensure a clean environment
  • Applies shlex.quote for safe shell command construction
  • Calculates timeout dynamically based on number of samples
  • Sets reasonable output limits (100k characters)

Note: Static analysis suggests using !r conversion flags instead of repr() (lines 101, 103, 104), but the current approach is equally valid and more explicit.
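The quoting pattern being praised can be sketched like this (the script and flag names are illustrative, not the evaluator's real command line):

```python
import shlex


# Sketch of the safe command construction noted above: every interpolated
# path goes through shlex.quote, and `env -i` strips the inherited
# environment before the judge process starts.
def build_cmd(script: str, input_file: str, timeout_s: int) -> str:
    return (
        "env -i "
        f"python {shlex.quote(script)} "
        f"--input {shlex.quote(input_file)} "
        f"--timeout {timeout_s}"
    )


# Paths with spaces or shell metacharacters come out safely single-quoted.
print(build_cmd("run_judge.py", "eval-input-output 1.jsonl", 66))
```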


127-133: LGTM! Safe result merging and file updates.

The result merging correctly:

  • Uses strict=True with zip to catch length mismatches at runtime
  • Merges verdict and pass/fail status into samples
  • Writes back to the original file only after successful evaluation

This addresses previous review concerns about data integrity.


136-138: LGTM! Clean synchronous wrapper.

The synchronous wrapper properly uses asyncio.run() to execute the async evaluation function, following standard Python async patterns.
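For reference, the wrapper pattern looks like this (function bodies are stand-ins, not the evaluator's actual logic):

```python
import asyncio


# Stand-in for the real async evaluation coroutine.
async def eval_ojbench_async(cfg: dict) -> str:
    await asyncio.sleep(0)  # placeholder for sandboxed evaluation work
    return f"evaluated with timeout={cfg['timeout']}"


# Thin synchronous entry point: asyncio.run creates an event loop,
# drives the coroutine to completion, and tears the loop down.
def eval_ojbench(cfg: dict) -> str:
    return asyncio.run(eval_ojbench_async(cfg))
```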

@wasiahmad wasiahmad requested a review from Kipok October 1, 2025 02:48
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@Kipok Kipok merged commit bfd7d93 into main Oct 1, 2025
6 checks passed
@Kipok Kipok deleted the feat/ojbench branch October 1, 2025 04:27
wasiahmad added a commit that referenced this pull request Oct 1, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Dec 18, 2025
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>