Conversation
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Walkthrough

Adds OJBench dataset support (prep, evaluator, metrics, docs), updates the default eval split, installs PyPy3 and judge-server in the sandbox Dockerfile, and threads a new keep_mounts_for_sandbox flag through the pipeline CLI, task creation, and data-prep tooling.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as Pipeline CLI
    participant Prep as prepare_eval_commands
    participant Manager as Task Manager (add_task)
    participant Sandbox as Sandbox Manager
    participant Eval as eval_ojbench
    participant OJ as OJBench (in-sandbox)
    participant FS as Filesystem
    User->>CLI: run eval --benchmark ojbench [--keep-mounts-for-sandbox]
    CLI->>Prep: prepare_eval_commands(keep_mounts_for_sandbox)
    Prep-->>CLI: job_batches (needs_sandbox, keep_mounts, mount_paths)
    CLI->>Manager: add_task(..., keep_mounts_for_sandbox, sandbox_mount_paths)
    Manager->>Sandbox: get_sandbox(config, keep_mounts, mount_paths)
    Sandbox-->>Manager: sandbox handle
    Manager->>Eval: eval_ojbench(cfg)
    Eval->>Sandbox: install_packages() (clone & pip install inside sandbox)
    Eval->>FS: preprocess JSONL inputs
    Eval->>OJ: run ojbench evaluator inside sandbox
    OJ-->>Eval: per-file _eval_results.json
    Eval->>FS: merge results into samples
    Eval-->>Manager: evaluation complete
    Manager->>Sandbox: release (respect keep_mounts_for_sandbox)
    Sandbox-->>Manager: released
    Manager-->>CLI: results with is_passed/metrics
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)
26-63: Fix the OJBench installer URL before exposing the evaluator

Registering `eval_ojbench` here will always surface an `ImportError`, because the implementation calls `install_from_git("git+https://github.com/He-Ren/OJBench/tree/main")`, and pip does not understand the `/tree/main` suffix. The install attempt fails every time, so the evaluator never runs. Please update the git spec to the canonical `.git@branch` form before wiring this evaluator in.

```diff
- install_from_git("git+https://github.com/He-Ren/OJBench/tree/main")
+ install_from_git("git+https://github.com/He-Ren/OJBench.git@main")
```
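The fix itself is mechanical. For illustration only, a small hypothetical helper (not part of nemo-skills or pip) that converts a GitHub web URL of the `/tree/<branch>` form into the `git+....git@<branch>` spec pip expects:

```python
import re

def to_pip_vcs_spec(url: str) -> str:
    """Hypothetical helper: turn a GitHub web URL such as
    .../OJBench/tree/main into pip's git+https://....git@<branch> form."""
    m = re.match(
        r"(?:git\+)?(https://[^/]+/[^/]+/[^/]+?)(?:\.git)?(?:/tree/(.+?))?/?$",
        url,
    )
    if not m:
        raise ValueError(f"unrecognized GitHub URL: {url}")
    repo, branch = m.groups()
    return f"git+{repo}.git@{branch}" if branch else f"git+{repo}.git"

print(to_pip_vcs_spec("git+https://github.com/He-Ren/OJBench/tree/main"))
```

With the `/tree/main` input from the review, this yields the `.git@main` spec that pip's VCS support can actually resolve.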
🧹 Nitpick comments (2)
nemo_skills/evaluation/metrics/code_metrics.py (1)
118-120: Preserve the original prediction payload in the fallback path.

Returning a bare `{"is_passed": False}` drops any auxiliary fields (token counts, timing, etc.) that downstream logic might still expect when length filtering swaps in the incorrect sample, and it also triggers the Ruff `ARG002` warning. Copy the incoming prediction before overriding `is_passed` so the structure stays intact.

```diff
-    def get_incorrect_sample(self, prediction: dict) -> dict:
-        return {"is_passed": False}
+    def get_incorrect_sample(self, prediction: dict) -> dict:
+        prediction = prediction.copy()
+        prediction["is_passed"] = False
+        return prediction
```

dockerfiles/Dockerfile.sandbox (1)
79-92: Remove duplicate LISTEN_PORT env

`LISTEN_PORT` is set twice to 6000; drop the duplicate.

Apply this diff:

```diff
-ENV LISTEN_PORT=6000
 RUN echo "uwsgi_read_timeout 14400s;" > /etc/nginx/conf.d/custom_timeout.conf
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- `dockerfiles/Dockerfile.sandbox` (1 hunks)
- `nemo_skills/dataset/livecodebench-pro/__init__.py` (1 hunks)
- `nemo_skills/dataset/ojbench/__init__.py` (1 hunks)
- `nemo_skills/dataset/ojbench/prepare.py` (1 hunks)
- `nemo_skills/evaluation/evaluator/__init__.py` (2 hunks)
- `nemo_skills/evaluation/evaluator/code.py` (2 hunks)
- `nemo_skills/evaluation/metrics/code_metrics.py` (1 hunks)
- `nemo_skills/evaluation/metrics/map_metrics.py` (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
- nemo_skills/evaluation/evaluator/code.py (1): `eval_ojbench` (350-397)

nemo_skills/evaluation/metrics/code_metrics.py (3)
- nemo_skills/evaluation/metrics/base.py (5): `BaseMetrics` (23-434), `_get_score_dict` (124-143), `get_incorrect_sample` (200-206), `update` (145-189), `_compute_pass_at_k` (352-423)
- nemo_skills/evaluation/metrics/lean4_metrics.py (3): `_get_score_dict` (23-24), `get_incorrect_sample` (26-29), `update` (46-48)
- nemo_skills/evaluation/metrics/ruler_metrics.py (3): `_get_score_dict` (19-20), `get_incorrect_sample` (26-29), `update` (22-24)

nemo_skills/evaluation/evaluator/code.py (1)
- nemo_skills/file_utils.py (1): `unroll_files` (21-32)

nemo_skills/evaluation/metrics/map_metrics.py (1)
- nemo_skills/evaluation/metrics/code_metrics.py (1): `OJBenchMetrics` (114-123)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/metrics/code_metrics.py
118-118: Unused method argument: prediction
(ARG002)
nemo_skills/evaluation/evaluator/code.py
389-389: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/dataset/ojbench/prepare.py
41-41: subprocess call: check for execution of untrusted input
(S603)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/evaluation/metrics/map_metrics.py (1)
24-58: Mapping addition looks solid.

Importing `OJBenchMetrics` and wiring it into `METRICS_MAP` keeps the registry consistent with existing entries.

nemo_skills/dataset/ojbench/__init__.py (1)

16-20: Defaults look good

These defaults align the dataset with the new OJBench evaluator and metrics wiring.
dockerfiles/Dockerfile.sandbox (1)
27-29: Pin judge-server revision and optimize installation
- Replace the hardcoded install with:

```diff
- RUN /usr/bin/pypy3 -m pip install git+https://github.com/DMOJ/judge-server.git@f098cd3a49a60186d1fadde5132329ec5f4f2213
+ ARG JUDGE_SERVER_REF=f098cd3a49a60186d1fadde5132329ec5f4f2213
+ RUN /usr/bin/pypy3 -m pip install --no-cache-dir "git+https://github.com/DMOJ/judge-server.git@${JUDGE_SERVER_REF}"
```

- Verify the evaluator actually invokes judge-server via the PyPy install (e.g. ensure PATH includes the PyPy bin or it's called with `/usr/bin/pypy3 -m judge_server`).
…_packages' was never awaited Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/evaluation/code.md (2)
208-210: Use descriptive link text.

Markdownlint flags "here" as non-descriptive. Rename the link so it tells readers what they'll open.

```diff
-- Original benchmark source is [here](https://github.com/He-Ren/OJBench/tree/main).
+- Original benchmark source is [the OJBench repository](https://github.com/He-Ren/OJBench/tree/main).
```

216-217: Add languages to fenced code blocks.

The new `ns` command snippets and the JSON metrics block need language identifiers to satisfy MD040 and improve readability.

````diff
-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench
@@
-```
+```bash
 ns eval \
@@
 ++inference.tokens_to_generate=32768
@@
-```
+```json
 {
     "ojbench": {
````

Also applies to: 229-244, 249-283
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `docs/evaluation/code.md` (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
204-204: Link text should be descriptive
(MD059, descriptive-link-text)
209-209: Link text should be descriptive
(MD059, descriptive-link-text)
215-215: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
229-229: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
248-248: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/prepare_data.py (1)
93-96: Quote `data_dir` in shell command to avoid injection/spacing issues.

`data_dir` is interpolated into a shell string without quoting; paths with spaces or malicious characters can break or be exploited.

Apply this minimal fix:

```diff
+import shlex
@@
-    if data_dir:
-        command += f" && mkdir -p {data_dir} && cp -r /nemo_run/code/nemo_skills/dataset/* {data_dir}"
+    if data_dir:
+        dd = shlex.quote(data_dir)
+        command += f" && mkdir -p {dd} && cp -r /nemo_run/code/nemo_skills/dataset/* {dd}"
```
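The effect of the suggested quoting can be seen in isolation. A minimal sketch (the `build_copy_command` helper is illustrative, not the actual `prepare_data` code) showing how `shlex.quote` neutralizes hostile path input:

```python
import shlex

def build_copy_command(data_dir: str) -> str:
    # Quote the user-supplied path so spaces and shell metacharacters
    # cannot break out of the command string.
    dd = shlex.quote(data_dir)
    return f"mkdir -p {dd} && cp -r /nemo_run/code/nemo_skills/dataset/* {dd}"

# A benign path passes through unchanged; a hostile one is wrapped in quotes.
print(build_copy_command("/data/sets"))
print(build_copy_command("/tmp/x; rm -rf ~"))
```

`shlex.quote` leaves safe paths untouched and single-quotes anything containing shell metacharacters, so the injected `; rm -rf ~` becomes inert data rather than a second command.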
♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)
355-360: `sandbox_mount_paths` likely expects a list; `mount_paths` here is a CLI string.

Passing the raw string risks downstream type errors. Convert to a list (e.g., split and strip) or pass the resolved list returned/recorded by `resolve_mount_paths`.

```diff
-            keep_mounts_for_sandbox=job_needs_sandbox_to_keep_mounts or keep_mounts_for_sandbox,
-            sandbox_mount_paths=mount_paths,
+            keep_mounts_for_sandbox=job_needs_sandbox_to_keep_mounts or keep_mounts_for_sandbox,
+            sandbox_mount_paths=[p.strip() for p in mount_paths.split(",")] if isinstance(mount_paths, str) and mount_paths else mount_paths,
```

Run to confirm the expected type for `sandbox_mount_paths` and how `resolve_mount_paths` handles `mount_paths`:

```bash
#!/bin/bash
# Inspect add_task signature and sandbox_mount_paths type
rg -nP 'def\s+add_task\(' nemo_skills -S -g '!**/venv/**' -g '!**/site-packages/**'
rg -n 'sandbox_mount_paths' nemo_skills -C3
# Inspect resolve_mount_paths behavior
rg -n 'def resolve_mount_paths' nemo_skills -C5
```
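The suggested coercion reads more clearly factored into a helper. A sketch (`normalize_mount_paths` is a hypothetical name; the type `add_task` actually expects should still be confirmed against its signature):

```python
def normalize_mount_paths(mount_paths):
    """Hypothetical coercion: turn a comma-separated CLI string like
    'src:dst,src2:dst2' into a list of entries; pass lists (or None)
    through unchanged."""
    if isinstance(mount_paths, str):
        # Drop empty fragments so a trailing comma doesn't produce "".
        return [p.strip() for p in mount_paths.split(",") if p.strip()]
    return mount_paths

print(normalize_mount_paths("/src:/dst, /data:/data"))
```

Handling `None` and already-resolved lists in the same place keeps the call site free of `isinstance` noise.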
🧹 Nitpick comments (7)
docs/evaluation/code.md (1)
215-217: Consider adding language specifiers to code blocks.The code blocks would benefit from explicit language markers for proper syntax highlighting:
- Line 215: should be `bash` for the shell command
- Line 229: should be `bash` for the ns eval command
- Line 248: should be `json` for the metrics output

Apply these changes:

At line 215:

````diff
-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench
````

At line 229:

````diff
-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \
````

At line 248:

````diff
-```
+```json
 {
     "ojbench": {
````

Also applies to: 229-244, 248-283
nemo_skills/evaluation/evaluator/ojbench.py (1)
79-91: Consider: Original file format lost before evaluation completes.

The code overwrites the original JSONL file at lines 90-91 with preprocessed samples before evaluation runs (lines 107-112). While samples remain in memory, if the process crashes or is interrupted between preprocessing and the final write (lines 128-130), the original field names (`question`, `completion`) are permanently replaced with the OJBench format (`prompt`, `content`).

Since a past review flagged this as important ("very important to only replace files at the very end") and the preprocessing is only needed for `ojbench.judge_jsonl`, consider writing preprocessed samples to a temporary file and leaving the original unchanged until final results are merged.
nemo_skills/pipeline/start_server.py (1)
65-68: Guard against invalid/unsafe flag combo

If `keep_mounts_for_sandbox` is True while `with_sandbox` is False, the flag is meaningless and could confuse users. Add an explicit check to fail fast.

```diff
@@
     try:
         server_type = server_type.value
     except AttributeError:
         pass
+
+    # keep_mounts_for_sandbox only applies when a sandbox is launched
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
     log_dir = check_mounts(cluster_config, log_dir, check_mounted_paths=check_mounted_paths)
```

nemo_skills/pipeline/generate.py (1)
133-136: Fail fast when keep_mounts_for_sandbox is set without sandbox

Avoid silent no-ops or confusion by rejecting `keep_mounts_for_sandbox` unless `with_sandbox` is enabled.

```diff
@@
     try:
         server_type = server_type.value
     except AttributeError:
         pass
+
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
```

nemo_skills/pipeline/train.py (1)
250-253: Add validation for keep_mounts_for_sandbox usage

Same rationale as the other CLIs: reject `keep_mounts_for_sandbox` when `with_sandbox` is False.

```diff
@@
     try:
         training_algo = training_algo.value
     except AttributeError:
         pass
+
+    if keep_mounts_for_sandbox and not with_sandbox:
+        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")
```

nemo_skills/pipeline/run_cmd.py (1)
61-63: Optional: fix typing for `server_gpus`.

Default is None but annotated as `int`. Consider `Optional[int]` for clarity and static tooling.

```diff
-    server_gpus: int = typer.Option(None, help="Number of GPUs to use if hosting the model"),
+    server_gpus: int | None = typer.Option(None, help="Number of GPUs to use if hosting the model"),
```

nemo_skills/pipeline/utils/eval.py (1)
371-376: Fix condition/message: only warn when the benchmark requires keeping mounts.

The current check uses `requires_sandbox`, but the message is about "requires sandbox to keep mounts." Use the `keep_mounts_for_sandbox` flag instead; also, the text says "enabling it" though no state is changed here.

```diff
-        if benchmark_args.requires_sandbox and not keep_mounts_for_sandbox:
-            LOG.warning("Found benchmark (%s) which requires sandbox to keep mounts, enabling it.", benchmark)
+        if benchmark_args.keep_mounts_for_sandbox and not keep_mounts_for_sandbox:
+            LOG.warning(
+                "Benchmark (%s) requires sandbox to keep mounts; mounts will be kept for its jobs.",
+                benchmark,
+            )
```
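Several of the nitpicks above propose the same fail-fast guard for `keep_mounts_for_sandbox` across the CLIs. Factored out, the check is a one-liner (a sketch, assuming the two booleans are already parsed; `validate_sandbox_flags` is a hypothetical helper, not existing nemo-skills code):

```python
def validate_sandbox_flags(with_sandbox: bool, keep_mounts_for_sandbox: bool) -> None:
    # keep_mounts_for_sandbox only matters when a sandbox is actually launched,
    # so reject the combination early instead of silently ignoring the flag.
    if keep_mounts_for_sandbox and not with_sandbox:
        raise ValueError("keep_mounts_for_sandbox requires --with_sandbox.")

validate_sandbox_flags(with_sandbox=True, keep_mounts_for_sandbox=True)  # ok
```

Sharing one helper across start_server, generate, and train would keep the error message and the semantics identical everywhere.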
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- `docs/evaluation/code.md` (1 hunks)
- `nemo_skills/dataset/ojbench/prepare.py` (1 hunks)
- `nemo_skills/evaluation/evaluator/ojbench.py` (1 hunks)
- `nemo_skills/pipeline/eval.py` (5 hunks)
- `nemo_skills/pipeline/generate.py` (2 hunks)
- `nemo_skills/pipeline/prepare_data.py` (2 hunks)
- `nemo_skills/pipeline/run_cmd.py` (2 hunks)
- `nemo_skills/pipeline/start_server.py` (2 hunks)
- `nemo_skills/pipeline/train.py` (2 hunks)
- `nemo_skills/pipeline/utils/eval.py` (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/ojbench.py (3)
- nemo_skills/code_execution/sandbox.py (2): `get_sandbox` (419-422), `close` (77-79)
- nemo_skills/evaluation/evaluator/code.py (1): `preprocess_code` (35-91)
- nemo_skills/utils.py (2): `get_logger_name` (130-131), `nested_dataclass` (49-82)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
204-204: Link text should be descriptive
(MD059, descriptive-link-text)
209-209: Link text should be descriptive
(MD059, descriptive-link-text)
215-215: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
229-229: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
248-248: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/evaluator/ojbench.py
58-58: Avoid specifying long messages outside the exception class
(TRY003)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Use explicit conversion flag
Replace with conversion flag
(RUF010)
100-100: Use explicit conversion flag
Replace with conversion flag
(RUF010)
101-101: Use explicit conversion flag
Replace with conversion flag
(RUF010)
115-115: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/pipeline/eval.py
339-339: Unpacked variable job_server_address is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
nemo_skills/dataset/ojbench/prepare.py
46-46: subprocess call: check for execution of untrusted input
(S603)
46-46: Starting a process with a partial executable path
(S607)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (13)
nemo_skills/dataset/ojbench/prepare.py (3)
22-28: LGTM! Token validation properly addresses past feedback.

The HF_TOKEN validation with clear error messages addresses the previous review concerns effectively.

31-57: LGTM! Token sanitization properly addresses security concerns.

The error handling correctly sanitizes the HF_TOKEN from error messages (lines 52-56), addressing previous security feedback. The static analysis warnings about subprocess calls are false positives in this context: the git executable is validated and the command construction is deterministic.

60-91: LGTM! Subdirectory approach addresses past critical issue.

The clone destination correctly uses a subdirectory `OJBench_testdata` (line 63), which properly addresses the previous critical feedback about not deleting the package directory. The data processing logic is clean and well-structured.

docs/evaluation/code.md (1)

285-285: LGTM! Correctly references ojbench for repeated runs.

The documentation properly uses `--benchmarks=ojbench:N` for repeats, which addresses previous feedback.

nemo_skills/evaluation/evaluator/ojbench.py (3)
48-67: LGTM! Error handling properly uses exceptions.

The function correctly raises `RuntimeError` on failures, which is better than the warning approach mentioned in past reviews. The error messages are clear and helpful for debugging.

114-122: LGTM! Error handling and validation address past feedback.

The code properly:
- Raises `RuntimeError` on evaluation failure (line 115) instead of warning
- Validates result count matches sample count (lines 120-122)
- Uses `strict=True` in zip (line 124)

These changes address previous review feedback effectively.

48-67: Sandbox contexts share the same long-lived server, so the editable install persists. Closing a LocalSandbox client only tears down the HTTP session, not the server or its working directory; both contexts connect to the same process, and `pip install -e` remains available.

Likely an incorrect or invalid review comment.
nemo_skills/pipeline/generate.py (1)
316-316: Good: flag is threaded to task creation

The new `keep_mounts_for_sandbox` is correctly propagated to `add_task` alongside `with_sandbox`. No further issues here.

nemo_skills/pipeline/train.py (1)

385-385: Propagation acknowledged

`keep_mounts_for_sandbox` is passed through to `add_task` for training jobs; consistent with the pipeline.

nemo_skills/pipeline/prepare_data.py (1)

58-61: Flag threaded correctly; help text is clear.

The new `keep_mounts_for_sandbox` option is wired into the CLI with a clear risk warning. Good addition.

nemo_skills/pipeline/run_cmd.py (1)

91-94: Propagation of keep_mounts_for_sandbox looks good.

Option is exposed in CLI and forwarded to `add_task`; matches other pipelines for consistency.

Also applies to: 180-193

nemo_skills/pipeline/utils/eval.py (2)

184-187: Good: benchmark-level keep_mounts_for_sandbox is plumbed through.

Reading `KEEP_MOUNTS_FOR_SANDBOX` from the module and carrying it in `BenchmarkArgs` looks correct.

Also applies to: 220-234

516-531: Job flag for keep_mounts_for_sandbox is correct.

Aggregating `job_needs_sandbox_to_keep_mounts` from per-benchmark args cleanly informs task creation.
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Actionable comments posted: 0
♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/ojbench.py (1)
48-67: Potential issue: Editable install may not persist to evaluation sandbox.

The `install_packages` function creates its own sandbox context (line 51), performs an editable install of OJBench (line 60), then closes that sandbox (line 45). Later, `eval_ojbench_async` creates a separate sandbox context (line 78) for evaluation.

If these are different sandbox instances, the editable install (`pip install -e OJBench`) won't be available in the evaluation sandbox because:

- The cloned OJBench directory is in the first sandbox's filesystem
- The pip installation metadata points to that now-closed sandbox's paths

Consider one of these approaches:

- Perform installation once in the evaluation sandbox (recommended):

```diff
 async def eval_ojbench_async(cfg):
     eval_config = OJBenchConfig(**cfg.eval_config)
     problem_dirs = [
         Path(cfg.data_dir, "ojbench/OJBench_testdata/NOI"),
         Path(cfg.data_dir, "ojbench/OJBench_testdata/ICPC"),
     ]
-    await install_packages(eval_config)
     async with sandbox_context(eval_config.sandbox) as sandbox:
+        # Install packages in the same sandbox used for evaluation
+        LOG.info("Installing required packages for ojbench evaluation...")
+        clone_cmd = "git clone https://github.com/He-Ren/OJBench.git"
+        result, _ = await sandbox.execute_code(clone_cmd, language="shell", timeout=300)
+        if result["process_status"] != "completed":
+            stderr = result.get("stderr", "Unknown error")
+            raise RuntimeError(f"Failed to clone OJBench repo: {stderr}")
+
+        install_cmd = "pip install -e OJBench"
+        result, _ = await sandbox.execute_code(install_cmd, language="shell", timeout=300)
+        if result["process_status"] != "completed":
+            stderr = result.get("stderr", "Unknown error")
+            raise RuntimeError(f"Failed to install ojbench. Stderr: {stderr}")
+        LOG.info("Successfully installed ojbench.")
+
         for jsonl_file_str in unroll_files(cfg.input_files):
```

- Use a non-editable install: change line 60 to `pip install OJBench` (if the package structure supports this)
- Ensure sandbox persistence: if the sandbox implementation supports persistent volumes via `keep_mounts_for_sandbox`, ensure the OJBench clone directory is mounted persistently.
🧹 Nitpick comments (5)
docs/evaluation/code.md (3)
215-217: Add language specifiers to code blocks.

The code blocks are missing language specifiers, which affects syntax highlighting and accessibility.

Apply this diff:

````diff
-```
+```bash
 ns prepare_data --data_dir=<DATA_DIR> --cluster=<CLUSTER_NAME> ojbench
````

And for the sample run command:

````diff
-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \
````

Based on static analysis hints.

Also applies to: 227-242
245-264: Consider adding language specifier to results output block.

The results output block would benefit from a language specifier (e.g., `text` or `yaml`) for better rendering.

Apply this diff:

````diff
-```
+```text
 ----------------------------- ojbench -----------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
````

Based on static analysis hints.
204-209: Consider more descriptive link text.

While "here" is a common pattern in this documentation, more descriptive link text improves accessibility and scannability.
Example alternatives:
- Line 204: "Original benchmark source is on HuggingFace."
- Line 209: "Original benchmark source is on GitHub."
Based on static analysis hints.
nemo_skills/evaluation/evaluator/ojbench.py (2)
105-105: Verify num_workers=16 is appropriate.

The code hardcodes `num_workers=16` for parallel evaluation. This may be too high for resource-constrained environments or too low for high-core-count systems. Consider making this configurable through `OJBenchConfig`.

Example:

```diff
 @nested_dataclass(kw_only=True)
 class OJBenchConfig:
     sandbox: dict = field(default_factory=lambda: {"sandbox_type": "local"})
     timeout: int = 6
+    num_workers: int = 16
```

Then use `eval_config.num_workers` in the eval code construction.
123-125: Consider raising exception instead of continue.

When there's a length mismatch between results and samples, the code logs an error and continues to the next file. This means the job appears to succeed but silently skips files with mismatches. Consider raising an exception instead, to make failures explicit and prevent incomplete results from appearing successful.

```diff
         if len(results) != len(samples):
-            LOG.error(f"Result count mismatch for {jsonl_file}: {len(results)} results vs {len(samples)} samples")
-            continue
+            raise RuntimeError(
+                f"Result count mismatch for {jsonl_file}: "
+                f"{len(results)} results vs {len(samples)} samples"
+            )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `docs/evaluation/code.md` (1 hunks)
- `nemo_skills/evaluation/evaluator/ojbench.py` (1 hunks)
- `nemo_skills/pipeline/prepare_data.py` (3 hunks)
- `nemo_skills/pipeline/utils/exp.py` (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/pipeline/utils/exp.py
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/ojbench.py (3)
- nemo_skills/code_execution/sandbox.py (2): `get_sandbox` (419-422), `close` (77-79)
- nemo_skills/evaluation/evaluator/code.py (1): `preprocess_code` (35-91)
- nemo_skills/utils.py (2): `get_logger_name` (130-131), `nested_dataclass` (49-82)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
204-204: Link text should be descriptive
(MD059, descriptive-link-text)
209-209: Link text should be descriptive
(MD059, descriptive-link-text)
215-215: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
227-227: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
245-245: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/evaluator/ojbench.py
58-58: Avoid specifying long messages outside the exception class
(TRY003)
64-64: Avoid specifying long messages outside the exception class
(TRY003)
101-101: Use explicit conversion flag
Replace with conversion flag
(RUF010)
103-103: Use explicit conversion flag
Replace with conversion flag
(RUF010)
104-104: Use explicit conversion flag
Replace with conversion flag
(RUF010)
118-118: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (10)
nemo_skills/pipeline/prepare_data.py (3)
30-30: LGTM: OJBench dataset added to data directory requirement list.

The addition of "ojbench" to `DATASETS_REQUIRE_DATA_DIR` aligns with the PR objectives and ensures that users provide a `data_dir` when preparing OJBench data, consistent with other large datasets like "ruler" and "ioi24".

58-61: LGTM: Security warning appropriately emphasizes the data loss risk.

The new `keep_mounts_for_sandbox` parameter includes a clear warning about the risks of persisting mounts when the sandbox executes LLM-generated commands. The conservative default of `False` is appropriate.

146-146: LGTM: Parameter correctly forwarded to _run_cmd.

The `keep_mounts_for_sandbox` parameter is properly forwarded to `_run_cmd` using explicit named argument passing, consistent with the other parameters in the call.

docs/evaluation/code.md (2)

213-219: LGTM! Clear data preparation instructions.

The documentation properly notes the required `--data_dir` flag, large download size (15GB), and HF_TOKEN requirement. This aligns with past review feedback.

244-244: Documentation path is accurate.

The pipeline's `add_task` call writes logs to `<OUTPUT_DIR>/eval-results/<benchmark>/summarized-results`, and the log files are prefixed `main` (or `main_<idx>`), matching `main_*`. No update needed.

nemo_skills/evaluation/evaluator/ojbench.py (5)
35-35: Verify timeout default is appropriate.

The default timeout of 6 seconds seems very short for code evaluation tasks. At line 113, this is multiplied by the number of samples and adds 60 seconds, but for a single sample this would only be 66 seconds total. Please confirm this is sufficient for OJBench problems, which may involve complex test cases.
79-97: LGTM! Proper preprocessing with safe file handling.

The preprocessing correctly:
- Writes to a separate `eval-input-` file instead of overwriting the original
- Transforms fields appropriately for OJBench format
- Applies code preprocessing with language-specific handling

This addresses previous concerns about data loss from overwriting input files.
99-115: Good: Clean environment and proper escaping.

The evaluation execution correctly:
- Uses `env -i` to ensure a clean environment
- Applies `shlex.quote` for safe shell command construction
- Calculates timeout dynamically based on number of samples
- Sets reasonable output limits (100k characters)

Note: Static analysis suggests using `!r` conversion flags instead of `repr()` (lines 101, 103, 104), but the current approach is equally valid and more explicit.
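For reference, the Ruff RUF010 suggestion mentioned above is purely cosmetic: the `!r` conversion flag and an inline `repr()` call produce identical strings, so either spelling is safe here. A sketch with an arbitrary value:

```python
jsonl_file = "results/output with space.jsonl"

# f"{x!r}" and f"{repr(x)}" yield the same text; RUF010 merely prefers
# the conversion-flag spelling.
via_flag = f"path={jsonl_file!r}"
via_repr = f"path={repr(jsonl_file)}"
assert via_flag == via_repr
print(via_flag)
```

Since both forms embed the quoted repr of the value, switching to `!r` would be a style-only change with no behavioral risk.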
127-133: LGTM! Safe result merging and file updates.

The result merging correctly:
- Uses `strict=True` with zip to catch length mismatches at runtime
- Merges verdict and pass/fail status into samples
- Writes back to the original file only after successful evaluation

This addresses previous review concerns about data integrity.

136-138: LGTM! Clean synchronous wrapper.

The synchronous wrapper properly uses `asyncio.run()` to execute the async evaluation function, following standard Python async patterns.
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: dgitman <dgitman@nvidia.com>
In this PR, we are adding evaluation support for OJBench through nemo-skills. Primary changes are:
Summary by CodeRabbit
New Features
Bug Fixes / Defaults
Chores
Documentation