PyPy3 execution support for LiveCodeBench evaluation by wasiahmad · Pull Request #614 · NVIDIA-NeMo/Skills

wasiahmad · 2025-07-24T20:17:25Z

This PR adds LiveCodeBench score calculation to nemo-skills. All supporting scripts are included at nemo-skills/evaluation/evaluator/livecodebench, and the evaluation function at nemo_skills/evaluation/evaluator/livecodebench.py is updated accordingly.

Summary by CodeRabbit

New Features
- New LiveCodeBench evaluator: async sandboxed runs with Python/PyPy3 support, dependency setup, per-file processing, and persisted graded outputs.
- livecodebench-pro post-processing tool to normalize evaluation input/output fields.
- Data prep CLI adds --keep_all_columns; prepared outputs now include subset_for_metrics and release_version.
Documentation
- Added end-to-end guides for data preparation, running evaluations (Python/PyPy3), verifying results, and averaging runs.

nemo_skills/evaluation/evaluator/code.py

Signed-off-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

Signed-off-by: Evelina <ebakhturina@nvidia.com>

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>

Signed-off-by: fayejf <fayejf07@gmail.com> Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

Signed-off-by: Wei Du <wedu@nvidia.com>

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com> Signed-off-by: fayejf <fayejf07@gmail.com> Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com> Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com> Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Wei Du <wedu@nvidia.com>

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

docs/evaluation/code.md (1)
185-249: Fix fenced code block languages to satisfy markdownlint.

markdownlint (MD040) flags each of these LiveCodeBench snippets because the fences lack a language hint. Tagging them with the appropriate language (bash/json) will unblock the lint job and keep syntax highlighting consistent.

Apply this diff to cover all occurrences:
-```
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05
@@
-```
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05 --keep_all_columns --cluster=<CLUSTER_NAME> --data_dir=<DATA_DIR>
@@
-```
+```bash
ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/OpenReasoning-Nemotron-32B \
@@
    ++inference.tokens_to_generate=65536
@@
-```
+```bash
 --extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
@@
-```
+```json
{
  "livecodebench": { "pass@1": { "accuracy": 71.14 ... }},
  "livecodebench-easy": { "pass@1": { "accuracy": 99.09 ... }},
  "livecodebench-hard": { "pass@1": { "accuracy": 46.30 ... }},
  "livecodebench-medium": { "pass@1": { "accuracy": 85.10 ... }}
}
@@
-```
+```bash
 --benchmarks=livecodebench:3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1ed6bb1 and 0ccde9f.

📒 Files selected for processing (1)

docs/evaluation/code.md (1 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

185-185: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

192-192: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

226-226: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

234-234: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

247-247: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com>

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

docs/evaluation/code.md (1)

186-248: Add info strings to new code fences.

All newly added code blocks (commands, JSON output) lack language specifiers, which triggers MD040 and hurts syntax highlighting. Please annotate them with something like shell for CLI commands and json for metrics snippets.
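
To find the offending fences before pushing, a rough local check can help. The sketch below is not markdownlint and not part of this PR: `fences_missing_language` is a hypothetical helper that only handles backtick fences and treats any fence line seen while inside a block as the closing fence.

```python
FENCE = "`" * 3  # built programmatically so this example stays inside one fence


def fences_missing_language(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no info string."""
    missing = []
    inside = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(FENCE):
            continue
        if inside:
            # Simplification: any fence line while inside a block closes it.
            inside = False
        else:
            inside = True
            if stripped == FENCE:
                # Opening fence with nothing after the backticks (MD040 case).
                missing.append(number)
    return missing
```

Running it over docs/evaluation/code.md should list roughly the same opening-fence lines that markdownlint reports, though the real linter remains the source of truth.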

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0ccde9f and 4970d6e.

📒 Files selected for processing (2)

docs/evaluation/code.md (1 hunks)
nemo_skills/evaluation/evaluator/__init__.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/evaluation/evaluator/__init__.py

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

185-185: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

192-192: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

226-226: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

234-234: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

247-247: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

nemo_skills/evaluation/evaluator/livecodebench.py (2)

40-71: Install packages in the same sandbox session used for evaluation.

install_packages() (lines 40-57) creates its own sandbox context, installs the LiveCodeBench library, then closes that sandbox. The actual evaluation at line 71 opens a new sandbox session. For isolated sandbox types (Docker, SSH, remote HTTP), the second sandbox starts clean and will not have access to the packages installed in the first session, causing the from livecodebench.evaluate import evaluate statement to fail at runtime.

Move the package installation inside the evaluation sandbox session:

-    if not await install_packages(eval_config):
-        return
-
     async with sandbox_context(eval_config.sandbox) as sandbox:
+        if not await install_packages(eval_config, sandbox):
+            return
+
         for jsonl_file in unroll_files(cfg.input_files):

And update install_packages to accept an active sandbox:

-async def install_packages(eval_config: LiveCodeBenchEvaluatorConfig) -> bool:
+async def install_packages(eval_config: LiveCodeBenchEvaluatorConfig, sandbox) -> bool:
     """
     Installs required packages in a temporary sandbox.
     Returns True on success, False on failure.
     """
-    async with sandbox_context(eval_config.sandbox) as sandbox:
-        LOG.info(f"Installing livecodebench with {eval_config.interpreter}...")
-        pip_cmd = "pip" if eval_config.interpreter == "python" else "pypy3 -m pip"
-        git_url = LIVECODEBENCH_PYTHON_GIT_URL if eval_config.interpreter == "python" else LIVECODEBENCH_PYPY3_GIT_URL
-        cmd = f"{pip_cmd} install {git_url}"
-
-        result, _ = await sandbox.execute_code(cmd, language="shell", timeout=300)
-        if result.get("process_status") != "completed":
-            LOG.warning(f"Failed to install livecodebench: {result.get('stderr', 'Unknown error')}")
-            return False
-
-        LOG.info("Successfully installed livecodebench.")
-        return True
+    LOG.info(f"Installing livecodebench with {eval_config.interpreter}...")
+    pip_cmd = "pip" if eval_config.interpreter == "python" else "pypy3 -m pip"
+    git_url = LIVECODEBENCH_PYTHON_GIT_URL if eval_config.interpreter == "python" else LIVECODEBENCH_PYPY3_GIT_URL
+    cmd = f"{pip_cmd} install {git_url}"
+
+    result, _ = await sandbox.execute_code(cmd, language="shell", timeout=300)
+    if result.get("process_status") != "completed":
+        LOG.warning(f"Failed to install livecodebench: {result.get('stderr', 'Unknown error')}")
+        return False
+
+    LOG.info("Successfully installed livecodebench.")
+    return True
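
The isolation concern can be illustrated without the real sandbox. The toy model below is an assumption-laden sketch: `ToySandbox` and `toy_sandbox_context` are hypothetical and do not reflect the actual nemo_skills sandbox API; they only model a backend whose state is discarded when a session closes.

```python
import asyncio
from contextlib import asynccontextmanager


class ToySandbox:
    """Hypothetical isolated sandbox: installed packages live for one session only."""

    def __init__(self):
        self.installed = set()

    async def install(self, package):
        self.installed.add(package)

    async def can_import(self, package):
        return package in self.installed


@asynccontextmanager
async def toy_sandbox_context():
    sandbox = ToySandbox()  # every session starts from a clean environment
    try:
        yield sandbox
    finally:
        sandbox.installed.clear()  # teardown discards whatever was installed


async def install_then_evaluate_separately():
    # Mirrors the flagged pattern: install in one session, evaluate in another.
    async with toy_sandbox_context() as sandbox:
        await sandbox.install("livecodebench")
    async with toy_sandbox_context() as sandbox:
        return await sandbox.can_import("livecodebench")


async def install_and_evaluate_together():
    # Mirrors the suggested fix: both steps share one session.
    async with toy_sandbox_context() as sandbox:
        await sandbox.install("livecodebench")
        return await sandbox.can_import("livecodebench")
```

Whether the real sandbox behaves this way depends on the backend; if sessions share a persistent server, the original code may work as written.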

127-129: Avoid asyncio.run when an event loop may already be running.

asyncio.run() will raise RuntimeError: asyncio.run() cannot be called from a running event loop if the caller is async-aware (e.g., Jupyter notebooks, async CLI frameworks, Triton job managers). This breaks the function in those environments.

Replace the wrapper with a version that checks for an active loop:

 def eval_livecodebench(cfg):
-    """Synchronous wrapper to run the async evaluation."""
-    asyncio.run(eval_livecodebench_async(cfg))
+    """Run the async evaluation, reusing an existing loop when present."""
+    try:
+        loop = asyncio.get_running_loop()
+    except RuntimeError:
+        asyncio.run(eval_livecodebench_async(cfg))
+    else:
+        return loop.create_task(eval_livecodebench_async(cfg))
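
The loop-detection pattern can be tried in isolation with the standard library. In this sketch `fake_eval` and `run_sync` are hypothetical stand-ins for `eval_livecodebench_async` and its wrapper, not the PR's code.

```python
import asyncio


async def fake_eval(cfg):
    """Hypothetical stand-in for the async evaluation coroutine."""
    await asyncio.sleep(0)
    return f"evaluated {cfg}"


def run_sync(cfg):
    """Use asyncio.run when no loop is running, otherwise schedule a task."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No event loop in this thread, so asyncio.run is safe to call.
        return asyncio.run(fake_eval(cfg))
    # A loop is already running: hand back a Task that the caller must await.
    return loop.create_task(fake_eval(cfg))


async def call_from_async_code():
    # asyncio.run would raise RuntimeError here; run_sync returns a Task instead.
    task = run_sync("inside-loop")
    return await task
```

One trade-off to keep in mind: the wrapper then returns a plain result in synchronous contexts and an unawaited Task in asynchronous ones, so callers need to know which case they are in.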

🧹 Nitpick comments (4)

nemo_skills/dataset/livecodebench/prepare.py (1)
154-157: Consider extracting the exception message.

The static analysis tool flags a long exception message inline. While this is a minor style issue, extracting it to a module-level constant improves maintainability.

Apply this diff:
+CUSTOM_SPLIT_ERROR_MSG = (
+    "If preparing a custom split, you must specify all "
+    "--release_version, --start_date, and --end_date arguments."
+)
+
 ...
         if args.release_version == "all" or args.start_date == "all" or args.end_date == "all":
-            raise ValueError(
-                "If preparing a custom split, you must specify all "
-                "--release_version, --start_date, and --end_date arguments."
-            )
+            raise ValueError(CUSTOM_SPLIT_ERROR_MSG)
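
An alternative to a module-level constant is to move the message into a dedicated exception class, which is the shape Ruff's TRY003 rule describes. The sketch below assumes hypothetical names (`CustomSplitError`, `check_custom_split`) and takes plain arguments instead of the `args` namespace used in prepare.py.

```python
class CustomSplitError(ValueError):
    """Raised when a custom split is requested without all three arguments."""

    def __init__(self) -> None:
        super().__init__(
            "If preparing a custom split, you must specify all "
            "--release_version, --start_date, and --end_date arguments."
        )


def check_custom_split(release_version: str, start_date: str, end_date: str) -> None:
    # Same condition as the original check: any argument left at "all" is an error.
    if "all" in (release_version, start_date, end_date):
        raise CustomSplitError
```

Because `CustomSplitError` subclasses `ValueError`, existing `except ValueError` handlers would keep working.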
nemo_skills/evaluation/evaluator/code.py (1)
152-163: Clarify the purpose and scope of eval_livecodebench_pro.

This function only transforms field names (task_id → problem_id, completion → text_response) and sets response_meta to None. It does not invoke any external evaluation library or compute correctness. The name eval_livecodebench_pro implies evaluation, but this is strictly a post-processing step.

Consider renaming to postprocess_livecodebench_pro or adding a docstring that clarifies this is a schema transformation, not an evaluation workflow:
 def eval_livecodebench_pro(cfg):
+    """Post-process LiveCodeBench-Pro samples: rename fields and add response_meta."""
     for jsonl_file in unroll_files(cfg.input_files):
Alternatively, if this function is intended to be called by a separate evaluation harness, document that expectation.
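
For reference, the transformation described above amounts to roughly the following. This is a sketch under assumptions, not the function in code.py: the helper name is hypothetical, and whether the original `task_id` and `completion` keys are dropped or kept alongside the new ones is not stated in the review.

```python
import json


def to_livecodebench_pro_schema(sample: dict) -> dict:
    """Rename fields for LiveCodeBench-Pro; no grading happens here."""
    out = dict(sample)  # copy so the caller's dict is left untouched
    out["problem_id"] = out.pop("task_id")
    out["text_response"] = out.pop("completion")
    out["response_meta"] = None
    return out


def postprocess_jsonl_lines(lines):
    """Apply the rename to each non-empty JSONL line and re-serialize it."""
    return [
        json.dumps(to_livecodebench_pro_schema(json.loads(line)))
        for line in lines
        if line.strip()
    ]
```

Nothing in this path computes correctness, which is why a name such as postprocess_livecodebench_pro or an explicit docstring would describe it more accurately.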
docs/evaluation/code.md (1)
186-248: Add language specifiers to fenced code blocks.

The markdown linter flags that fenced code blocks at lines 186, 193, 205, 227, 235, and 248 are missing language specifiers. Adding bash or shell identifiers improves syntax highlighting and readability.

Apply this diff:
-```
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05
...

-```
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05 --keep_all_columns --cluster=<CLUSTER_NAME> --data_dir=<DATA_DIR>
...

-```
+```bash
ns eval \
    ...
...

-```
+```bash
 --extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
...

-```
+```json
{
  "livecodebench": { "pass@1": { "accuracy": 71.14 ... }},
  ...
}
...

-```
+```bash
 --benchmarks=livecodebench:3
nemo_skills/evaluation/evaluator/livecodebench.py (1)
64-80: Extract long exception messages to improve maintainability.

Static analysis flags lines 64, 66, and 80 for embedding long exception messages inline. Extracting them to module-level constants improves readability and maintainability.

Apply this diff:
+INVALID_PYTHON_INTERPRETER_MSG = "Python interpreter must be 'python' or 'pypy3'."
+CPP_REQUIRES_TEST_FILE_MSG = "C++ evaluation requires a test_file."
+MIXED_RELEASE_VERSIONS_MSG = "All samples should have the same release version. Found: {versions}"
+
 ...
     if eval_config.language == "python" and eval_config.interpreter not in ["python", "pypy3"]:
-        raise ValueError("Python interpreter must be 'python' or 'pypy3'.")
+        raise ValueError(INVALID_PYTHON_INTERPRETER_MSG)
     if eval_config.language == "cpp" and eval_config.test_file is None:
-        raise ValueError("C++ evaluation requires a test_file.")
+        raise ValueError(CPP_REQUIRES_TEST_FILE_MSG)
 
     ...
 
             versions = {s["release_version"] for s in samples}
             if len(versions) > 1:
-                raise ValueError(f"All samples should have the same release version. Found: {versions}")
+                raise ValueError(MIXED_RELEASE_VERSIONS_MSG.format(versions=versions))

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4970d6e and 4042286.

📒 Files selected for processing (5)

docs/evaluation/code.md (1 hunks)
nemo_skills/dataset/livecodebench/prepare.py (6 hunks)
nemo_skills/evaluation/evaluator/__init__.py (1 hunks)
nemo_skills/evaluation/evaluator/code.py (2 hunks)
nemo_skills/evaluation/evaluator/livecodebench.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

nemo_skills/evaluation/evaluator/__init__.py

🧰 Additional context used

🧬 Code graph analysis (2)

nemo_skills/evaluation/evaluator/code.py (2)

nemo_skills/file_utils.py (1)

unroll_files (21-32)

nemo_skills/utils.py (1)

get_logger_name (130-131)

nemo_skills/evaluation/evaluator/livecodebench.py (3)

nemo_skills/code_execution/sandbox.py (2)

get_sandbox (419-422)

close (77-79)

nemo_skills/evaluation/evaluator/code.py (1)

preprocess_code (36-92)

nemo_skills/utils.py (2)

get_logger_name (130-131)

nested_dataclass (49-82)

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

186-186: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

193-193: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

205-205: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

227-227: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

235-235: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

248-248: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.13.1)

nemo_skills/dataset/livecodebench/prepare.py

154-157: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/evaluation/evaluator/livecodebench.py

64-64: Avoid specifying long messages outside the exception class

(TRY003)

66-66: Avoid specifying long messages outside the exception class

(TRY003)

80-80: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

🔇 Additional comments (4)

nemo_skills/evaluation/evaluator/code.py (1)

26-27: Import fix confirmed.

The import of unroll_files from nemo_skills.file_utils is now correct and resolves the ImportError flagged in the previous review.

nemo_skills/evaluation/evaluator/livecodebench.py (1)

103-103: LGTM: Safe command construction.

Using shlex.quote to escape the Python code passed to -c prevents shell injection and correctly handles special characters in the evaluation script.
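
A small standalone illustration of why the quoting matters. `build_command` is a hypothetical helper, not the evaluator's code; it only shows that a script containing quotes survives a round trip through shell tokenization.

```python
import shlex


def build_command(script: str, interpreter: str = "python") -> str:
    """Build a shell command that passes `script` to `<interpreter> -c` as one argument."""
    return f"{interpreter} -c {shlex.quote(script)}"


# A script with single quotes, a dollar sign, and backticks, all of which a
# POSIX shell would otherwise interpret.
script = "print('cost: $HOME `whoami`')"
command = build_command(script, interpreter="pypy3")

# shlex.split tokenizes like a POSIX shell, so the script comes back unchanged.
tokens = shlex.split(command)
```

Here `tokens` comes back as the three arguments `pypy3`, `-c`, and the original script text, with none of the special characters expanded.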
nemo_skills/dataset/livecodebench/prepare.py (2)
122-125: No downstream JSONL readers detected—output format change is self-contained.
All JSONL handling lives in prepare.py; no consumer code parses these files expecting a fixed schema.

96-97: Cast columns only when they are present.

Lines 96-97 unconditionally cast public_test_cases and private_test_cases to large_string, but when keep_all_columns=False, these columns are removed at line 98. The datasets library will raise a KeyError when attempting to cast a column that has already been removed from the dataset.

Move the casting before the removal, or conditionally cast only when keep_all_columns=True:
+    if keep_all_columns:
+        dataset = dataset.cast_column("public_test_cases", Value("large_string"))
+        dataset = dataset.cast_column("private_test_cases", Value("large_string"))
     remove_columns = []
     if not keep_all_columns:
         remove_columns = [
             "question_title",
             "contest_id",
             "metadata",
             "question_content",
             "platform",
             "question_id",
             "starter_code",
             "public_test_cases",
             "private_test_cases",
         ]
-    dataset = dataset.cast_column("public_test_cases", Value("large_string"))
-    dataset = dataset.cast_column("private_test_cases", Value("large_string"))
     dataset = dataset.map(map_fn, remove_columns=remove_columns)
Likely an incorrect or invalid review comment.

docs/evaluation/code.md

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (2)

docs/evaluation/code.md (2)

223-229: Replace PyPy3 snippet with a runnable command and use bash fence.

A full command avoids ambiguity and is immediately executable.

-##### Pypy3 Evaluation
+##### PyPy3 Evaluation
@@
-To run with the Pypy3 interpreter, modify the `--extra_eval_args` flag as shown below.
-```
---extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
-```
+To run with the PyPy3 interpreter, use:
+```bash
+ns eval \
+    --cluster=<CLUSTER_NAME> \
+    --model=nvidia/OpenReasoning-Nemotron-32B \
+    --server_type=vllm \
+    --server_args="--async-scheduling" \
+    --server_nodes=1 \
+    --server_gpus=8 \
+    --benchmarks=livecodebench \
+    --split=test_v6_2408_2505 \
+    --data_dir=<DATA_DIR> \
+    --output_dir=<OUTPUT_DIR> \
+    --extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl" \
+    --with_sandbox \
+    ++inference.temperature=0.6 \
+    ++inference.top_p=0.95 \
+    ++inference.tokens_to_generate=65536
+```

196-221: Optional: add a local (non‑Slurm) example.

Many users run locally; adding a minimal local command improves usability.

Tip: For local runs, omit --cluster and use a smaller --server_gpus (e.g., 1). Example:
```bash
ns eval \
  --model=nvidia/OpenReasoning-Nemotron-32B \
  --server_type=vllm \
  --benchmarks=livecodebench \
  --split=test_v6_2408_2505 \
  --data_dir=<DATA_DIR> \
  --output_dir=<OUTPUT_DIR> \
  --extra_eval_args="++eval_config.interpreter=python" \
  --with_sandbox
```

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 404228692685c29e5298af14b5707a0cbd0df366 and 27fde4f274ab1f6f5104a59a8a6e53bb4ed26164.

📒 Files selected for processing (1)

docs/evaluation/code.md (1 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

185-185: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

192-192: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

226-226: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

234-234: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

247-247: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

🔇 Additional comments (1)

docs/evaluation/code.md (1)

205-221: Confirm vLLM flag support and test file path

- `--async-scheduling` is forwarded verbatim by the CLI; manually verify your vLLM server image supports this flag.
- The evaluator config keys `interpreter` (used for Python vs. PyPy) and `test_file` (required for C++ runs) match those in `livecodebench.py`.
- Ensure `<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl` (i.e. `{data_dir}/livecodebench/{split}.jsonl`) aligns with the output of `ns prepare_data --keep_all_columns`.

docs/evaluation/code.md

Kipok

thanks @wasiahmad! Just a small comment, but otherwise looks good

docs/evaluation/code.md

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (6)

docs/evaluation/code.md (6)
185-187: Add bash fence to the data prep snippet.

Without a language tag markdownlint (MD040) keeps failing; please mark this command as bash.
-```
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05
189-194: Fix PyPy3 casing and add bash fence.

Line 189 should read “PyPy3”, and the shell example needs a bash fence to satisfy MD040.
-##### For Pypy3 Evaluation:
-If you plan to evaluate using the Pypy3 interpreter, you must add the `--keep_all_columns` flag during data preparation. This will download a larger dataset (~1.9GB) containing the necessary test cases. So, we recommend downloading the dataset into a Slurm cluster location.
-```
+##### For PyPy3 Evaluation:
+If you plan to evaluate using the PyPy3 interpreter, you must add the `--keep_all_columns` flag during data preparation. This will download a larger dataset (~1.9GB) containing the necessary test cases. So, we recommend downloading the dataset into a Slurm cluster location.
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05 --keep_all_columns --cluster=<CLUSTER_NAME> --data_dir=<DATA_DIR>
204-221: Mark the evaluation command as bash.

The primary eval command also needs a bash fence for linting and readability.
-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \
     --model=nvidia/OpenReasoning-Nemotron-32B \
@@
     ++inference.top_p=0.95 \
     ++inference.tokens_to_generate=65536
227-228: Add bash fence to the PyPy3 override snippet.

This single-line CLI flag should be fenced as bash to clear MD040.
-```
+```bash
 --extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
234-253: Tag the metrics block as text (lint fix).

The metrics excerpt needs a language on the fence; text works well here.
-```
+```text
 -------------------------- livecodebench --------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
 pass@1          | 454         | 15995      | 2188        | 71.15%
@@
 ----------------------- livecodebench-medium ----------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
 pass@1          | 141         | 14178      | 1889        | 85.11%
259-261: Add bash fence to the repeats snippet.

Marking this CLI flag as bash resolves MD040 and improves copy/paste.
-```
+```bash
 --benchmarks=livecodebench:3

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 27fde4f and 29dad0f.

📒 Files selected for processing (1)

docs/evaluation/code.md (1 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

185-185: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

192-192: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

226-226: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

234-234: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

259-259: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: pre-commit
GitHub Check: unit-tests

docs/evaluation/code.md

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (4)

docs/evaluation/code.md (4)
185-187: Add a bash fence to the prepare command.

Apply the markdownlint fix so the snippet lint-checks cleanly and gets syntax highlighting.
-```
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05
-```
+```
189-194: Fix PyPy3 casing and tag the command as bash.

Consistent branding and a language-tagged fence keep the docs polished.
-##### For Pypy3 Evaluation:
+##### For PyPy3 Evaluation:
@@
-```
+```bash
 ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05 --keep_all_columns --cluster=<CLUSTER_NAME> --data_dir=<DATA_DIR>
-```
+```
205-221: Annotate the full eval command as bash.

Needed for markdownlint MD040 and better UX.
-```
+```bash
 ns eval \
     --cluster=<CLUSTER_NAME> \
@@
     ++inference.top_p=0.95 \
     ++inference.tokens_to_generate=65536
-```
+```
260-261: Add a bash fence to the repeat-run flag.

Keeps lint happy and clarifies that this is a CLI flag.
-```
+```bash
 --benchmarks=livecodebench:3
-```
+```

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 29dad0f and d050d9f.

📒 Files selected for processing (1)

docs/evaluation/code.md (1 hunks)

🧰 Additional context used

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

185-185: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

192-192: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

204-204: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

226-226: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

234-234: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

259-259: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: unit-tests

coderabbitai · 2025-10-01T21:49:39Z

docs/evaluation/code.md

+--extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
+```


⚠️ Potential issue | 🟡 Minor

Tag the interpreter override snippet as bash.

Prevents MD040 lint failures.

-```
+```bash
 --extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
-```
+```

🤖 Prompt for AI Agents

In docs/evaluation/code.md around lines 227-228, the code fence containing the --extra_eval_args snippet is not tagged with a language which triggers MD040; update the opening fence to ```bash so the snippet is explicitly labeled as bash and ensure the closing ``` fence remains present and correctly placed.

coderabbitai · 2025-10-01T21:49:39Z

docs/evaluation/code.md

+-------------------------- livecodebench --------------------------
+evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
+pass@1          | 454         | 15995      | 2188        | 71.15%
+
+
+------------------------ livecodebench-easy -----------------------
+evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
+pass@1          | 110         | 5338       | 1806        | 99.09%
+
+
+------------------------ livecodebench-hard -----------------------
+evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
+pass@1          | 203         | 23031      | 2188        | 46.31%
+
+
+----------------------- livecodebench-medium ----------------------
+evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
+pass@1          | 141         | 14178      | 1889        | 85.11%
+```


⚠️ Potential issue | 🟡 Minor

Declare the metrics block as plain text.

The ASCII table isn’t JSON; marking it as text satisfies MD040 and keeps formatting intact.

-```
+```text
 -------------------------- livecodebench --------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
@@
 pass@1          | 141         | 14178      | 1889        | 85.11%
-```
+```

🤖 Prompt for AI Agents

In docs/evaluation/code.md around lines 235 to 253, the ASCII metrics table is currently in a fenced code block without a language, triggering MD040; change the fence to declare the block as plain text by adding "text" after the opening triple backticks (i.e., use ```text) and keep the closing triple backticks unchanged so the table renders as plain text and preserves formatting.

Signed-off-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Signed-off-by: Evelina <ebakhturina@nvidia.com> Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com> Signed-off-by: fayejf <fayejf07@gmail.com> Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: Hovhannes Tamoyan <htamoyan@htamoyan-mlt.client.nvidia.com> Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com> Signed-off-by: Sadegh Mahdavi <smahdavi4@gmail.com> Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: jubick1337 <mattyson.so@gmail.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Signed-off-by: i-vainn <1vanmoshkov@mail.ru> Signed-off-by: fzyzcjy <ch271828n@outlook.com> Signed-off-by: David Mosallanezhad <dmosallanezh@dmosallanezh-mlt.client.nvidia.com> Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com> Signed-off-by: alessiodevoto <devoto.alessio@gmail.com> Signed-off-by: Matvei Novikov <mattyson.so@gmail.com> Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: darraghdog <dhanley@nvidia.com> Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Signed-off-by: Igor Gitman <igitman@cs-oci-ord-login-01.cm.cluster> Signed-off-by: Adam Rajfer <arajfer@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: Vladimir Bataev <vbataev@nvidia.com> 
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Lizzie Wei <lizziew@nvidia.com> Co-authored-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Co-authored-by: Sadegh Mahdavi <smahdavi4@gmail.com> Co-authored-by: Shubham Toshniwal <shtoshni@gmail.com> Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com> Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Wei Du <wedu@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: Sean Naren <snarenthiran@nvidia.com> Co-authored-by: Hovhannes Tamoyan <hovhannes.tamoyan@gmail.com> Co-authored-by: Hovhannes Tamoyan <htamoyan@htamoyan-mlt.client.nvidia.com> Co-authored-by: Michał Bień <michal@mbien.pl> Co-authored-by: Matvei Novikov <mattyson.so@gmail.com> Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Aleksander Ficek <37374704+aleksficek@users.noreply.github.com> Co-authored-by: i-vainn <1vanmoshkov@mail.ru> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Shubham Toshniwal <stoshniwal@nvidia.com> Co-authored-by: David <dmosallanezh@nvidia.com> Co-authored-by: David Mosallanezhad <dmosallanezh@dmosallanezh-mlt.client.nvidia.com> Co-authored-by: Daria Gitman <dgitman@nvidia.com> Co-authored-by: Darragh Hanley <darraghdog@users.noreply.github.com> Co-authored-by: Wei Du <wedu@wedu-mlt.client.nvidia.com> Co-authored-by: Shantanu Acharya <shantanua@nvidia.com> Co-authored-by: Shantanu Acharya <shan.sacharya@gmail.com> Co-authored-by: smajumdar <titu1994@gmail.com> Co-authored-by: Ivan <42346810+i-vainn@users.noreply.github.com> Co-authored-by: makeshn <makesh.24@gmail.com> Co-authored-by: Sid Jain <tmfs10@gmail.com> 
Co-authored-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: Alessio Devoto <50107094+alessiodevoto@users.noreply.github.com> Co-authored-by: Xin Yu <60579067+xinyu-dev@users.noreply.github.com> Co-authored-by: Xin Yu <mightycamole@Tumole-Macbook-2024.local> Co-authored-by: Hemil Desai <hemil.desai10@gmail.com> Co-authored-by: Feng Chen <fengchen@nvidia.com> Co-authored-by: Nick Ludwig <nick.ludwig.g@gmail.com> Co-authored-by: shuoyangd <shuoyangd@users.noreply.github.com> Co-authored-by: Stephen Ge <stephen.ge@gmail.com> Co-authored-by: vmendelev <vmendelev@nvidia.com> Co-authored-by: Avinash Vem <avem@nvidia.com> Co-authored-by: Igor Gitman <igitman@cs-oci-ord-login-01.cm.cluster> Co-authored-by: Adam Rajfer <arajfer@nvidia.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Vladimir Bataev <artbataev@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Vladimir Bataev <vbataev@nvidia.com> Co-authored-by: Lizzie Wei <elizabeth.m.wei@gmail.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

Signed-off-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Signed-off-by: Evelina <ebakhturina@nvidia.com> Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com> Signed-off-by: fayejf <fayejf07@gmail.com> Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Shubham Toshniwal <shtoshni@gmail.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: Hovhannes Tamoyan <htamoyan@htamoyan-mlt.client.nvidia.com> Signed-off-by: tamohannes <hovhannes.tamoyan@gmail.com> Signed-off-by: Sadegh Mahdavi <smahdavi4@gmail.com> Signed-off-by: Michal Bien <mbien@nvidia.com> Signed-off-by: jubick1337 <mattyson.so@gmail.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Signed-off-by: i-vainn <1vanmoshkov@mail.ru> Signed-off-by: fzyzcjy <ch271828n@outlook.com> Signed-off-by: David Mosallanezhad <dmosallanezh@dmosallanezh-mlt.client.nvidia.com> Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com> Signed-off-by: alessiodevoto <devoto.alessio@gmail.com> Signed-off-by: Matvei Novikov <mattyson.so@gmail.com> Signed-off-by: Shuoyang Ding <shuoyangd@nvidia.com> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: darraghdog <dhanley@nvidia.com> Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Signed-off-by: Igor Gitman <igitman@cs-oci-ord-login-01.cm.cluster> Signed-off-by: Adam Rajfer <arajfer@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: Vladimir Bataev <vbataev@nvidia.com> 
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Lizzie Wei <lizziew@nvidia.com> Co-authored-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com> Co-authored-by: Sadegh Mahdavi <smahdavi4@gmail.com> Co-authored-by: Shubham Toshniwal <shtoshni@gmail.com> Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com> Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Wei Du <wedu@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: Sean Naren <snarenthiran@nvidia.com> Co-authored-by: Hovhannes Tamoyan <hovhannes.tamoyan@gmail.com> Co-authored-by: Hovhannes Tamoyan <htamoyan@htamoyan-mlt.client.nvidia.com> Co-authored-by: Michał Bień <michal@mbien.pl> Co-authored-by: Matvei Novikov <mattyson.so@gmail.com> Co-authored-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Aleksander Ficek <37374704+aleksficek@users.noreply.github.com> Co-authored-by: i-vainn <1vanmoshkov@mail.ru> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Shubham Toshniwal <stoshniwal@nvidia.com> Co-authored-by: David <dmosallanezh@nvidia.com> Co-authored-by: David Mosallanezhad <dmosallanezh@dmosallanezh-mlt.client.nvidia.com> Co-authored-by: Daria Gitman <dgitman@nvidia.com> Co-authored-by: Darragh Hanley <darraghdog@users.noreply.github.com> Co-authored-by: Wei Du <wedu@wedu-mlt.client.nvidia.com> Co-authored-by: Shantanu Acharya <shantanua@nvidia.com> Co-authored-by: Shantanu Acharya <shan.sacharya@gmail.com> Co-authored-by: smajumdar <titu1994@gmail.com> Co-authored-by: Ivan <42346810+i-vainn@users.noreply.github.com> Co-authored-by: makeshn <makesh.24@gmail.com> Co-authored-by: Sid Jain <tmfs10@gmail.com> 
Co-authored-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: Alessio Devoto <50107094+alessiodevoto@users.noreply.github.com> Co-authored-by: Xin Yu <60579067+xinyu-dev@users.noreply.github.com> Co-authored-by: Xin Yu <mightycamole@Tumole-Macbook-2024.local> Co-authored-by: Hemil Desai <hemil.desai10@gmail.com> Co-authored-by: Feng Chen <fengchen@nvidia.com> Co-authored-by: Nick Ludwig <nick.ludwig.g@gmail.com> Co-authored-by: shuoyangd <shuoyangd@users.noreply.github.com> Co-authored-by: Stephen Ge <stephen.ge@gmail.com> Co-authored-by: vmendelev <vmendelev@nvidia.com> Co-authored-by: Avinash Vem <avem@nvidia.com> Co-authored-by: Igor Gitman <igitman@cs-oci-ord-login-01.cm.cluster> Co-authored-by: Adam Rajfer <arajfer@nvidia.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Vladimir Bataev <artbataev@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Vladimir Bataev <vbataev@nvidia.com> Co-authored-by: Lizzie Wei <elizabeth.m.wei@gmail.com> Signed-off-by: dgitman <dgitman@nvidia.com>

wasiahmad marked this pull request as draft July 24, 2025 20:17

wasiahmad commented Jul 24, 2025

nemo_skills/evaluation/evaluator/code.py (outdated, resolved)

fchen97 and others added 28 commits August 7, 2025 13:57

Fix timeout bug for LEAN4 code execution (#647)

9dc70bd

Signed-off-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

rename folder, add datasets (#651)

27f5e44

Signed-off-by: Evelina <ebakhturina@nvidia.com>

Improve client (#652)

62d0845

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

Fixes for code execution (#656)

3806646

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add skip_special_tokens=False for completions (#657)

446ed09

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix timeout raise in sandbox

33a8baa

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix timeout for code exec

4cfd725

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update annotation

5cd44ab

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update timeout to 4 hours

8755ab5

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Online GenSelect (#655)

4e931e6

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>

Adding long context benchmark MRCR (#634)

ed68025

Signed-off-by: fayejf <fayejf07@gmail.com> Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

Fix a small bug in generation with chunks (#661)

45ac117

Signed-off-by: Feng Chen <42473790+fchen97@users.noreply.github.com>

Small fix for mrcr prepare.py (#662)

1bd199d

Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com>

Fix base checkpoints in the docs

58af382

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix formatting

6a12754

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix type mismatch for max code executions (#665)

c9d361d

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Allow generation type or custom module in eval pipeline (#666)

642b92a

Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

update grpo with megatron backend (#653)

f200cac

Signed-off-by: Wei Du <wedu@nvidia.com>

bugfix: missing generation module arg in eval pipeline cmd script (#668)

34b90a5

Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com>

add support for nsys profile (#667)

eab2d58

Signed-off-by: Wei Du <wedu@nvidia.com>

Fixing BFCL (#669)

8c27f28

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>

Minor fixes to dataset defaults (#672)

eba855e

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>

Enable system_message for openai prompt format (#670)

be737a0

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fixes for docs

e600c78

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add SWE-bench inference & evaluation (#671)

3022a9b

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

Remove prompt template (#673)

5b00e48

Signed-off-by: Igor Gitman <igitman@nvidia.com>

allow overlapping sandbox with run_cmd (#680)

86ee4e4

Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

coderabbitai bot reviewed Sep 30, 2025

activatedgeek and others added 8 commits September 30, 2025 14:28

revert nemo-rl patch (#871)

cc875fd

Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com>

Remove sharding docs (#872)

bdf6b37

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

Adding support for training with megatron-lm (#873)

65e99b2

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into feat/lcb_eval

1050ecf

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

Merge branch 'main' into feat/lcb_eval

60087ac

Evaluation on OJBench (#848)

ca09081

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing merge conflicts

9028635

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

fixing merge conflicts

4970d6e

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai bot reviewed Oct 1, 2025

resolving conflicts

4042286

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

wasiahmad force-pushed the feat/lcb_eval branch from 4970d6e to 4042286 October 1, 2025 08:30

merging

27fde4f

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

coderabbitai bot reviewed Oct 1, 2025

docs/evaluation/code.md (outdated, resolved)

coderabbitai bot reviewed Oct 1, 2025

docs/evaluation/code.md (resolved)

docs/evaluation/code.md (resolved)

docs/evaluation/code.md (resolved)

docs/evaluation/code.md (outdated, resolved)

docs/evaluation/code.md (resolved)

Kipok approved these changes Oct 1, 2025

docs/evaluation/code.md (resolved)

wasiahmad added 2 commits October 1, 2025 14:32

updating docs

3729e9a

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

Merge branch 'main' into feat/lcb_eval

29dad0f

coderabbitai bot reviewed Oct 1, 2025

docs/evaluation/code.md (outdated, resolved)

minor doc update

d050d9f

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

wasiahmad enabled auto-merge (squash) October 1, 2025 21:47

coderabbitai bot reviewed Oct 1, 2025

wasiahmad merged commit 3f6f2da into main Oct 1, 2025
5 of 6 checks passed

wasiahmad deleted the feat/lcb_eval branch October 1, 2025 21:56

coderabbitai bot mentioned this pull request Dec 5, 2025

Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summarize_robustness generic for more benchmarks, update docstrings. #1079

Merged

coderabbitai bot mentioned this pull request Dec 16, 2025

Evaluation on Livecodebench-pro #1115

Merged

coderabbitai bot mentioned this pull request Feb 11, 2026

removing datasets version restriction for LCB eval #1230

Merged

		--extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
		```

20 participants
