
Evaluation on Livecodebench-pro #1115

Merged
wasiahmad merged 54 commits into main from livecodebench_pro
Dec 19, 2025
Conversation

@wasiahmad
Collaborator

@wasiahmad wasiahmad commented Dec 16, 2025

Dataset: https://huggingface.co/datasets/QAQAQAQAQ/LiveCodeBench-Pro
Test cases: https://huggingface.co/datasets/QAQAQAQAQ/LiveCodeBench-Pro-Testcase

Summary by CodeRabbit

Release Notes

  • New Features

    • Added C++ language support for LiveCodeBench Pro code evaluation with configurable sandbox environments.
    • Introduced enhanced evaluator configuration with adjustable timeout and process management settings.
  • Improvements

    • Streamlined evaluation pipeline with improved code preprocessing and result handling.

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@coderabbitai
Contributor

coderabbitai bot commented Dec 16, 2025

Walkthrough

The PR introduces LiveCodeBench-Pro evaluation support by adding configuration constants, dataset preparation utilities for downloading and processing remote test cases, and a new evaluator function with automatic dependency installation and sample processing.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Configuration Updates | nemo_skills/dataset/livecodebench-pro/__init__.py | Added new constant EVAL_SPLIT ("test_25q2"); updated GENERATION_ARGS to use cpp_codegen instead of python_codegen. |
| Dataset Preparation | nemo_skills/dataset/livecodebench-pro/prepare.py | Added download_testcases() function for remote test case downloads; added process_problem_splits() for transforming problem data into per-split JSONL files; updated main workflow to execute download and processing pipeline instead of inline dataset handling. |
| Evaluation Implementation | nemo_skills/evaluation/evaluator/code.py | Added LiveCodeBenchProEvaluatorConfig dataclass with sandbox, language, test paths, timeout, and process configuration; added eval_livecodebench_pro() function with automatic livecodebench package installation, sample preprocessing, evaluation execution, and result aggregation; extended preprocess logic to handle closing </think> tags. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Evaluator as eval_livecodebench_pro()
    participant HF as Hugging Face Hub
    participant LiveCodeBench as livecodebench lib
    participant Sandbox as Local Sandbox

    User->>Evaluator: Call with config (test_dir, language=cpp)
    Evaluator->>Evaluator: Import livecodebench
    alt Package not found
        Evaluator->>HF: Install from Git URL
        HF-->>Evaluator: Installation complete
    end
    
    Evaluator->>Evaluator: Read samples from JSONL
    Evaluator->>Evaluator: Preprocess samples (strip_whitespace=True)
    Evaluator->>Evaluator: Add code_list field per sample
    
    Evaluator->>LiveCodeBench: Call evaluate() with<br/>language, test_file, timeout
    LiveCodeBench->>Sandbox: Execute test cases<br/>(num_processes=12)
    Sandbox-->>LiveCodeBench: Test results
    LiveCodeBench-->>Evaluator: Evaluation results file
    
    Evaluator->>Evaluator: Load results & attach<br/>graded_list per sample
    Evaluator->>Evaluator: Rewrite JSONL with results
    Evaluator-->>User: Return evaluated samples
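
The read, annotate, and rewrite steps in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names, the `generation` field, and the use of `str.strip` in place of the real preprocessing are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, rows):
    """Rewrite the file with one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def add_code_list(samples, extract_code):
    """Attach a single-element code_list to each sample (assumed shape)."""
    for sample in samples:
        sample["code_list"] = [extract_code(sample["generation"])]
    return samples


# Usage with a throwaway file; str.strip stands in for the real preprocessing.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output.jsonl"
    write_jsonl(path, [{"question_id": "q1", "generation": "  int main() {}  "}])
    samples = add_code_list(read_jsonl(path), extract_code=str.strip)
    write_jsonl(path, samples)
    round_tripped = read_jsonl(path)
```

Rewriting the whole file after annotation keeps each sample and its derived fields on the same line, which is what the later grading step appears to rely on.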

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Automatic package installation logic in eval_livecodebench_pro(): Review the try-except flow for livecodebench import and Git-based installation to ensure error handling is robust.
  • Sample preprocessing pipeline: Verify the preprocess behavior changes (whitespace stripping, code_list field addition) don't conflict with existing evaluation logic.
  • File I/O operations: Check JSONL read/write sequences and the intermediate results file handling to confirm no data loss or corruption paths.
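
For the first review point, one common shape for an import-with-install fallback is sketched below. It is an assumption about the general pattern, not the PR's actual code: `import_or_install`, its `runner` parameter, and the pip spec passed in are hypothetical names chosen for illustration.

```python
import importlib
import subprocess
import sys


def import_or_install(module_name, pip_spec, runner=subprocess.run):
    """Import module_name, installing pip_spec once if the import fails."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # check=True makes subprocess.run raise CalledProcessError when pip
        # fails, so a broken install surfaces here rather than later.
        runner([sys.executable, "-m", "pip", "install", pip_spec], check=True)
        # Let the import system notice the newly installed package.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)
```

Injecting `runner` keeps the fallback testable without touching the network; a reviewer can pass a stub and confirm that a second failed import still raises `ImportError`.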

Possibly related PRs

Suggested reviewers

  • Kipok

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly relates to the main objective of the PR, which adds evaluation support for the LiveCodeBench-Pro benchmark. |

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
nemo_skills/evaluation/evaluator/code.py (1)

239-246: Consider using str | None type annotations for optional fields.

The test_file and test_dir fields default to None but are typed as str. For consistency with BaseEvaluatorConfig (which uses str | None), consider updating the type hints.

 @nested_dataclass(kw_only=True)
 class LiveCodeBenchProEvaluatorConfig(BaseEvaluatorConfig):
     sandbox: dict = field(default_factory=lambda: {"sandbox_type": "local"})
     language: str = "cpp"  # use either "python" or "cpp"
-    test_file: str = None
-    test_dir: str = None  # path to the unit tests directory
+    test_file: str | None = None
+    test_dir: str | None = None  # path to the unit tests directory
     timeout: int = 6
     num_processes: int = 12
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ddefa5 and 6f73340.

📒 Files selected for processing (3)
  • nemo_skills/dataset/livecodebench-pro/__init__.py (1 hunks)
  • nemo_skills/dataset/livecodebench-pro/prepare.py (1 hunks)
  • nemo_skills/evaluation/evaluator/code.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/evaluator/code.py (3)
nemo_skills/utils.py (1)
  • nested_dataclass (69-102)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluatorConfig (27-31)
nemo_skills/evaluation/evaluator/__init__.py (1)
  • evaluate (117-131)
🪛 Ruff (0.14.8)
nemo_skills/dataset/livecodebench-pro/prepare.py

69-69: Do not catch blind exception: Exception

(BLE001)
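
What line 69 of prepare.py actually wraps is not shown in this thread, so the following is only a generic illustration of the usual remedy for BLE001: catch the specific failure you expect instead of `Exception`. The `parse_rows` function is hypothetical.

```python
import json
import logging

LOG = logging.getLogger(__name__)


def parse_rows(lines):
    """Parse JSON lines, skipping only the failure we expect (malformed JSON)."""
    rows = []
    for number, line in enumerate(lines, start=1):
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as err:  # narrow, instead of `except Exception`
            LOG.warning("skipping line %d: %s", number, err)
    return rows
```

Any other error, such as a `TypeError` from a non-string input, still propagates, which is usually the behaviour the lint rule is trying to preserve.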

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (7)
nemo_skills/dataset/livecodebench-pro/__init__.py (1)

18-19: LGTM!

The configuration constants are consistent with the rest of the PR: EVAL_SPLIT = "test_25q2" matches the 25q2 split defined in prepare.py, and cpp_codegen aligns with the default language: str = "cpp" in LiveCodeBenchProEvaluatorConfig.

nemo_skills/dataset/livecodebench-pro/prepare.py (3)

22-29: LGTM!

The repository constants and split definitions are clear and well-structured. The tuple format (tag, split_name, expected_count) provides useful validation data.
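
A sketch of how the third tuple element could be used for validation is given below. The tag value and the expected count are made up for the example; only the `(tag, split_name, expected_count)` layout and the "25q2" split name come from this review, and `check_split_counts` is a hypothetical helper.

```python
# Hypothetical values: the tag and the count are placeholders.
SPLITS = [
    ("tag_a", "25q2", 2),
]


def check_split_counts(rows_by_split, splits=SPLITS):
    """Return human-readable mismatches between actual and expected counts."""
    problems = []
    for tag, split_name, expected_count in splits:
        actual = len(rows_by_split.get(split_name, []))
        if actual != expected_count:
            problems.append(
                f"{split_name} ({tag}): expected {expected_count}, got {actual}"
            )
    return problems
```

Returning a list of mismatches rather than raising on the first one lets a preparation script report every short split in a single run.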


32-42: LGTM!

The error handling appropriately logs the failure and re-raises to propagate the error.
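
The log-then-re-raise pattern praised here generally looks like the sketch below. `download_with_logging` and its `fetch` parameter are hypothetical stand-ins; the real download call in prepare.py is not shown in this thread.

```python
import logging

LOG = logging.getLogger(__name__)


def download_with_logging(fetch, repo_id):
    """Call fetch(repo_id); log any failure with its traceback, then re-raise."""
    try:
        return fetch(repo_id)
    except Exception:
        # LOG.exception records the active traceback at ERROR level.
        LOG.exception("failed to download test cases from %s", repo_id)
        raise  # bare raise keeps the original exception type and traceback
```

Because the exception is re-raised unchanged, callers still see the original error type and can decide whether to retry or abort.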


73-83: LGTM!

The main block properly validates the HF_TOKEN environment variable and orchestrates the two-step workflow.
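
An up-front environment check of this kind can be as small as the sketch below. `require_env` and its error message are assumptions for illustration; only the `HF_TOKEN` variable name comes from the review.

```python
import os


def require_env(name, environ=None):
    """Return the variable's value, or raise with an actionable message."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before running the preparation script"
        )
    return value
```

Calling `require_env("HF_TOKEN")` before the download step fails fast with a clear message instead of an authentication error partway through.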

nemo_skills/evaluation/evaluator/code.py (3)

125-134: LGTM!

Good defensive change to handle the edge case where the generation contains a </think> closing tag without the opening tag.
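
One way to handle an unmatched closing tag is sketched below. This is an illustration of the idea rather than the preprocess code in the PR; `strip_reasoning` is a hypothetical name, and keeping the text after the last closing tag is an assumption about the intended behaviour.

```python
def strip_reasoning(generation, end_tag="</think>"):
    """Keep only the text after the last closing tag, if one is present."""
    if end_tag in generation:
        # rpartition splits on the last occurrence, so an unmatched closing
        # tag (no opening <think>) is handled the same way as a matched pair.
        generation = generation.rpartition(end_tag)[2]
    return generation.strip()
```

Generations without any tag pass through unchanged apart from whitespace stripping, so the helper is safe to apply to non-reasoning models too.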


292-293: LGTM!

The pattern of moving the eval results file to prevent recomputation is consistent with other evaluators in this file.
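
Moving a consumed results file aside can be sketched as follows. The `archive_results` helper and the `-saved` suffix are assumptions for illustration; the actual file names used at lines 292-293 are not shown in this thread.

```python
import shutil
from pathlib import Path


def archive_results(results_path, suffix="-saved"):
    """Rename the results file once its contents have been merged elsewhere."""
    results_path = Path(results_path)
    # e.g. out_eval_results.json -> out_eval_results-saved.json
    target = results_path.with_name(
        results_path.stem + suffix + results_path.suffix
    )
    shutil.move(str(results_path), str(target))
    return target
```

After the move, nothing remains at the original path, so a later run cannot mistake a stale file for fresh output.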


285-290: Verify question_id field exists in the LiveCodeBench-Pro HuggingFace dataset.

The code at line 289 accesses sample["question_id"] to look up evaluation results in the grades dictionary. While the prepare.py script preserves all fields from the source dataset via output_record = dict(row), the presence of question_id in the original HuggingFace repository (QAQAQAQAQ/LiveCodeBench-Pro) should be explicitly confirmed in documentation or code comments to ensure the data pipeline is robust.
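
A defensive version of the lookup described above might look like this sketch. The shape of `grades` (a mapping from `question_id` to a dict holding `graded_list`) is an assumption, and `attach_grades` is a hypothetical helper rather than the code at line 289.

```python
def attach_grades(samples, grades, key="question_id"):
    """Copy graded_list from grades onto each sample, failing loudly on gaps."""
    missing = [i for i, sample in enumerate(samples) if key not in sample]
    if missing:
        raise KeyError(
            f"{len(missing)} sample(s) lack '{key}', first at index {missing[0]}"
        )
    for sample in samples:
        # Assumed shape: grades maps question_id -> {"graded_list": [...]}.
        sample["graded_list"] = grades[sample[key]]["graded_list"]
    return samples
```

Checking for the key before the loop reports how many records are affected, which is more useful than a bare `KeyError` on the first bad sample.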

@wasiahmad
Collaborator Author

@gwarmstrong this PR is ready to be merged. I have checked it by evaluating the Qwen3-30B-A3B-Thinking-2507 and Qwen3-235B-A22B-Thinking-2507 models on the LCB-Pro dataset. The results align with our expectations.

@wasiahmad wasiahmad enabled auto-merge (squash) December 18, 2025 01:02
gwarmstrong and others added 20 commits December 18, 2025 18:26
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

wasiahmad and others added 19 commits December 18, 2025 18:26
…ontainers (#1116)

…#1129)

@wasiahmad wasiahmad merged commit 7205c43 into main Dec 19, 2025
5 checks passed
@wasiahmad wasiahmad deleted the livecodebench_pro branch December 19, 2025 03:03
blahblahasdf pushed a commit to blahblahasdf/Skills that referenced this pull request Jan 8, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dlord <dlord@nvidia.com>
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
wasiahmad added a commit that referenced this pull request Feb 4, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: dgitman <dgitman@nvidia.com>