Add compute eval by blahblahasdf · Pull Request #1158 · NVIDIA-NeMo/Skills

blahblahasdf · 2026-01-08T23:37:03Z

Overview

This PR integrates ComputeEval - a benchmark for evaluating LLMs on CUDA code generation - into NeMo-Skills.

ComputeEval features:

406 (and growing) CUDA programming challenges
Functional correctness evaluation via compilation and test execution
Multiple release versions (2025-1, 2025-2, 2025-3)

Changes

Dataset Preparation - download ComputeEval problems from HuggingFace (gated dataset)
Generation Task - implement GenerationTask to generate CUDA solutions using ComputeEval's reference generation logic
Evaluator - implement BaseEvaluator using ComputeEval's evaluation logic
Metrics - compute pass@k metrics

Dependencies

Adds compute-eval package (compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770)
Note: compute-eval is a public repository but has not been published to PyPI yet

Summary by CodeRabbit

Release Notes

New Features
- Added support for the compute-eval benchmark, enabling evaluation of CUDA code generation with compilation and execution against test suites.
- Included data preparation and metrics tracking for compute-eval workflows.
Documentation
- Added comprehensive compute-eval benchmark documentation with setup, preparation, and evaluation instructions.
Tests
- Extended test coverage to include the compute-eval dataset.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: dlord <dlord@nvidia.com>

… move `ComputeEvalGenerationTask` to be more on the rails of the `process_single_datapoint` approach. Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

…#1129) Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: i-vainn <imoshkov@nvidia.com> Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Ivan <imoshkov@nvidia.com> Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com> # Conflicts: # nemo_skills/evaluation/evaluator/__init__.py

Signed-off-by: bzantium <ryumin93@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Signed-off-by: dlord <dlord@nvidia.com>

coderabbitai · 2026-01-08T23:44:16Z

📝 Walkthrough

Walkthrough

This pull request introduces compute-eval, a NVIDIA CUDA code generation benchmark for the NemoSkills evaluation framework. It adds dataset preparation, evaluation components, prompt configuration, and test coverage to support end-to-end evaluation workflows.

Changes

Cohort / File(s)	Summary
Dataset Module `nemo_skills/dataset/compute-eval/__init__.py`, `nemo_skills/dataset/compute-eval/prepare.py`	Introduces dataset module with constants (EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE, etc.) and a prepare script that downloads NVIDIA compute-eval dataset via HuggingFace, validates HF_TOKEN, formats context files as fenced code blocks, and outputs eval.jsonl.
Evaluation Framework `nemo_skills/evaluation/evaluator/compute_eval.py`, `nemo_skills/evaluation/evaluator/__init__.py`	Adds ComputeEvalEvaluator class that loads NVCC version, validates input via pydantic TypeAdapter, runs evaluate_solution in a worker thread, and returns structured results with passed/skipped/elapsed_time/build_output/test_output. Registers evaluator in EVALUATOR_CLASS_MAP.
Metrics Framework `nemo_skills/evaluation/metrics/code_metrics.py`, `nemo_skills/evaluation/metrics/map_metrics.py`	Implements ComputeEvalMetrics class deriving accuracy scores from prediction["passed"] and registers it in METRICS_MAP under "compute-eval" key.
Generation Task `nemo_skills/inference/eval/compute_eval.py`	Introduces ComputeEvalGenerationTask subclass that parses generated solutions into FileSolution objects, includes error handling for missing fields, and provides Hydra CLI with help message support.
Configuration & Documentation `nemo_skills/prompt/config/compute-eval/baseline.yaml`, `docs/evaluation/code.md`	Adds system/user prompt configuration defining CUDA/C/C++ engineer persona, build command handling, and output formatting constraints; updates evaluation docs with compute-eval benchmark setup and usage instructions.
Dependencies & Tests `requirements/main.txt`, `tests/test_datasets.py`	Adds compute-eval Git dependency to requirements and extends dataset test coverage with ("compute-eval", ["eval"]) entry.

Sequence Diagram

sequenceDiagram
    actor User
    participant PrepareScript as prepare.py
    participant HFDatasets as HuggingFace Datasets
    participant FileSystem as Local Storage
    participant Generator as ComputeEvalGenerationTask
    participant Evaluator as ComputeEvalEvaluator
    participant Metrics as ComputeEvalMetrics

    User->>PrepareScript: Run with --release
    PrepareScript->>HFDatasets: Download compute-eval dataset
    HFDatasets-->>PrepareScript: Dataset records
    PrepareScript->>FileSystem: Write eval.jsonl with context blocks
    
    User->>Generator: Generate solution (prompt + context)
    Generator->>Generator: Parse generated files
    Generator-->>User: Return FileSolution + generation metadata
    
    User->>Evaluator: eval_single(data_point)
    Evaluator->>Evaluator: Load NVCC, validate inputs
    Evaluator->>Evaluator: Run evaluate_solution in thread
    Evaluator-->>User: Return {passed, skipped, elapsed_time, build_output, test_output}
    
    User->>Metrics: update(predictions)
    Metrics->>Metrics: Extract passed from predictions
    Metrics->>Metrics: Compute accuracy & pass@k
    Metrics-->>User: Return scores

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

run GPU tests

Suggested reviewers

gwarmstrong
ludwig-n

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The PR title 'Add compute eval' is vague and generic. While technically related to the changeset, it lacks specificity about what 'compute eval' refers to and does not clearly convey that this integrates a CUDA code generation benchmark into NeMo-Skills.	Consider a more descriptive title such as 'Integrate ComputeEval benchmark for CUDA code generation' or 'Add ComputeEval dataset, generation task, and evaluator' to better communicate the scope and purpose of the integration.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In @nemo_skills/dataset/compute-eval/prepare.py:
- Around line 56-87: Add a new CLI arg via parser.add_argument("--output-dir",
type=str, default=".", help="Directory to write eval.jsonl") and replace
data_dir = Path(__file__).absolute().parent with data_dir =
Path(args.output_dir).resolve(); ensure data_dir.mkdir(parents=True,
exist_ok=True). Replace exit(1) with sys.exit(1) when HF_TOKEN is missing. For
backwards-compatible HF auth, call load_dataset("nvidia/compute-eval",
args.release, token=token) inside a try/except that catches TypeError and
retries with load_dataset(..., use_auth_token=token). Keep writing the file to
data_dir / "eval.jsonl" and keep using _format_context_files_block for records.

In @nemo_skills/evaluation/metrics/code_metrics.py:
- Around line 126-135: The linter flags an unused parameter in
get_incorrect_sample; rename the parameter prediction to _prediction (i.e., def
get_incorrect_sample(self, _prediction: dict) -> dict:) or alternatively append
an inline noqa (e.g., prediction: dict  # noqa: ARG002) to silence Ruff, and
keep the returned dict as {"passed": False}; update the function signature in
class ComputeEvalMetrics accordingly so references remain valid.

🧹 Nitpick comments (5)

nemo_skills/inference/eval/compute_eval.py (1)
38-54: Improve exception handling specificity.

The exception handling has a couple of issues:

Line 49: The KeyError handler doesn't provide context about which field is missing. The error could be from task_id not being in data_point, or from issues within _parse_solution.

Lines 52-54: The broad Exception catch might hide programming errors or unexpected issues that should fail fast rather than being caught and logged.
♻️ Suggested improvements
     async def process_single_datapoint(self, data_point, data):
         res = await super().process_single_datapoint(data_point, data)
         try:
+            # Validate required fields first
+            if "task_id" not in data_point:
+                raise ValueError("Missing required field 'task_id' in data_point")
+            
             solution = FileSolution(
                 task_id=data_point["task_id"],
                 files=_parse_solution(res["generation"]),
             )
             return {
                 "solution": solution.model_dump(),
                 "generation": res["generation"],
             }
-        except KeyError as e:
-            _LOG.error(f"Missing required field: {e}")
+        except (KeyError, ValueError) as e:
+            _LOG.error(f"Failed to create solution for task {data_point.get('task_id', 'unknown')}: {e}")
             raise
-        except Exception as e:
-            _LOG.error(f"Failed to parse solution: {e}")
+        except (TypeError, AttributeError) as e:
+            # Catch specific parsing-related errors only
+            _LOG.error(f"Failed to parse solution for task {data_point.get('task_id', 'unknown')}: {e}")
             raise
docs/evaluation/code.md (2)
184-184: Use descriptive link text instead of "here".

The link text "here" is not descriptive for accessibility and clarity. Consider using more descriptive text that indicates what the link points to.
📝 Suggested improvement
-- Original benchmark source is [here](https://github.com/NVIDIA/compute-eval).
+- Original benchmark source: [compute-eval on GitHub](https://github.com/NVIDIA/compute-eval)
Based on learnings from static analysis tools.

230-235: Add language identifier to code fence for proper syntax highlighting.

The code block showing the metrics output should specify a language identifier (e.g., json or text) for proper rendering and syntax highlighting.
📝 Suggested improvement
-```
+```text
 ---------------------------- compute-eval -----------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
 pass@1          | 50          | 8432       | 1245        | 64.00%
-```
+```
Based on learnings from static analysis tools.
nemo_skills/dataset/compute-eval/prepare.py (1)
21-54: **Make context-file fencing robust (avoid accidental termination) and avoid leading newline.** Today `_CONTEXT_FILES_BLOCK_TEMPLATE` uses unconditionally; if a context file contains triple backticks (rare, but possible), the prompt block will break.
Proposed diff
-_CONTEXT_FILES_BLOCK_TEMPLATE = """
---- file: {path}
-```{fence}
-{content}
-```
-"""
+_CONTEXT_FILES_BLOCK_TEMPLATE = """--- file: {path}
+````{fence}
+{content}
+````
+"""
nemo_skills/evaluation/evaluator/compute_eval.py (1)
35-44: Initialization logic is sound.

The NVCC version check appropriately prevents instantiation when the CUDA Toolkit is unavailable. The version parsing and storage are correct.

Static analysis suggests extracting the error message to reduce complexity, but the current inline message is clear and acceptable:
♻️ Optional: Extract error message to exception class or constant

If you prefer to address the static analysis hint:
+_NVCC_NOT_FOUND_MSG = "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
+
 class ComputeEvalEvaluator(BaseEvaluator):
     _installed_ctk_major: int
     _installed_ctk_minor: int

     def __init__(self, config: dict, num_parallel_requests=10):
         super().__init__(config, num_parallel_requests)
         nvcc_version = get_nvcc_version()
         if not nvcc_version:
-            raise RuntimeError(
-                "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
-            )
+            raise RuntimeError(_NVCC_NOT_FOUND_MSG)

         self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)
Based on static analysis hint TRY003.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52dbc32 and 9a4f608.

📒 Files selected for processing (11)

docs/evaluation/code.md
nemo_skills/dataset/compute-eval/__init__.py
nemo_skills/dataset/compute-eval/prepare.py
nemo_skills/evaluation/evaluator/__init__.py
nemo_skills/evaluation/evaluator/compute_eval.py
nemo_skills/evaluation/metrics/code_metrics.py
nemo_skills/evaluation/metrics/map_metrics.py
nemo_skills/inference/eval/compute_eval.py
nemo_skills/prompt/config/compute-eval/baseline.yaml
requirements/main.txt
tests/test_datasets.py

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-11-23T17:56:57.556Z

Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.

Applied to files:

requirements/main.txt

🧬 Code graph analysis (4)

nemo_skills/inference/eval/compute_eval.py (3)

nemo_skills/inference/generate.py (2)

GenerateSolutionsConfig (87-253)

GenerationTask (260-791)

nemo_skills/inference/model/__init__.py (1)

server_params (141-145)

nemo_skills/utils.py (3)

get_help_message (341-361)

get_logger_name (39-43)

setup_logging (105-140)

nemo_skills/evaluation/evaluator/compute_eval.py (2)

nemo_skills/evaluation/evaluator/base.py (1)

BaseEvaluator (34-91)

nemo_skills/utils.py (1)

get_logger_name (39-43)

nemo_skills/evaluation/metrics/map_metrics.py (1)

nemo_skills/evaluation/metrics/code_metrics.py (1)

ComputeEvalMetrics (126-135)

nemo_skills/evaluation/evaluator/__init__.py (1)

nemo_skills/evaluation/evaluator/compute_eval.py (1)

ComputeEvalEvaluator (31-85)

🪛 markdownlint-cli2 (0.18.1)

docs/evaluation/code.md

184-184: Link text should be descriptive

(MD059, descriptive-link-text)

230-230: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)

nemo_skills/evaluation/evaluator/compute_eval.py

39-41: Avoid specifying long messages outside the exception class

(TRY003)

59-65: Consider moving this statement to an else block

(TRY300)

76-76: Do not catch blind exception: Exception

(BLE001)

nemo_skills/evaluation/metrics/code_metrics.py

130-130: Unused method argument: prediction

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: unit-tests
GitHub Check: pre-commit

🔇 Additional comments (13)

nemo_skills/inference/eval/compute_eval.py (1)

60-73: LGTM! Clean Hydra integration.

The CLI setup properly handles help messages before Hydra initialization and sets up logging appropriately. The pattern follows best practices for Hydra-based command-line tools.

tests/test_datasets.py (1)

62-62: LGTM! Dataset entry follows the established pattern.

The new compute-eval dataset entry is correctly formatted and positioned in the list.

nemo_skills/evaluation/evaluator/__init__.py (1)

30-30: LGTM! Evaluator registration follows the established pattern.

The ComputeEvalEvaluator is correctly imported and registered in EVALUATOR_CLASS_MAP, following the same pattern as other class-based evaluators. The validation logic at lines 76-84 ensures no conflicts between class and function maps.

Also applies to: 73-73

docs/evaluation/code.md (1)

181-240: Excellent documentation coverage!

The compute-eval section provides comprehensive guidance including prerequisites, data preparation, evaluation steps, result verification, and important security considerations. The structure is clear and follows the pattern established for other benchmarks in this document.

requirements/main.txt (1)

17-17: Git dependency commit exists and package structure is valid.

The commit 2d14770 is verified in the NVIDIA/compute-eval repository, pinned to the main branch as a stable reference. The package includes a proper pyproject.toml configuration and is accessible for installation.

nemo_skills/evaluation/metrics/map_metrics.py (2)

25-33: ComputeEvalMetrics import wiring looks correct.
No concerns with the added import and it follows the existing pattern for code-related metrics.

45-76: Metric key "compute-eval" is correctly registered in METRICS_MAP.
This should make get_metrics("compute-eval") work as expected.

nemo_skills/dataset/compute-eval/__init__.py (1)

15-19: Constants are properly defined and consumed by the pipeline. The module nemo_skills.inference.eval.compute_eval exists and is dynamically loaded via get_arg_from_module_or_dict() in nemo_skills/pipeline/utils/eval.py. The GENERATION_ARGS string is passed through without transformation.

nemo_skills/prompt/config/compute-eval/baseline.yaml (1)

1-65: The prompt specifies // file: format, but the parser implementation cannot be verified in this environment.

The baseline.yaml prompt (lines 22–27) clearly specifies the output format: "Each file must be in its own fenced code block, with the first line indicating its path as a comment" with the example // file: geodistance.cu.

However, the actual parsing is delegated to an external library function (compute_eval.generate_completions._parse_solution()) imported in nemo_skills/inference/eval/compute_eval.py (line 43). Since the compute_eval library is not available for inspection in this environment, I cannot definitively confirm that _parse_solution() actually accepts the // file: format as written in the prompt.

To resolve this, verify that compute_eval._parse_solution() accepts the // file: comment convention by checking the library's source code or running a test with the exact format specified in the prompt.

nemo_skills/evaluation/evaluator/compute_eval.py (4)

26-28: LGTM: Logger and type adapter setup.

The logger setup follows the nemo_skills pattern, and the TypeAdapter configuration with discriminated unions is the correct approach for validating problem and solution types dynamically.

45-65: Evaluation logic correctly implemented.

The async evaluation pattern using asyncio.to_thread properly delegates the synchronous evaluate_solution call to a worker thread, preventing blocking of the async event loop. The pydantic validation ensures type safety for problem and solution inputs.

66-85: Exception handling provides pipeline resilience.

The dual exception handlers (KeyError for missing fields, generic Exception for all other failures) ensure the evaluation pipeline continues processing even when individual samples fail. Both handlers return consistent error payloads that maintain the expected structure.

The broad exception catching (flagged by static analysis BLE001) is intentional per the noinspection comment and appropriate for an evaluation pipeline where continuing to process other samples is more valuable than failing fast.

Note: The KeyError handler catches missing "problem" or "solution" keys, while pydantic ValidationError exceptions (from invalid data structure) are caught by the generic Exception handler. This distinction allows different error messages but both produce the same error payload structure.

18-21: All imported APIs (CudaCppProblem, CudaPythonProblem, FileSolution, PatchSolution, evaluate_solution, get_nvcc_version, parse_semver) are present and stable at the pinned commit 2d14770 of the compute-eval package. No compatibility issues detected.

coderabbitai · 2026-01-08T23:44:19Z

nemo_skills/dataset/compute-eval/prepare.py

+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Download and prepare nvidia/compute-eval dataset")
+    parser.add_argument(
+        "--release",
+        type=str,
+        default=None,
+        help="Release to download (e.g., '2025-1', '2025-2'). If not specified, downloads default release.",
+    )
+
+    args = parser.parse_args()
+
+    token = os.getenv("HF_TOKEN", None)
+    if not token:
+        print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
+        exit(1)
+
+    dataset = load_dataset("nvidia/compute-eval", args.release, token=token)
+    data_dir = Path(__file__).absolute().parent
+    data_dir.mkdir(exist_ok=True)
+
+    with open(data_dir / "eval.jsonl", "wt", encoding="utf-8") as f:
+        for item in dataset["eval"]:
+            record = {
+                "problem": item,
+                "task_id": item["task_id"],
+                "problem_prompt": item["prompt"],
+                "build_command": item["build_command"],
+                "context_files_block": _format_context_files_block(item["context_files"]),
+            }
+
+            # Dumping using default=str to handle datetime serialization from the problem records
+            f.write(json.dumps(record, default=str) + "\n")


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# 1) Find datasets version constraints rg -n '(?:^|\b)datasets\b' -S requirements.txt pyproject.toml setup.cfg setup.py nemo_skills | head -n 200

Repository: NVIDIA-NeMo/Skills

Length of output: 15038

🏁 Script executed:

# 2) Search for eval.jsonl usage in the codebase rg -n --fixed-strings 'eval.jsonl' nemo_skills tests 2>/dev/null | head -n 200

Repository: NVIDIA-NeMo/Skills

Length of output: 298

🏁 Script executed:

# 3) Find the actual prepare.py file to confirm current implementation fd -e py -type f -name "prepare.py" | xargs cat -n 2>/dev/null

Repository: NVIDIA-NeMo/Skills

Length of output: 232

🏁 Script executed:

# Read the prepare.py file in compute-eval cat -n nemo_skills/dataset/compute-eval/prepare.py

Repository: NVIDIA-NeMo/Skills

Length of output: 3524

🏁 Script executed:

# Find pyproject.toml or other version files find . -name "pyproject.toml" -o -name "requirements*.txt" -o -name "setup.py" -o -name "setup.cfg" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 220

🏁 Script executed:

# Look for where eval.jsonl is actually used/imported rg -n 'eval\.jsonl' nemo_skills tests 2>/dev/null

Repository: NVIDIA-NeMo/Skills

Length of output: 298

🏁 Script executed:

# Check pyproject.toml for datasets version cat pyproject.toml | grep -A 5 -B 5 datasets

Repository: NVIDIA-NeMo/Skills

Length of output: 44

🏁 Script executed:

# Also check if there's a requirements file at root ls -la requirements*.txt 2>/dev/null || echo "No requirements files at root"

Repository: NVIDIA-NeMo/Skills

Length of output: 91

🌐 Web query:

datasets library token= vs use_auth_token= API compatibility versions

💡 Result:

Short answer

The datasets library deprecated use_auth_token in favor of token in version 2.14.0; use_auth_token is scheduled for removal in 3.0.0. Use token=<str|bool> for authentication going forward. [1][2]

Compatibility notes and advice

Older code that passes use_auth_token may still work on older datasets releases, but can cause errors in newer releases because load_dataset forwards unknown kwargs into builder/configs (there have been bugs and fixes to strip the deprecated argument). If you maintain code that must run across multiple datasets versions, switch to token and/or detect both names and pass token to the call. [2][3]

If you call functions that accept DownloadConfig/BuilderConfig/ push_to_hub, ensure you pass token (or set HF token via environment/HF CLI) because some internals expect token now. [2][4]

Example (portable pattern)

Preferred (new):
load_dataset(..., token="hf_XXXX")

Backward-compatible wrapper:
def load_ds(*args, **kwargs):
if "use_auth_token" in kwargs and "token" not in kwargs:
kwargs["token"] = kwargs.pop("use_auth_token")
return load_dataset(*args, **kwargs)

References
[1] Hugging Face datasets docs — loading methods (deprecation note: use_auth_token deprecated in 2.14.0).
[2] Hugging Face datasets docs (package reference / load_dataset parameters).
[3] GH issue/PR and fixes discussing forwarding/stripping deprecated use_auth_token and related compatibility bugs.

🏁 Script executed:

# Read the full pyproject.toml to see datasets constraints cat pyproject.toml

Repository: NVIDIA-NeMo/Skills

Length of output: 2343

🏁 Script executed:

# Check the main requirements file cat requirements/main.txt | grep -i datasets

Repository: NVIDIA-NeMo/Skills

Length of output: 107

🏁 Script executed:

# Also check if there are any other requirements files that might specify datasets find requirements -type f -name "*.txt" | xargs grep -l datasets 2>/dev/null

Repository: NVIDIA-NeMo/Skills

Length of output: 105

🏁 Script executed:

# Check stem.txt to see if it has more specific datasets constraints cat requirements/stem.txt | grep -i datasets

Repository: NVIDIA-NeMo/Skills

Length of output: 90

🏁 Script executed:

# Search for any try/except or version detection around load_dataset calls rg -n "load_dataset" nemo_skills/dataset/compute-eval/prepare.py -A 3 -B 3

Repository: NVIDIA-NeMo/Skills

Length of output: 507

🏁 Script executed:

# Check other prepare.py files to see how they handle load_dataset and token/use_auth_token rg -n "load_dataset" nemo_skills/dataset -l | head -5 | xargs -I {} bash -c 'echo "=== {} ===" && cat {} | grep -A 2 "load_dataset"' 2>/dev/null | head -60

Repository: NVIDIA-NeMo/Skills

Length of output: 1336

🏁 Script executed:

# Check if datasets versions support token= parameter # Search for any version-specific handling in the codebase rg -n "use_auth_token" nemo_skills

Repository: NVIDIA-NeMo/Skills

Length of output: 44

🏁 Script executed:

# Check if the compute-eval dataset is actually used elsewhere or if eval.jsonl is read somewhere rg -n "compute-eval" nemo_skills --type py

Repository: NVIDIA-NeMo/Skills

Length of output: 721

🏁 Script executed:

# Check if there's a minimum datasets version constraint elsewhere grep -r "datasets" requirements/ nemo_skills --include="*.txt" --include="*.py" | grep -E "(>=|==|<=)"

Repository: NVIDIA-NeMo/Skills

Length of output: 547

🏁 Script executed:

# Verify if eval.jsonl is read from the package directory somewhere rg -n "compute-eval.*eval\.jsonl" nemo_skills

Repository: NVIDIA-NeMo/Skills

Length of output: 44

🏁 Script executed:

# Check how the dataset is actually used after preparation rg -n "eval\.jsonl" -B 3 -A 3 nemo_skills

Repository: NVIDIA-NeMo/Skills

Length of output: 1190

Add --output-dir argument and handle backwards-compatible HF authentication.

Writing to Path(__file__).parent fails in installed environments (package directories are typically read-only). Additionally, while the code currently uses token=, the datasets<4 constraint lacks a lower bound, so users with older versions may need use_auth_token= for compatibility. Add --output-dir argument (defaulting to current working directory), use sys.exit() instead of exit(), and use try/except to handle both authentication parameter names across datasets versions.

Proposed diff

import argparse import json import os +import sys from pathlib import Path from datasets import load_dataset @@ if __name__ == "__main__": parser = argparse.ArgumentParser(description="Download and prepare nvidia/compute-eval dataset") parser.add_argument( "--release", type=str, default=None, help="Release to download (e.g., '2025-1', '2025-2'). If not specified, downloads default release.", ) + parser.add_argument( + "--output-dir", + type=Path, + default=Path.cwd(), + help="Directory to write eval.jsonl into (default: current working directory).", + ) args = parser.parse_args() token = os.getenv("HF_TOKEN", None) if not token: print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.") - exit(1) + sys.exit(1) - dataset = load_dataset("nvidia/compute-eval", args.release, token=token) - data_dir = Path(__file__).absolute().parent - data_dir.mkdir(exist_ok=True) + try: + dataset = load_dataset("nvidia/compute-eval", args.release, token=token) + except TypeError: + # Backward compatibility with older `datasets` versions. + dataset = load_dataset("nvidia/compute-eval", args.release, use_auth_token=token) + + data_dir: Path = args.output_dir + data_dir.mkdir(parents=True, exist_ok=True) with open(data_dir / "eval.jsonl", "wt", encoding="utf-8") as f: for item in dataset["eval"]: record = { "problem": item, "task_id": item["task_id"], "problem_prompt": item["prompt"], "build_command": item["build_command"], "context_files_block": _format_context_files_block(item["context_files"]), } @@ f.write(json.dumps(record, default=str) + "\n")

🤖 Prompt for AI Agents

In @nemo_skills/dataset/compute-eval/prepare.py around lines 56 - 87, Add a new CLI arg via parser.add_argument("--output-dir", type=str, default=".", help="Directory to write eval.jsonl") and replace data_dir = Path(__file__).absolute().parent with data_dir = Path(args.output_dir).resolve(); ensure data_dir.mkdir(parents=True, exist_ok=True). Replace exit(1) with sys.exit(1) when HF_TOKEN is missing. For backwards-compatible HF auth, call load_dataset("nvidia/compute-eval", args.release, token=token) inside a try/except that catches TypeError and retries with load_dataset(..., use_auth_token=token). Keep writing the file to data_dir / "eval.jsonl" and keep using _format_context_files_block for records.

coderabbitai · 2026-01-08T23:44:19Z

nemo_skills/evaluation/metrics/code_metrics.py

+class ComputeEvalMetrics(BaseMetrics):
+    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
+        return {"accuracy": prediction["passed"]}
+
+    def get_incorrect_sample(self, prediction: dict) -> dict:
+        return {"passed": False}
+
+    def update(self, predictions):
+        super().update(predictions)
+        self._compute_pass_at_k(predictions=predictions)


⚠️ Potential issue | 🟡 Minor

Fix Ruff ARG002: unused prediction arg in get_incorrect_sample.
Rename to _prediction (or add an inline # noqa: ARG002) to keep lint clean.

Proposed diff

class ComputeEvalMetrics(BaseMetrics): def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]: return {"accuracy": prediction["passed"]} - def get_incorrect_sample(self, prediction: dict) -> dict: + def get_incorrect_sample(self, _prediction: dict) -> dict: return {"passed": False}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

class ComputeEvalMetrics(BaseMetrics):

def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:

return {"accuracy": prediction["passed"]}

def get_incorrect_sample(self, prediction: dict) -> dict:

return {"passed": False}

def update(self, predictions):

super().update(predictions)

self._compute_pass_at_k(predictions=predictions)

class ComputeEvalMetrics(BaseMetrics):

def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:

return {"accuracy": prediction["passed"]}

def get_incorrect_sample(self, _prediction: dict) -> dict:

return {"passed": False}

def update(self, predictions):

super().update(predictions)

self._compute_pass_at_k(predictions=predictions)

🧰 Tools

🪛 Ruff (0.14.10)

130-130: Unused method argument: prediction

(ARG002)

🤖 Prompt for AI Agents

In @nemo_skills/evaluation/metrics/code_metrics.py around lines 126 - 135, The linter flags an unused parameter in get_incorrect_sample; rename the parameter prediction to _prediction (i.e., def get_incorrect_sample(self, _prediction: dict) -> dict:) or alternatively append an inline noqa (e.g., prediction: dict # noqa: ARG002) to silence Ruff, and keep the returned dict as {"passed": False}; update the function signature in class ComputeEvalMetrics accordingly so references remain valid.

coderabbitai · 2026-01-08T23:44:19Z

nemo_skills/inference/eval/compute_eval.py

+# noinspection PyProtectedMember
+from compute_eval.generate_completions import _parse_solution


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if compute_eval is an external package or part of the repo find . -name "compute_eval" -type d 2>/dev/null | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 44

🏁 Script executed:

# Also search for any requirements or dependencies files git ls-files | grep -E "requirements|setup.py|pyproject.toml|poetry.lock|Pipfile" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 368

🏁 Script executed:

# Look at the specific file mentioned in the review cat -n nemo_skills/inference/eval/compute_eval.py | head -30

Repository: NVIDIA-NeMo/Skills

Length of output: 1306

🏁 Script executed:

# Search for _parse_solution usage in the codebase rg "_parse_solution" -A 2 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 851

🏁 Script executed:

# Check if there are other imports from compute_eval in the codebase rg "from compute_eval|import compute_eval" -A 2 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1305

🏁 Script executed:

# Check requirements files for compute_eval dependency and version rg "compute.eval|compute-eval" requirements/ pyproject.toml --no-filename

Repository: NVIDIA-NeMo/Skills

Length of output: 131

🏁 Script executed:

# Check if there's any documentation or comments about _parse_solution rg "_parse_solution|parse_solution" -A 3 -B 3

Repository: NVIDIA-NeMo/Skills

Length of output: 1195

🏁 Script executed:

# Look for any imports from generate_completions module rg "from compute_eval.generate_completions|import.*generate_completions" -A 5 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 693

🏁 Script executed:

# Search for any documentation or comments about what _parse_solution does rg "FileSolution|files=" nemo_skills/inference/eval/compute_eval.py -A 5 -B 5

Repository: NVIDIA-NeMo/Skills

Length of output: 841

🏁 Script executed:

# Check git history for when this import was added git log --oneline --all -S "_parse_solution" 2>/dev/null | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 148

🏁 Script executed:

# Look for any test files or examples that might show FileSolution usage fd -e py . | xargs rg "FileSolution" -l

Repository: NVIDIA-NeMo/Skills

Length of output: 153

🏁 Script executed:

# Check the compute_eval.py evaluator file to see if it has similar needs cat -n nemo_skills/evaluation/evaluator/compute_eval.py | grep -A 20 -B 5 "FileSolution\|PatchSolution" | head -40

Repository: NVIDIA-NeMo/Skills

Length of output: 1806

Avoid importing protected members from external libraries.

Importing _parse_solution (a protected member indicated by the leading underscore) from compute_eval.generate_completions creates a fragile dependency on internal implementation details. If the compute-eval library refactors or renames this function, your code will break. Note that the codebase is pinned to a specific commit (git+https://github.com/NVIDIA/compute-eval.git@2d14770), which provides some stability, but this protected API could still change in future versions.

Consider one of these approaches:

Request a public API from the compute-eval maintainers to expose this parsing functionality

Vendor the parsing logic if the library doesn't provide a stable public API

Check for public alternatives in the compute-eval package (note: the evaluator uses only public compute_eval APIs, suggesting this parsing function may not have a public equivalent)

Improve exception handling during inference. Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com>

greptile-apps

Greptile Overview

Greptile Summary

This PR successfully integrates ComputeEval, a benchmark for evaluating LLMs on CUDA code generation, into the NeMo-Skills framework. The implementation follows established patterns in the codebase and provides a complete end-to-end workflow.

What Was Added

Dataset Integration:

Dataset preparation script that downloads problems from HuggingFace (gated dataset requiring authentication)
Formats context files, build commands, and problem prompts into the NeMo-Skills data format
Creates eval.jsonl in the standard dataset directory structure

Generation Pipeline:

ComputeEvalGenerationTask extends GenerationTask to handle CUDA solution generation
Uses compute-eval's _parse_solution function to parse LLM outputs into structured FileSolution objects
Properly handles parsing errors by setting solution to None and logging errors

Evaluation System:

ComputeEvalEvaluator implements BaseEvaluator for async evaluation
Validates CUDA Toolkit version (requires CUDA 12+) at initialization
Uses compute-eval's evaluate_solution function to compile and test CUDA code
Comprehensive error handling for missing fields, validation errors, and execution failures

Metrics & Configuration:

ComputeEvalMetrics follows the same pattern as other code benchmarks
Computes pass@k metrics and accuracy
Detailed prompt configuration with instructions for CUDA/C/C++ code generation

Architecture & Integration

The implementation properly integrates with NeMo-Skills architecture:

Registered in EVALUATOR_CLASS_MAP for class-based evaluation
Registered in METRICS_MAP for metrics computation
Added to test coverage in test_datasets.py
Comprehensive documentation with prerequisites, setup, and usage examples

Code Quality

Follows existing codebase patterns (similar to IOI, ICPC evaluators)
Proper error handling with try-except blocks and error logging
Uses Pydantic TypeAdapters for robust data validation
Async/await pattern for concurrent evaluation
Dependencies pinned to specific commit hash for reproducibility

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation is well-structured, follows established patterns in the codebase, includes proper error handling, comprehensive documentation, and test coverage. The code integrates cleanly with existing NeMo-Skills infrastructure without breaking changes. All components follow best practices seen in similar evaluators (IOI, ICPC, livecodebench).
No files require special attention - all implementations follow codebase conventions and include appropriate error handling

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/dataset/compute-eval/prepare.py	4/5	Dataset preparation script that downloads compute-eval from HuggingFace and formats context files; relies on HF_TOKEN environment variable
nemo_skills/evaluation/evaluator/compute_eval.py	4/5	Evaluator implementation that validates CUDA toolkit version, evaluates solutions using compute-eval library with proper error handling
nemo_skills/evaluation/metrics/code_metrics.py	5/5	Added ComputeEvalMetrics class following same pattern as other code metrics (e.g., HumanEvalInfillingMetrics)
nemo_skills/inference/eval/compute_eval.py	4/5	Generation task that parses LLM output into FileSolution format using compute-eval's _parse_solution; handles parsing errors gracefully
nemo_skills/prompt/config/compute-eval/baseline.yaml	5/5	Prompt configuration for CUDA code generation with clear instructions on build commands, context files, and output format

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as ns CLI
    participant Prepare as prepare.py
    participant HF as HuggingFace
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant ParseSolution as _parse_solution
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval library
    participant Metrics as ComputeEvalMetrics

    Note over User,CLI: Data Preparation Phase
    User->>CLI: ns prepare_data compute-eval --release 2025-1
    CLI->>Prepare: Execute prepare.py
    Prepare->>Prepare: Validate authentication token
    Prepare->>HF: load_dataset with authentication
    HF-->>Prepare: Return dataset with eval split
    Prepare->>Prepare: Format context files & build commands
    Prepare->>Prepare: Write eval.jsonl
    Prepare-->>User: Dataset ready

    Note over User,CLI: Inference Phase
    User->>CLI: ns eval --benchmarks=compute-eval
    CLI->>GenTask: Initialize with prompt config
    GenTask->>GenTask: Load eval.jsonl
    loop For each problem
        GenTask->>LLM: Generate CUDA solution
        LLM-->>GenTask: Return generated code
        GenTask->>ParseSolution: Parse solution into FileSolution
        ParseSolution-->>GenTask: Return structured solution
        GenTask->>GenTask: Write to output file
    end

    Note over User,CLI: Evaluation Phase
    CLI->>Evaluator: Initialize ComputeEvalEvaluator
    Evaluator->>Evaluator: Check nvcc version (CUDA Toolkit 12+)
    loop For each generated solution
        Evaluator->>Evaluator: Validate problem & solution models
        Evaluator->>ComputeEval: evaluate_solution(problem, solution)
        ComputeEval->>ComputeEval: Compile CUDA code
        ComputeEval->>ComputeEval: Run test cases
        ComputeEval-->>Evaluator: Return graded result (passed/failed)
        Evaluator->>Evaluator: Update output with evaluation results
    end

    Note over User,CLI: Metrics Calculation Phase
    CLI->>Metrics: Calculate metrics from evaluation results
    Metrics->>Metrics: Compute pass@k metrics
    Metrics->>Metrics: Calculate accuracy
    Metrics-->>User: Display metrics.json

Signed-off-by: George Armstrong <georgea@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

Overview

This PR successfully integrates ComputeEval - a CUDA code generation benchmark - into NeMo-Skills with comprehensive support for dataset preparation, generation, evaluation, and metrics tracking.

Key Changes

Dataset Preparation (prepare.py): Downloads ComputeEval problems from HuggingFace (gated dataset) and creates eval.jsonl with problem metadata, build commands, and context files.
Generation Task (inference/eval/compute_eval.py): Implements ComputeEvalGenerationTask that extends GenerationTask to parse LLM-generated CUDA solutions using compute-eval's parsing utility.
Evaluator (evaluation/evaluator/compute_eval.py): Implements ComputeEvalEvaluator that validates solutions and executes them against test suites, with NVCC version checking and comprehensive error handling.
Metrics (evaluation/metrics/code_metrics.py): Implements ComputeEvalMetrics for pass@k computation following established patterns.
Documentation (docs/evaluation/code.md): Comprehensive guide including security warnings about executing machine-generated code, setup requirements (NVIDIA GPU with CUDA 12+), data preparation, and evaluation workflows.
Configuration: Baseline prompt template providing detailed instructions for CUDA code generation, config registration in evaluator/metrics maps, and test coverage.

Architecture

The implementation follows established NeMo-Skills patterns:

Evaluator classes inherit from BaseEvaluator and implement async eval_single()
Metrics classes implement _get_score_dict() and call _compute_pass_at_k()
Generation tasks override process_single_datapoint() for custom solution parsing
Clean integration with existing evaluator/metrics registration systems

Issues Found

Critical Issues

Missing import in prepare.py: Uses exit(1) without importing sys module. Should use sys.exit(1) (lines 14-70).

Minor Issues

Error handling in generation task: When solution parsing fails, returns solution: None which will be validated by the evaluator. While error handling catches this, could be more explicit about failure states.
Misplaced comment in evaluator: The # noinspection PyBroadException comment (line 46) is placed above a specific KeyError handler when it should apply to the broad Exception handler on line 76.

Strengths

Proper NVCC version validation ensuring CUDA Toolkit 12+ compatibility
Comprehensive exception handling at both generation and evaluation stages
Security documentation warning about executing machine-generated code
Follows established NeMo-Skills architectural patterns
Test coverage included in test_datasets.py
External dependency properly pinned to specific commit (not yet on PyPI)

Dependencies

Adds compute-eval from NVIDIA GitHub repository (public but not on PyPI). This is appropriate given the specialized nature of the CUDA benchmark.

Confidence Score: 3/5

This PR has a critical bug (missing sys import) that will cause runtime failure and should be fixed before merging. Otherwise, the integration is architecturally sound with good error handling.
The score reflects one critical syntax error that prevents the prepare.py script from running (missing sys import), which is a blocker. The evaluator and metrics implementation are solid and follow established patterns. The generation task has appropriate error handling. The critical issue must be fixed, and minor issues (misplaced comment, ambiguous error handling) should be addressed for code quality.
nemo_skills/dataset/compute-eval/prepare.py requires fixing the sys import issue before merge. nemo_skills/inference/eval/compute_eval.py could benefit from clarifying error handling. nemo_skills/evaluation/evaluator/compute_eval.py has a misplaced comment that should be corrected.

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/dataset/compute-eval/prepare.py	2/5	Script downloads compute-eval dataset from HuggingFace and prepares JSONL file. Contains critical bug: calls `exit(1)` without importing `sys` module on line 70.
nemo_skills/evaluation/evaluator/compute_eval.py	4/5	ComputeEvalEvaluator correctly implements BaseEvaluator with proper async handling, NVCC version checking, and comprehensive exception handling. No issues found.
nemo_skills/inference/eval/compute_eval.py	3/5	Generation task extends GenerationTask and processes datapoint to parse solutions. Includes error handling but solution=None could cause issues downstream during evaluation.
nemo_skills/evaluation/metrics/code_metrics.py	5/5	ComputeEvalMetrics correctly implements metrics computation with pass@k support following established patterns. No issues found.

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py as prepare.py<br/>(Dataset)
    participant generate as compute_eval.py<br/>(Generation)
    participant LLM as LLM Server
    participant evaluator as ComputeEvalEvaluator
    participant solver as execute_solution<br/>(compute-eval)

    User->>prepare.py: ns prepare_data compute-eval
    prepare.py->>prepare.py: Load from HF (gated)
    prepare.py->>prepare.py: Format context files
    prepare.py->>prepare.py: Write eval.jsonl

    User->>generate: ns eval --benchmarks=compute-eval
    generate->>generate: Read eval.jsonl entries
    generate->>generate: Fill prompt template
    generate->>LLM: Send CUDA problem
    LLM->>generate: Return generated code
    generate->>generate: Parse solution files
    generate->>generate: Write to output.jsonl

    User->>evaluator: Run batch evaluation
    evaluator->>evaluator: Check NVCC version
    evaluator->>evaluator: Validate problem/solution
    evaluator->>solver: execute_solution()
    solver->>solver: Compile code
    solver->>solver: Run tests
    solver->>evaluator: Return results
    evaluator->>evaluator: Compute pass@k metrics
    evaluator->>evaluator: Write results

greptile-apps · 2026-01-09T22:55:22Z

nemo_skills/dataset/compute-eval/prepare.py

+import argparse
+import json
+import os
+from pathlib import Path
+
+from datasets import load_dataset


The sys module needs to be imported to use sys.exit(). Currently, line 70 calls exit(1) which is a builtin but not recommended. The script should import sys and use sys.exit(1) for consistency with Python best practices.

Suggested change

import argparse

import json

import os

from pathlib import Path

from datasets import load_dataset

import argparse

import json

import os

import sys

from pathlib import Path

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps · 2026-01-09T22:55:23Z

nemo_skills/dataset/compute-eval/prepare.py

+    token = os.getenv("HF_TOKEN", None)
+    if not token:
+        print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
+        exit(1)


Missing import for sys module. The code calls exit(1) on line 70, but the standard Python idiom is to use sys.exit(). Add import sys to the imports at the top of the file (after line 14).

Suggested change

token = os.getenv("HF_TOKEN", None)

if not token:

print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")

exit(1)

import sys

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps · 2026-01-09T22:55:24Z

nemo_skills/dataset/compute-eval/prepare.py

+    token = os.getenv("HF_TOKEN", None)
+    if not token:
+        print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
+        exit(1)


Replace exit(1) with sys.exit(1) after importing sys module at the top of the file. The builtin exit() function is intended for interactive mode, while sys.exit() is the standard for scripts.

Suggested change

exit(1)

sys.exit(1)

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps · 2026-01-09T22:55:25Z

nemo_skills/inference/eval/compute_eval.py

+        except KeyError as e:
+            _LOG.error(f"Missing required field: {e}")
+            return {
+                "solution": None,
+                "generation": res.get("generation", ""),
+                "error": f"Missing required field: {e}",
+            }
+        except Exception as e:
+            _LOG.error(f"Failed to parse solution: {e}")
+            return {
+                "solution": None,
+                "generation": res.get("generation", ""),
+                "error": f"Failed to parse solution: {e}",
+            }


When solution parsing fails, the code returns solution: None. However, the evaluator will receive this None value and attempt to validate it with _SOLUTION_ADAPTER.validate_python(data_point["solution"]). While the evaluator's exception handling will catch this, it would be clearer to prevent None solutions from reaching the evaluator. Consider either: (1) not including the solution key when it's None so evaluator can skip evaluation, or (2) including an explicit error indicator that the evaluator can check before attempting validation. This improves clarity of failure modes.

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps · 2026-01-09T22:55:26Z

nemo_skills/evaluation/evaluator/compute_eval.py

+        # noinspection PyBroadException
+        try:
+            problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])


The # noinspection PyBroadException comment on line 46 is placed above the specific KeyError handler, but the broad exception handler is actually on line 76. The comment should be moved to line 76 where the broad Exception catch occurs, or removed if not needed here.

Suggested change

# noinspection PyBroadException

try:

problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])

async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:

try:

problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])

solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])

greptile-apps

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a benchmark for evaluating LLMs on CUDA code generation, into the NeMo-Skills framework. The implementation includes:

Components Added:

Data Preparation (nemo_skills/dataset/compute-eval/prepare.py): Downloads ComputeEval dataset from HuggingFace (gated, requires HF_TOKEN), formats problem prompts, build commands, and context files into eval.jsonl
Generation Task (nemo_skills/inference/eval/compute_eval.py): Extends GenerationTask to generate CUDA solutions, parses generated code using compute-eval's _parse_solution, and creates FileSolution objects
Evaluator (nemo_skills/evaluation/evaluator/compute_eval.py): Extends BaseEvaluator to evaluate solutions by compiling CUDA code and running test suites via compute-eval library
Metrics (nemo_skills/evaluation/metrics/code_metrics.py): ComputeEvalMetrics class computes pass@k and accuracy metrics
Prompt Template (nemo_skills/prompt/config/compute-eval/baseline.yaml): Comprehensive system prompt for CUDA/C++ code generation with instructions on output format, build commands, and context files
Documentation (docs/evaluation/code.md): Complete setup and usage instructions with prerequisites (CUDA Toolkit 12+), security warnings about executing machine-generated code

Architecture:
The implementation follows NeMo-Skills patterns with proper registration in EVALUATOR_CLASS_MAP and METRICS_MAP. Data flows from HuggingFace → prepare.py → eval.jsonl → GenerationTask → LLM → Evaluator → compute-eval library → Metrics.

Critical Issue Found:
ComputeEvalEvaluator has a bug where it doesn't convert the config dict to a typed config object. BaseEvaluator.eval_full() expects to access self.config.input_file as an attribute, but ComputeEvalEvaluator stores config as a plain dict, which will cause AttributeError at runtime. This needs to be fixed by creating a ComputeEvalEvaluatorConfig class (like other evaluators do) and converting the dict to this object in init.

Strengths:

Comprehensive error handling in both generation and evaluation
Proper async/await patterns with asyncio.to_thread for blocking operations
Clear documentation with security warnings
Follows existing codebase patterns for metrics and dataset structure

Confidence Score: 2/5

This PR contains a critical bug that will cause runtime failures when the evaluator runs in batch mode
Score of 2 reflects a critical logic error in ComputeEvalEvaluator that will cause AttributeError when eval_full() is called. The evaluator doesn't follow the pattern used by other evaluators (MathEvaluator, BirdEvaluator, CodeExecEvaluator, etc.) of converting the config dict to a typed config object. BaseEvaluator.eval_full() attempts to access self.config.input_file as an attribute, but since self.config is stored as a plain dict, this will fail. This is a blocking issue that prevents the batch evaluation functionality from working. The rest of the implementation is solid with proper error handling, async patterns, and comprehensive documentation.
Pay close attention to nemo_skills/evaluation/evaluator/compute_eval.py - it needs a config class and proper conversion in init to match other evaluators' patterns

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/evaluation/evaluator/compute_eval.py	2/5	Evaluator implementation has critical bug: missing config class conversion that will cause AttributeError in eval_full()
nemo_skills/inference/eval/compute_eval.py	4/5	Generation task properly inherits from GenerationTask, handles solution parsing with error handling
nemo_skills/dataset/compute-eval/prepare.py	4/5	Data preparation script downloads dataset and formats it correctly for evaluation pipeline
nemo_skills/evaluation/metrics/code_metrics.py	5/5	Added ComputeEvalMetrics class following existing pattern, correctly implements pass@k computation
docs/evaluation/code.md	5/5	Comprehensive documentation added with clear setup, preparation, and evaluation instructions

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareData as prepare.py
    participant HuggingFace as HuggingFace Dataset
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant Metrics as ComputeEvalMetrics

    User->>PrepareData: ns prepare_data compute-eval
    PrepareData->>HuggingFace: load_dataset("nvidia/compute-eval")
    HuggingFace-->>PrepareData: Return dataset items
    PrepareData->>PrepareData: Format context_files_block
    PrepareData->>PrepareData: Write eval.jsonl with problem, task_id, etc.

    User->>GenerationTask: ns eval --benchmarks=compute-eval
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Generate CUDA code with prompt template
        LLM-->>GenerationTask: Return generated code
        GenerationTask->>GenerationTask: Parse solution using _parse_solution
        GenerationTask->>GenerationTask: Create FileSolution object
        opt If eval during generation
            GenerationTask->>Evaluator: eval_single(data_point)
            Evaluator->>ComputeEval: evaluate_solution(problem, solution)
            ComputeEval->>ComputeEval: Compile CUDA code
            ComputeEval->>ComputeEval: Run test suite
            ComputeEval-->>Evaluator: Return passed, skipped, build/test output
            Evaluator-->>GenerationTask: Return evaluation results
        end
        GenerationTask->>GenerationTask: Write results to output file
    end

    opt If batch evaluation
        User->>Evaluator: Run evaluation
        Evaluator->>Evaluator: Load output file
        loop For each result
            Evaluator->>ComputeEval: evaluate_solution(problem, solution)
            ComputeEval-->>Evaluator: Return evaluation results
        end
        Evaluator->>Evaluator: Update file with evaluation results
    end

    User->>Metrics: Calculate metrics
    Metrics->>Metrics: Load evaluation results
    Metrics->>Metrics: Compute pass@k, accuracy
    Metrics-->>User: Display metrics.json

greptile-apps · 2026-01-09T22:58:28Z

nemo_skills/evaluation/evaluator/compute_eval.py

+    def __init__(self, config: dict, num_parallel_requests=10):
+        super().__init__(config, num_parallel_requests)
+        nvcc_version = get_nvcc_version()
+        if not nvcc_version:
+            raise RuntimeError(
+                "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
+            )
+
+        self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)


ComputeEvalEvaluator is missing a config class and doesn't convert the dict config to an object, which will cause AttributeError when BaseEvaluator.eval_full() tries to access self.config.input_file.

Following the pattern from other evaluators (MathEvaluator, BirdEvaluator, CodeExecEvaluator, etc.), you should:

Create a ComputeEvalEvaluatorConfig class that extends BaseEvaluatorConfig

In init, convert the config dict to this object: self.eval_config = ComputeEvalEvaluatorConfig(**self.config)

Update self.config to be the eval_config object so BaseEvaluator.eval_full() can access config.input_file

Suggested change

def __init__(self, config: dict, num_parallel_requests=10):

super().__init__(config, num_parallel_requests)

nvcc_version = get_nvcc_version()

if not nvcc_version:

raise RuntimeError(

"NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."

)

self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)

def __init__(self, config: dict, num_parallel_requests=10):

super().__init__(config, num_parallel_requests)

# Convert dict config to typed config object for attribute access

from nemo_skills.evaluation.evaluator.base import BaseEvaluatorConfig

from nemo_skills.utils import nested_dataclass

@nested_dataclass(kw_only=True)

class ComputeEvalEvaluatorConfig(BaseEvaluatorConfig):

pass

self.config = ComputeEvalEvaluatorConfig(**self.config)

nvcc_version = get_nvcc_version()

if not nvcc_version:

raise RuntimeError(

"NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."

)

Signed-off-by: dlord <dlord@nvidia.com>

greptile-apps · 2026-01-09T23:48:32Z

Greptile Overview

Greptile Summary

This PR successfully integrates the ComputeEval benchmark for evaluating LLMs on CUDA code generation tasks. The implementation follows NeMo-Skills architectural patterns and adds all necessary components:

What Changed

Dataset Integration: Adds nemo_skills/dataset/compute-eval/ with data preparation script that downloads problems from HuggingFace and formats them with context files
Custom Generation Task: Implements ComputeEvalGenerationTask that parses generated CUDA code using compute-eval's solution parser
Evaluator: Implements ComputeEvalEvaluator as a class-based evaluator that compiles and executes CUDA code using the compute-eval library
Metrics: Adds ComputeEvalMetrics for computing pass@k accuracy metrics
Configuration: Includes prompt template with detailed CUDA coding instructions
Documentation: Comprehensive docs with setup, evaluation steps, and security warnings
Dependencies: Adds compute-eval package pinned to specific commit

Integration Quality

The integration is well-structured and follows existing patterns in the codebase. All registration points are correctly updated (EVALUATOR_CLASS_MAP, METRICS_MAP, test datasets). The documentation is thorough and includes important security notes about executing machine-generated code.

Key Observations

Security: The documentation appropriately warns about executing machine-generated CUDA code and recommends sandboxed environments. The evaluation relies on the compute-eval library's built-in security measures.
Dependency on Internal API: The generation task uses _parse_solution from compute-eval (a protected member), which is acknowledged with a comment. This is mitigated by pinning to a specific git commit.
Error Handling: Both generation and evaluation have comprehensive try-catch blocks for handling failures. However, the evaluator could be slightly more efficient by checking for None solutions before attempting validation.
Testing: Appropriately excludes compute-eval from automated GPU tests due to CUDA Toolkit requirements, but includes it in dataset validation tests.

Confidence Score: 4/5

This PR is safe to merge with minor code quality improvements recommended
The implementation is solid and follows established patterns in the codebase. All integration points are correctly wired, documentation is comprehensive, and error handling is in place. The score is 4 rather than 5 due to two minor style issues: (1) use of a protected API from compute-eval that could be fragile, and (2) the evaluator could handle None solutions more gracefully. Both issues have workarounds in place (pinned dependency and exception handling), so they don't block merging but would benefit from cleanup.
Files nemo_skills/evaluation/evaluator/compute_eval.py and nemo_skills/inference/eval/compute_eval.py have minor style improvements noted in comments, but no critical issues

Important Files Changed

File Analysis

Filename	Score	Overview
nemo_skills/evaluation/evaluator/compute_eval.py	4/5	Implements ComputeEvalEvaluator for CUDA code evaluation. Clean implementation but could handle None solution values more gracefully before validation.
nemo_skills/inference/eval/compute_eval.py	4/5	Implements generation task for ComputeEval. Uses protected API _parse_solution from compute-eval library which may be fragile.
nemo_skills/dataset/compute-eval/prepare.py	5/5	Dataset preparation script that downloads and formats ComputeEval problems from HuggingFace. Implementation is clean and follows existing patterns.
nemo_skills/evaluation/metrics/code_metrics.py	5/5	Adds ComputeEvalMetrics class for computing pass@k metrics. Simple implementation that correctly inherits from BaseMetrics.
nemo_skills/prompt/config/compute-eval/baseline.yaml	5/5	Prompt configuration for CUDA code generation. Well-structured with clear instructions for models to generate compilable CUDA code.

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareData as ns prepare_data
    participant HuggingFace as HuggingFace Dataset
    participant Eval as ns eval
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant Parser as _parse_solution
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute_eval library
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    User->>PrepareData: Run ns prepare_data compute-eval
    PrepareData->>HuggingFace: load_dataset(nvidia/compute-eval)
    HuggingFace-->>PrepareData: Return problem data
    PrepareData->>PrepareData: Format context files & prompts
    PrepareData-->>User: Generate eval.jsonl

    User->>Eval: Run ns eval --benchmarks=compute-eval
    Eval->>GenTask: Initialize generation task
    GenTask->>LLM: Send prompts with CUDA problems
    LLM-->>GenTask: Return generated CUDA code
    GenTask->>Parser: Parse solution with _parse_solution
    alt Parsing succeeds
        Parser-->>GenTask: Return FileSolution object
        GenTask->>GenTask: Create solution dict
    else Parsing fails
        Parser-->>GenTask: Raise exception
        GenTask->>GenTask: Set solution = None, log error
    end

    GenTask->>Evaluator: Pass data_point with solution
    Evaluator->>Evaluator: Check nvcc version
    alt Solution is not None
        Evaluator->>ComputeEval: evaluate_solution(problem, solution)
        ComputeEval->>NVCC: Compile CUDA code
        alt Compilation succeeds
            NVCC-->>ComputeEval: Binary compiled
            ComputeEval->>ComputeEval: Execute tests
            ComputeEval-->>Evaluator: Return graded results
        else Compilation fails
            NVCC-->>ComputeEval: Build errors
            ComputeEval-->>Evaluator: Return failed result
        end
    else Solution is None
        Evaluator->>Evaluator: Return failed evaluation
    end

    Evaluator-->>Eval: Return evaluation results
    Eval->>Metrics: Compute pass@k metrics
    Metrics-->>User: Display accuracy and pass rates

greptile-apps

_{2 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-01-09T23:48:36Z

nemo_skills/evaluation/evaluator/compute_eval.py

+    async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
+        # noinspection PyBroadException
+        try:
+            problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
+            solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
+


The evaluator does not check if data_point["solution"] is None before attempting to validate it. When the generation task fails to parse a solution (in nemo_skills/inference/eval/compute_eval.py lines 51-54, 58-61), it sets solution to None. Attempting to validate None with the TypeAdapter will raise a validation error.

While this is caught by the exception handler, it would be cleaner and more efficient to explicitly check for None first:

Suggested change

async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:

# noinspection PyBroadException

try:

problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])

solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])

async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:

# noinspection PyBroadException

try:

# Check if solution is None (parsing failed during generation)

if data_point.get("solution") is None:

return {

"passed": False,

"skipped": False,

"elapsed_time": 0.0,

"build_output": "",

"test_output": "",

"error": "Solution parsing failed during generation",

}

problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])

solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])

_{Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!}

greptile-apps · 2026-01-09T23:48:37Z

nemo_skills/inference/eval/compute_eval.py

+# noinspection PyProtectedMember
+from compute_eval.generate_completions import _parse_solution


This imports a protected member _parse_solution from the compute_eval library. Protected members (prefixed with underscore) are internal APIs that may change without notice in future versions of compute-eval, which could break this integration.

Consider:

Checking if compute_eval provides a public API for parsing solutions

If not, documenting this dependency on an internal API and pinning to a specific commit (which is already done in requirements/main.txt at line 17)

Requesting the compute-eval maintainers to expose this as a public API

gwarmstrong

LGTM. Comments primarily addressed in the previous version of PR #1124

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>

blahblahasdf and others added 30 commits January 8, 2026 12:52

Add ComputeEval skill.

98e9dbf

Signed-off-by: dlord <dlord@nvidia.com>

Split out evaluator from generation task.

e9ce660

Signed-off-by: dlord <dlord@nvidia.com>

Adding clan PR with AudioBench and Librispeech PC. (#1103)

d1f8e8a

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: dlord <dlord@nvidia.com>

Schema overrides for tool-calling (#1118)

1510216

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

FIX tool call error handling and search tool errors (#1120)

572f770

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Use run.Script for generate pipeline (#1052)

b2308f7

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Port ICPC changes to IOI (#1046)

0832f7d

Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com> Signed-off-by: dlord <dlord@nvidia.com>

replace raise error with LOG.warning in AA LCR dataset prepare (#1119)

eb4969d

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

FIX tavily search results return type (#1123)

0211bea

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Address CodeRabbit review comments.

980a574

Signed-off-by: dlord <dlord@nvidia.com>

Add exception handling during eval.

52059dd

Signed-off-by: dlord <dlord@nvidia.com>

Remove unncessary HuggingFace Hub call.

920e174

Signed-off-by: dlord <dlord@nvidia.com>

Push Problem data transformation logic in to the prepare script and…

1530926

… move `ComputeEvalGenerationTask` to be more on the rails of the `process_single_datapoint` approach. Signed-off-by: dlord <dlord@nvidia.com>

Revert "Use run.Script for generate pipeline (#1052)" (#1125)

cd88e99

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Fix: add serialized_output on bad request (#1127)

dc9c920

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

update paper link (#1128)

9e1a951

Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

update paper link, references to dataset, self-correction differences (…

ef71245

…#1129) Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

FIX ioi ignore (#1131)

a676e34

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

download AA-LCR_extracted-text.zip via hf_hub_download (#1126)

417ef07

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Evaluation on Livecodebench-pro (#1115)

a90cc44

Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: dlord <dlord@nvidia.com>

Resolve broken links in docs (#1150)

7225f97

Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Introduced vLLM_multimodal model to save multimodal outputs (#1136)

1cbe0b1

Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

add swe-rebench to excluded datasets (#1154)

4b49e26

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Fix run.Script refactor (#1133)

5df7599

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

BIRD Benchmark (Text-to-SQL) (#1132)

7f405cb

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com> # Conflicts: # nemo_skills/evaluation/evaluator/__init__.py

o3-mini-20250131 -> o3-mini-2025-01-31 (#1149)

73624d6

Signed-off-by: bzantium <ryumin93@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>

Update compute-eval version.

94f2208

Signed-off-by: dlord <dlord@nvidia.com>

Another update to compute eval version that removes the pinned urllib3.

544199b

Signed-off-by: dlord <dlord@nvidia.com>

coderabbitai bot reviewed Jan 8, 2026

View reviewed changes

Update nemo_skills/inference/eval/compute_eval.py

f56175d

Improve exception handling during inference. Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com>

greptile-apps bot reviewed Jan 9, 2026

View reviewed changes

gwarmstrong added the run GPU tests label Jan 9, 2026

gwarmstrong added 2 commits January 9, 2026 14:51

Merge branch 'main' into add-compute-eval

2a49236

Update Dockerfile.nemo-skills

28af5ee

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Jan 9, 2026

greptile-apps bot reviewed Jan 9, 2026

View reviewed changes

Exclude compute-eval from gpu integration tests.

3d3f753

Signed-off-by: dlord <dlord@nvidia.com>

gwarmstrong added run GPU tests and removed run GPU tests labels Jan 9, 2026

greptile-apps bot reviewed Jan 9, 2026

View reviewed changes

gwarmstrong approved these changes Jan 10, 2026

View reviewed changes

gwarmstrong merged commit 3a06378 into main Jan 10, 2026
6 checks passed

gwarmstrong deleted the add-compute-eval branch January 10, 2026 01:51

coderabbitai bot mentioned this pull request Jan 13, 2026

Add FrontierScience-Olympiad to benchmark #1165

Merged

coderabbitai bot mentioned this pull request Feb 3, 2026

Api call server #1189

Closed

This was referenced Feb 12, 2026

Eval kit support #1239

Merged

Adding an option to store benchmarks in external repo #1240

Merged

Add DSBench-DA evaluation #1254

Merged

coderabbitai bot mentioned this pull request Mar 6, 2026

Gnalbandyan/ugph hle verified #1293

Merged

		# noinspection PyProtectedMember
		from compute_eval.generate_completions import _parse_solution

Conversation

blahblahasdf commented Jan 8, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

ComputeEval features:

Changes

Dependencies

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Jan 8, 2026

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Jan 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

What Was Added

Architecture & Integration

Code Quality

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Overview

Key Changes

Architecture

Issues Found

Critical Issues

Minor Issues

Strengths

Dependencies

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Jan 9, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 2/5

Important Files Changed

Sequence Diagram

Uh oh!

blahblahasdf commented Jan 8, 2026 •

edited by coderabbitai bot

Loading