Conversation
Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
… move `ComputeEvalGenerationTask` to be more on the rails of the `process_single_datapoint` approach. Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
…#1129) Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: i-vainn <imoshkov@nvidia.com> Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Ivan <imoshkov@nvidia.com> Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com> # Conflicts: # nemo_skills/evaluation/evaluator/__init__.py
Signed-off-by: bzantium <ryumin93@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dlord <dlord@nvidia.com>
📝 Walkthrough

This pull request introduces compute-eval, an NVIDIA CUDA code generation benchmark for the NemoSkills evaluation framework. It adds dataset preparation, evaluation components, prompt configuration, and test coverage to support end-to-end evaluation workflows.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant PrepareScript as prepare.py
    participant HFDatasets as HuggingFace Datasets
    participant FileSystem as Local Storage
    participant Generator as ComputeEvalGenerationTask
    participant Evaluator as ComputeEvalEvaluator
    participant Metrics as ComputeEvalMetrics

    User->>PrepareScript: Run with --release
    PrepareScript->>HFDatasets: Download compute-eval dataset
    HFDatasets-->>PrepareScript: Dataset records
    PrepareScript->>FileSystem: Write eval.jsonl with context blocks
    User->>Generator: Generate solution (prompt + context)
    Generator->>Generator: Parse generated files
    Generator-->>User: Return FileSolution + generation metadata
    User->>Evaluator: eval_single(data_point)
    Evaluator->>Evaluator: Load NVCC, validate inputs
    Evaluator->>Evaluator: Run evaluate_solution in thread
    Evaluator-->>User: Return {passed, skipped, elapsed_time, build_output, test_output}
    User->>Metrics: update(predictions)
    Metrics->>Metrics: Extract passed from predictions
    Metrics->>Metrics: Compute accuracy & pass@k
    Metrics-->>User: Return scores
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @nemo_skills/dataset/compute-eval/prepare.py:
- Around line 56-87: Add a new CLI arg via parser.add_argument("--output-dir",
type=str, default=".", help="Directory to write eval.jsonl") and replace
data_dir = Path(__file__).absolute().parent with data_dir =
Path(args.output_dir).resolve(); ensure data_dir.mkdir(parents=True,
exist_ok=True). Replace exit(1) with sys.exit(1) when HF_TOKEN is missing. For
backwards-compatible HF auth, call load_dataset("nvidia/compute-eval",
args.release, token=token) inside a try/except that catches TypeError and
retries with load_dataset(..., use_auth_token=token). Keep writing the file to
data_dir / "eval.jsonl" and keep using _format_context_files_block for records.
In @nemo_skills/evaluation/metrics/code_metrics.py:
- Around line 126-135: The linter flags an unused parameter in
get_incorrect_sample; rename the parameter prediction to _prediction (i.e., def
get_incorrect_sample(self, _prediction: dict) -> dict:) or alternatively append
an inline noqa (e.g., prediction: dict # noqa: ARG002) to silence Ruff, and
keep the returned dict as {"passed": False}; update the function signature in
class ComputeEvalMetrics accordingly so references remain valid.
🧹 Nitpick comments (5)
nemo_skills/inference/eval/compute_eval.py (1)
38-54: Improve exception handling specificity.

The exception handling has a couple of issues:

- Line 49: The `KeyError` handler doesn't provide context about which field is missing. The error could be from `task_id` not being in `data_point`, or from issues within `_parse_solution`.
- Lines 52-54: The broad `Exception` catch might hide programming errors or unexpected issues that should fail fast rather than being caught and logged.

♻️ Suggested improvements
```diff
 async def process_single_datapoint(self, data_point, data):
     res = await super().process_single_datapoint(data_point, data)
     try:
+        # Validate required fields first
+        if "task_id" not in data_point:
+            raise ValueError("Missing required field 'task_id' in data_point")
+
         solution = FileSolution(
             task_id=data_point["task_id"],
             files=_parse_solution(res["generation"]),
         )
         return {
             "solution": solution.model_dump(),
             "generation": res["generation"],
         }
-    except KeyError as e:
-        _LOG.error(f"Missing required field: {e}")
+    except (KeyError, ValueError) as e:
+        _LOG.error(f"Failed to create solution for task {data_point.get('task_id', 'unknown')}: {e}")
         raise
-    except Exception as e:
-        _LOG.error(f"Failed to parse solution: {e}")
+    except (TypeError, AttributeError) as e:
+        # Catch specific parsing-related errors only
+        _LOG.error(f"Failed to parse solution for task {data_point.get('task_id', 'unknown')}: {e}")
         raise
```

docs/evaluation/code.md (2)
184-184: Use descriptive link text instead of "here".

The link text "here" is not descriptive for accessibility and clarity. Consider using more descriptive text that indicates what the link points to.

📝 Suggested improvement

```diff
-- Original benchmark source is [here](https://github.com/NVIDIA/compute-eval).
+- Original benchmark source: [compute-eval on GitHub](https://github.com/NVIDIA/compute-eval)
```

Based on learnings from static analysis tools.
230-235: Add language identifier to code fence for proper syntax highlighting.

The code block showing the metrics output should specify a language identifier (e.g., `json` or `text`) for proper rendering and syntax highlighting.

📝 Suggested improvement

````diff
-```
+```text
 ---------------------------- compute-eval -----------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
 pass@1 | 50 | 8432 | 1245 | 64.00%
-```
+```
````

Based on learnings from static analysis tools.
nemo_skills/dataset/compute-eval/prepare.py (1)
21-54: **Make context-file fencing robust (avoid accidental `` ``` `` termination) and avoid leading newline.**

Today `_CONTEXT_FILES_BLOCK_TEMPLATE` uses `` ``` `` unconditionally; if a context file contains triple backticks (rare, but possible), the prompt block will break.

Proposed diff
`````diff
-_CONTEXT_FILES_BLOCK_TEMPLATE = """
---- file: {path}
-```{fence}
-{content}
-```
-"""
+_CONTEXT_FILES_BLOCK_TEMPLATE = """--- file: {path}
+````{fence}
+{content}
+````
+"""
`````

nemo_skills/evaluation/evaluator/compute_eval.py (1)
35-44: Initialization logic is sound.

The NVCC version check appropriately prevents instantiation when the CUDA Toolkit is unavailable. The version parsing and storage are correct.
Static analysis suggests extracting the error message to reduce complexity, but the current inline message is clear and acceptable:
♻️ Optional: Extract error message to exception class or constant
If you prefer to address the static analysis hint:
```diff
+_NVCC_NOT_FOUND_MSG = "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
+
 class ComputeEvalEvaluator(BaseEvaluator):
     _installed_ctk_major: int
     _installed_ctk_minor: int

     def __init__(self, config: dict, num_parallel_requests=10):
         super().__init__(config, num_parallel_requests)
         nvcc_version = get_nvcc_version()
         if not nvcc_version:
-            raise RuntimeError(
-                "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
-            )
+            raise RuntimeError(_NVCC_NOT_FOUND_MSG)
         self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)
```

Based on static analysis hint TRY003.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- docs/evaluation/code.md
- nemo_skills/dataset/compute-eval/__init__.py
- nemo_skills/dataset/compute-eval/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/compute_eval.py
- nemo_skills/evaluation/metrics/code_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/inference/eval/compute_eval.py
- nemo_skills/prompt/config/compute-eval/baseline.yaml
- requirements/main.txt
- tests/test_datasets.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.
Applied to files:
requirements/main.txt
🧬 Code graph analysis (4)
nemo_skills/inference/eval/compute_eval.py (3)
- nemo_skills/inference/generate.py (2)
  - `GenerateSolutionsConfig` (87-253)
  - `GenerationTask` (260-791)
- nemo_skills/inference/model/__init__.py (1)
  - `server_params` (141-145)
- nemo_skills/utils.py (3)
  - `get_help_message` (341-361)
  - `get_logger_name` (39-43)
  - `setup_logging` (105-140)

nemo_skills/evaluation/evaluator/compute_eval.py (2)
- nemo_skills/evaluation/evaluator/base.py (1)
  - `BaseEvaluator` (34-91)
- nemo_skills/utils.py (1)
  - `get_logger_name` (39-43)

nemo_skills/evaluation/metrics/map_metrics.py (1)
- nemo_skills/evaluation/metrics/code_metrics.py (1)
  - `ComputeEvalMetrics` (126-135)

nemo_skills/evaluation/evaluator/__init__.py (1)
- nemo_skills/evaluation/evaluator/compute_eval.py (1)
  - `ComputeEvalEvaluator` (31-85)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md
184-184: Link text should be descriptive
(MD059, descriptive-link-text)
230-230: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🪛 Ruff (0.14.10)
nemo_skills/evaluation/evaluator/compute_eval.py
39-41: Avoid specifying long messages outside the exception class
(TRY003)
59-65: Consider moving this statement to an else block
(TRY300)
76-76: Do not catch blind exception: Exception
(BLE001)
nemo_skills/evaluation/metrics/code_metrics.py
130-130: Unused method argument: prediction
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (13)
nemo_skills/inference/eval/compute_eval.py (1)
60-73: LGTM! Clean Hydra integration.

The CLI setup properly handles help messages before Hydra initialization and sets up logging appropriately. The pattern follows best practices for Hydra-based command-line tools.
tests/test_datasets.py (1)
62-62: LGTM! Dataset entry follows the established pattern.

The new compute-eval dataset entry is correctly formatted and positioned in the list.
nemo_skills/evaluation/evaluator/__init__.py (1)
30-30: LGTM! Evaluator registration follows the established pattern.

The ComputeEvalEvaluator is correctly imported and registered in EVALUATOR_CLASS_MAP, following the same pattern as other class-based evaluators. The validation logic at lines 76-84 ensures no conflicts between class and function maps.
Also applies to: 73-73
docs/evaluation/code.md (1)
181-240: Excellent documentation coverage!

The compute-eval section provides comprehensive guidance including prerequisites, data preparation, evaluation steps, result verification, and important security considerations. The structure is clear and follows the pattern established for other benchmarks in this document.
requirements/main.txt (1)
17-17: Git dependency commit exists and package structure is valid.

The commit `2d14770` is verified in the NVIDIA/compute-eval repository, pinned to the main branch as a stable reference. The package includes a proper `pyproject.toml` configuration and is accessible for installation.

nemo_skills/evaluation/metrics/map_metrics.py (2)
25-33: ComputeEvalMetrics import wiring looks correct.
No concerns with the added import and it follows the existing pattern for code-related metrics.
45-76: Metric key `"compute-eval"` is correctly registered in `METRICS_MAP`.

This should make `get_metrics("compute-eval")` work as expected.
15-19: Constants are properly defined and consumed by the pipeline. The module `nemo_skills.inference.eval.compute_eval` exists and is dynamically loaded via `get_arg_from_module_or_dict()` in `nemo_skills/pipeline/utils/eval.py`. The `GENERATION_ARGS` string is passed through without transformation.
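Dynamic loading of per-dataset constants like this generally reduces to `importlib` plus `getattr`. A generic sketch of the idea (this is not the actual `get_arg_from_module_or_dict` implementation, which isn't shown here; demonstrated on a stdlib module):

```python
import importlib


def get_module_constant(module_name: str, attr: str, default=None):
    """Fetch a module-level constant (e.g. GENERATION_ARGS) if the dataset
    module defines it, falling back to a default otherwise."""
    module = importlib.import_module(module_name)
    return getattr(module, attr, default)


# Demo against a stdlib module rather than a nemo_skills dataset module.
digits = get_module_constant("string", "digits", default="")
missing = get_module_constant("string", "no_such_constant", default="fallback")
```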
1-65: The prompt specifies a `// file:` format, but the parser implementation cannot be verified in this environment.

The baseline.yaml prompt (lines 22–27) clearly specifies the output format: "Each file must be in its own fenced code block, with the first line indicating its path as a comment" with the example `// file: geodistance.cu`.

However, the actual parsing is delegated to an external library function (`compute_eval.generate_completions._parse_solution()`) imported in `nemo_skills/inference/eval/compute_eval.py` (line 43). Since the `compute_eval` library is not available for inspection in this environment, I cannot definitively confirm that `_parse_solution()` actually accepts the `// file:` format as written in the prompt.

To resolve this, verify that `compute_eval._parse_solution()` accepts the `// file:` comment convention by checking the library's source code or running a test with the exact format specified in the prompt.
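Since `_parse_solution` can't be inspected here, a hypothetical parser for the format the prompt describes (fenced blocks whose first line is a `// file: <path>` comment) might look like the sketch below. The function name and behavior are assumptions for illustration, not the compute-eval implementation:

```python
import re


def parse_file_blocks(generation: str) -> dict[str, str]:
    """Extract {path: content} from fenced code blocks whose first line is
    a '// file: <path>' comment, per the format stated in baseline.yaml."""
    files: dict[str, str] = {}
    # Match ``` fenced blocks with an optional language tag; non-greedy body.
    for match in re.finditer(r"```[^\n]*\n(.*?)```", generation, re.DOTALL):
        body = match.group(1)
        first_line, _, rest = body.partition("\n")
        header = re.match(r"//\s*file:\s*(\S+)", first_line.strip())
        if header:
            files[header.group(1)] = rest
    return files


example = "```cuda\n// file: geodistance.cu\n__global__ void k() {}\n```"
parsed = parse_file_blocks(example)
```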
26-28: LGTM: Logger and type adapter setup.

The logger setup follows the nemo_skills pattern, and the TypeAdapter configuration with discriminated unions is the correct approach for validating problem and solution types dynamically.

45-65: Evaluation logic correctly implemented.

The async evaluation pattern using `asyncio.to_thread` properly delegates the synchronous `evaluate_solution` call to a worker thread, preventing blocking of the async event loop. The pydantic validation ensures type safety for problem and solution inputs.
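The pattern described above — offloading a blocking call with `asyncio.to_thread` so the event loop keeps scheduling other coroutines — reduces to this minimal sketch; `slow_build` is a stand-in for a blocking call like `evaluate_solution`, not compute-eval code:

```python
import asyncio
import time


def slow_build(task_id: str) -> dict:
    """Stand-in for a blocking evaluation step (compile, run tests)."""
    time.sleep(0.05)
    return {"task_id": task_id, "passed": True}


async def eval_single(task_id: str) -> dict:
    # Runs the blocking function in the default thread pool; the event
    # loop stays free to run other eval_single coroutines meanwhile.
    return await asyncio.to_thread(slow_build, task_id)


async def main() -> list[dict]:
    # gather preserves argument order in its results.
    return await asyncio.gather(*(eval_single(f"t{i}") for i in range(3)))


results = asyncio.run(main())
```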
66-85: Exception handling provides pipeline resilience.

The dual exception handlers (KeyError for missing fields, generic Exception for all other failures) ensure the evaluation pipeline continues processing even when individual samples fail. Both handlers return consistent error payloads that maintain the expected structure.

The broad exception catching (flagged by static analysis BLE001) is intentional per the `noinspection` comment and appropriate for an evaluation pipeline where continuing to process other samples is more valuable than failing fast.

Note: The `KeyError` handler catches missing "problem" or "solution" keys, while pydantic `ValidationError` exceptions (from invalid data structure) are caught by the generic `Exception` handler. This distinction allows different error messages but both produce the same error payload structure.
18-21: All imported APIs (CudaCppProblem, CudaPythonProblem, FileSolution, PatchSolution, evaluate_solution, get_nvcc_version, parse_semver) are present and stable at the pinned commit 2d14770 of the compute-eval package. No compatibility issues detected.
```python
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download and prepare nvidia/compute-eval dataset")
    parser.add_argument(
        "--release",
        type=str,
        default=None,
        help="Release to download (e.g., '2025-1', '2025-2'). If not specified, downloads default release.",
    )

    args = parser.parse_args()

    token = os.getenv("HF_TOKEN", None)
    if not token:
        print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
        exit(1)

    dataset = load_dataset("nvidia/compute-eval", args.release, token=token)
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)

    with open(data_dir / "eval.jsonl", "wt", encoding="utf-8") as f:
        for item in dataset["eval"]:
            record = {
                "problem": item,
                "task_id": item["task_id"],
                "problem_prompt": item["prompt"],
                "build_command": item["build_command"],
                "context_files_block": _format_context_files_block(item["context_files"]),
            }

            # Dumping using default=str to handle datetime serialization from the problem records
            f.write(json.dumps(record, default=str) + "\n")
```
🧩 Analysis chain
🏁 Script executed:

```shell
# 1) Find datasets version constraints
rg -n '(?:^|\b)datasets\b' -S requirements.txt pyproject.toml setup.cfg setup.py nemo_skills | head -n 200
```

Repository: NVIDIA-NeMo/Skills
Length of output: 15038

🏁 Script executed:

```shell
# 2) Search for eval.jsonl usage in the codebase
rg -n --fixed-strings 'eval.jsonl' nemo_skills tests 2>/dev/null | head -n 200
```

Repository: NVIDIA-NeMo/Skills
Length of output: 298

🏁 Script executed:

```shell
# 3) Find the actual prepare.py file to confirm current implementation
fd -e py -type f -name "prepare.py" | xargs cat -n 2>/dev/null
```

Repository: NVIDIA-NeMo/Skills
Length of output: 232

🏁 Script executed:

```shell
# Read the prepare.py file in compute-eval
cat -n nemo_skills/dataset/compute-eval/prepare.py
```

Repository: NVIDIA-NeMo/Skills
Length of output: 3524

🏁 Script executed:

```shell
# Find pyproject.toml or other version files
find . -name "pyproject.toml" -o -name "requirements*.txt" -o -name "setup.py" -o -name "setup.cfg" | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 220

🏁 Script executed:

```shell
# Look for where eval.jsonl is actually used/imported
rg -n 'eval\.jsonl' nemo_skills tests 2>/dev/null
```

Repository: NVIDIA-NeMo/Skills
Length of output: 298

🏁 Script executed:

```shell
# Check pyproject.toml for datasets version
cat pyproject.toml | grep -A 5 -B 5 datasets
```

Repository: NVIDIA-NeMo/Skills
Length of output: 44

🏁 Script executed:

```shell
# Also check if there's a requirements file at root
ls -la requirements*.txt 2>/dev/null || echo "No requirements files at root"
```

Repository: NVIDIA-NeMo/Skills
Length of output: 91
🌐 Web query:
datasets library token= vs use_auth_token= API compatibility versions
💡 Result:
Short answer
- The datasets library deprecated use_auth_token in favor of token in version 2.14.0; use_auth_token is scheduled for removal in 3.0.0. Use token=<str|bool> for authentication going forward. [1][2]
Compatibility notes and advice
- Older code that passes use_auth_token may still work on older datasets releases, but can cause errors in newer releases because load_dataset forwards unknown kwargs into builder/configs (there have been bugs and fixes to strip the deprecated argument). If you maintain code that must run across multiple datasets versions, switch to token and/or detect both names and pass token to the call. [2][3]
- If you call functions that accept DownloadConfig/BuilderConfig/ push_to_hub, ensure you pass token (or set HF token via environment/HF CLI) because some internals expect token now. [2][4]
Example (portable pattern)
- Preferred (new): `load_dataset(..., token="hf_XXXX")`
- Backward-compatible wrapper:

```python
def load_ds(*args, **kwargs):
    if "use_auth_token" in kwargs and "token" not in kwargs:
        kwargs["token"] = kwargs.pop("use_auth_token")
    return load_dataset(*args, **kwargs)
```
References
[1] Hugging Face datasets docs — loading methods (deprecation note: use_auth_token deprecated in 2.14.0).
[2] Hugging Face datasets docs (package reference / load_dataset parameters).
[3] GH issue/PR and fixes discussing forwarding/stripping deprecated use_auth_token and related compatibility bugs.
🏁 Script executed:

```shell
# Read the full pyproject.toml to see datasets constraints
cat pyproject.toml
```

Repository: NVIDIA-NeMo/Skills
Length of output: 2343

🏁 Script executed:

```shell
# Check the main requirements file
cat requirements/main.txt | grep -i datasets
```

Repository: NVIDIA-NeMo/Skills
Length of output: 107

🏁 Script executed:

```shell
# Also check if there are any other requirements files that might specify datasets
find requirements -type f -name "*.txt" | xargs grep -l datasets 2>/dev/null
```

Repository: NVIDIA-NeMo/Skills
Length of output: 105

🏁 Script executed:

```shell
# Check stem.txt to see if it has more specific datasets constraints
cat requirements/stem.txt | grep -i datasets
```

Repository: NVIDIA-NeMo/Skills
Length of output: 90

🏁 Script executed:

```shell
# Search for any try/except or version detection around load_dataset calls
rg -n "load_dataset" nemo_skills/dataset/compute-eval/prepare.py -A 3 -B 3
```

Repository: NVIDIA-NeMo/Skills
Length of output: 507

🏁 Script executed:

```shell
# Check other prepare.py files to see how they handle load_dataset and token/use_auth_token
rg -n "load_dataset" nemo_skills/dataset -l | head -5 | xargs -I {} bash -c 'echo "=== {} ===" && cat {} | grep -A 2 "load_dataset"' 2>/dev/null | head -60
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1336

🏁 Script executed:

```shell
# Check if datasets versions support token= parameter
# Search for any version-specific handling in the codebase
rg -n "use_auth_token" nemo_skills
```

Repository: NVIDIA-NeMo/Skills
Length of output: 44

🏁 Script executed:

```shell
# Check if the compute-eval dataset is actually used elsewhere or if eval.jsonl is read somewhere
rg -n "compute-eval" nemo_skills --type py
```

Repository: NVIDIA-NeMo/Skills
Length of output: 721

🏁 Script executed:

```shell
# Check if there's a minimum datasets version constraint elsewhere
grep -r "datasets" requirements/ nemo_skills --include="*.txt" --include="*.py" | grep -E "(>=|==|<=)"
```

Repository: NVIDIA-NeMo/Skills
Length of output: 547

🏁 Script executed:

```shell
# Verify if eval.jsonl is read from the package directory somewhere
rg -n "compute-eval.*eval\.jsonl" nemo_skills
```

Repository: NVIDIA-NeMo/Skills
Length of output: 44

🏁 Script executed:

```shell
# Check how the dataset is actually used after preparation
rg -n "eval\.jsonl" -B 3 -A 3 nemo_skills
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1190
Add --output-dir argument and handle backwards-compatible HF authentication.
Writing to Path(__file__).parent fails in installed environments (package directories are typically read-only). Additionally, while the code currently uses token=, the datasets<4 constraint lacks a lower bound, so users with older versions may need use_auth_token= for compatibility. Add --output-dir argument (defaulting to current working directory), use sys.exit() instead of exit(), and use try/except to handle both authentication parameter names across datasets versions.
Proposed diff

```diff
 import argparse
 import json
 import os
+import sys
 from pathlib import Path

 from datasets import load_dataset
@@
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Download and prepare nvidia/compute-eval dataset")
     parser.add_argument(
         "--release",
         type=str,
         default=None,
         help="Release to download (e.g., '2025-1', '2025-2'). If not specified, downloads default release.",
     )
+    parser.add_argument(
+        "--output-dir",
+        type=Path,
+        default=Path.cwd(),
+        help="Directory to write eval.jsonl into (default: current working directory).",
+    )

     args = parser.parse_args()

     token = os.getenv("HF_TOKEN", None)
     if not token:
         print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
-        exit(1)
+        sys.exit(1)

-    dataset = load_dataset("nvidia/compute-eval", args.release, token=token)
-    data_dir = Path(__file__).absolute().parent
-    data_dir.mkdir(exist_ok=True)
+    try:
+        dataset = load_dataset("nvidia/compute-eval", args.release, token=token)
+    except TypeError:
+        # Backward compatibility with older `datasets` versions.
+        dataset = load_dataset("nvidia/compute-eval", args.release, use_auth_token=token)
+
+    data_dir: Path = args.output_dir
+    data_dir.mkdir(parents=True, exist_ok=True)

     with open(data_dir / "eval.jsonl", "wt", encoding="utf-8") as f:
         for item in dataset["eval"]:
             record = {
                 "problem": item,
                 "task_id": item["task_id"],
                 "problem_prompt": item["prompt"],
                 "build_command": item["build_command"],
                 "context_files_block": _format_context_files_block(item["context_files"]),
             }
@@
             f.write(json.dumps(record, default=str) + "\n")
```

🤖 Prompt for AI Agents
In @nemo_skills/dataset/compute-eval/prepare.py around lines 56 - 87, Add a new
CLI arg via parser.add_argument("--output-dir", type=str, default=".",
help="Directory to write eval.jsonl") and replace data_dir =
Path(__file__).absolute().parent with data_dir =
Path(args.output_dir).resolve(); ensure data_dir.mkdir(parents=True,
exist_ok=True). Replace exit(1) with sys.exit(1) when HF_TOKEN is missing. For
backwards-compatible HF auth, call load_dataset("nvidia/compute-eval",
args.release, token=token) inside a try/except that catches TypeError and
retries with load_dataset(..., use_auth_token=token). Keep writing the file to
data_dir / "eval.jsonl" and keep using _format_context_files_block for records.
```python
class ComputeEvalMetrics(BaseMetrics):
    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
        return {"accuracy": prediction["passed"]}

    def get_incorrect_sample(self, prediction: dict) -> dict:
        return {"passed": False}

    def update(self, predictions):
        super().update(predictions)
        self._compute_pass_at_k(predictions=predictions)
```
Fix Ruff ARG002: unused prediction arg in get_incorrect_sample.
Rename to _prediction (or add an inline # noqa: ARG002) to keep lint clean.
Proposed diff
```diff
 class ComputeEvalMetrics(BaseMetrics):
     def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
         return {"accuracy": prediction["passed"]}

-    def get_incorrect_sample(self, prediction: dict) -> dict:
+    def get_incorrect_sample(self, _prediction: dict) -> dict:
         return {"passed": False}
```
return {"passed": False}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
class ComputeEvalMetrics(BaseMetrics):
    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
        return {"accuracy": prediction["passed"]}

    def get_incorrect_sample(self, _prediction: dict) -> dict:
        return {"passed": False}

    def update(self, predictions):
        super().update(predictions)
        self._compute_pass_at_k(predictions=predictions)
```
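`_compute_pass_at_k` itself is inherited from `BaseMetrics` and is not shown in this diff. The standard unbiased pass@k estimator such metrics typically implement (an assumption about the base class, sketched here for reference) is:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (as popularized by the HumanEval paper):
    probability that at least one of k samples drawn without replacement
    from n generations, of which c are correct, passes."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k must
        # include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```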
🧰 Tools
🪛 Ruff (0.14.10)
130-130: Unused method argument: prediction
(ARG002)
🤖 Prompt for AI Agents
In @nemo_skills/evaluation/metrics/code_metrics.py around lines 126 - 135, The
linter flags an unused parameter in get_incorrect_sample; rename the parameter
prediction to _prediction (i.e., def get_incorrect_sample(self, _prediction:
dict) -> dict:) or alternatively append an inline noqa (e.g., prediction: dict
# noqa: ARG002) to silence Ruff, and keep the returned dict as {"passed":
False}; update the function signature in class ComputeEvalMetrics accordingly so
references remain valid.
```python
# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, check if compute_eval is an external package or part of the repo
find . -name "compute_eval" -type d 2>/dev/null | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 44

🏁 Script executed:

```shell
# Also search for any requirements or dependencies files
git ls-files | grep -E "requirements|setup.py|pyproject.toml|poetry.lock|Pipfile" | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 368

🏁 Script executed:

```shell
# Look at the specific file mentioned in the review
cat -n nemo_skills/inference/eval/compute_eval.py | head -30
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1306

🏁 Script executed:

```shell
# Search for _parse_solution usage in the codebase
rg "_parse_solution" -A 2 -B 2
```

Repository: NVIDIA-NeMo/Skills
Length of output: 851

🏁 Script executed:

```shell
# Check if there are other imports from compute_eval in the codebase
rg "from compute_eval|import compute_eval" -A 2 -B 2
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1305

🏁 Script executed:

```shell
# Check requirements files for compute_eval dependency and version
rg "compute.eval|compute-eval" requirements/ pyproject.toml --no-filename
```

Repository: NVIDIA-NeMo/Skills
Length of output: 131

🏁 Script executed:

```shell
# Check if there's any documentation or comments about _parse_solution
rg "_parse_solution|parse_solution" -A 3 -B 3
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1195

🏁 Script executed:

```shell
# Look for any imports from generate_completions module
rg "from compute_eval.generate_completions|import.*generate_completions" -A 5 -B 2
```

Repository: NVIDIA-NeMo/Skills
Length of output: 693

🏁 Script executed:

```shell
# Search for any documentation or comments about what _parse_solution does
rg "FileSolution|files=" nemo_skills/inference/eval/compute_eval.py -A 5 -B 5
```

Repository: NVIDIA-NeMo/Skills
Length of output: 841

🏁 Script executed:

```shell
# Check git history for when this import was added
git log --oneline --all -S "_parse_solution" 2>/dev/null | head -10
```

Repository: NVIDIA-NeMo/Skills
Length of output: 148

🏁 Script executed:

```shell
# Look for any test files or examples that might show FileSolution usage
fd -e py . | xargs rg "FileSolution" -l
```

Repository: NVIDIA-NeMo/Skills
Length of output: 153

🏁 Script executed:

```shell
# Check the compute_eval.py evaluator file to see if it has similar needs
cat -n nemo_skills/evaluation/evaluator/compute_eval.py | grep -A 20 -B 5 "FileSolution\|PatchSolution" | head -40
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1806
Avoid importing protected members from external libraries.
Importing _parse_solution (a protected member indicated by the leading underscore) from compute_eval.generate_completions creates a fragile dependency on internal implementation details. If the compute-eval library refactors or renames this function, your code will break. Note that the codebase is pinned to a specific commit (git+https://github.com/NVIDIA/compute-eval.git@2d14770), which provides some stability, but this protected API could still change in future versions.
Consider one of these approaches:
- Request a public API from the compute-eval maintainers to expose this parsing functionality
- Vendor the parsing logic if the library doesn't provide a stable public API
- Check for public alternatives in the compute-eval package (note: the evaluator uses only public compute_eval APIs, suggesting this parsing function may not have a public equivalent)
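If the vendoring route is taken, the parsing step can be approximated locally. The sketch below is an illustrative assumption about the shape of such a parser (extracting filename-tagged code fences from an LLM generation), not compute-eval's actual `_parse_solution` logic or wire format:

```python
import re

# Vendored stand-in for the parsing step: extract fenced code blocks whose
# fence info string carries a filename (e.g. "cuda:kernel.cu") into a
# {filename: source} mapping. The fence-with-filename convention here is an
# assumption for illustration only.
FENCE = "`" * 3  # build the triple backtick so this snippet nests cleanly

def parse_solution_files(generation: str) -> dict[str, str]:
    pattern = re.compile(
        re.escape(FENCE) + r"[\w+-]*:(?P<name>\S+)\n(?P<body>.*?)" + re.escape(FENCE),
        re.DOTALL,
    )
    return {m.group("name"): m.group("body") for m in pattern.finditer(generation)}
```

Keeping a local copy like this trades a small maintenance burden for independence from the library's private API surface.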
Improve exception handling during inference. Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Greptile Overview
Greptile Summary
This PR successfully integrates ComputeEval, a benchmark for evaluating LLMs on CUDA code generation, into the NeMo-Skills framework. The implementation follows established patterns in the codebase and provides a complete end-to-end workflow.
What Was Added
Dataset Integration:
- Dataset preparation script that downloads problems from HuggingFace (gated dataset requiring authentication)
- Formats context files, build commands, and problem prompts into the NeMo-Skills data format
- Creates `eval.jsonl` in the standard dataset directory structure

Generation Pipeline:
- `ComputeEvalGenerationTask` extends `GenerationTask` to handle CUDA solution generation
- Uses compute-eval's `_parse_solution` function to parse LLM outputs into structured `FileSolution` objects
- Properly handles parsing errors by setting solution to None and logging errors

Evaluation System:
- `ComputeEvalEvaluator` implements `BaseEvaluator` for async evaluation
- Validates CUDA Toolkit version (requires CUDA 12+) at initialization
- Uses compute-eval's `evaluate_solution` function to compile and test CUDA code
- Comprehensive error handling for missing fields, validation errors, and execution failures

Metrics & Configuration:
- `ComputeEvalMetrics` follows the same pattern as other code benchmarks
- Computes pass@k metrics and accuracy
- Detailed prompt configuration with instructions for CUDA/C/C++ code generation
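For context, the pass@k computation mentioned above is conventionally the unbiased estimator popularized by the HumanEval/Codex work; NeMo-Skills' `_compute_pass_at_k` may differ in details, but the core formula can be sketched as:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n generations is among the c
    correct ones, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Too few incorrect samples to fill a size-k subset: pass@k is 1.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 generations of which 2 passed, pass@1 is 0.5 and pass@2 is 5/6.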
Architecture & Integration
The implementation properly integrates with NeMo-Skills architecture:
- Registered in `EVALUATOR_CLASS_MAP` for class-based evaluation
- Registered in `METRICS_MAP` for metrics computation
- Added to test coverage in `test_datasets.py`
- Comprehensive documentation with prerequisites, setup, and usage examples
Code Quality
- Follows existing codebase patterns (similar to IOI, ICPC evaluators)
- Proper error handling with try-except blocks and error logging
- Uses Pydantic TypeAdapters for robust data validation
- Async/await pattern for concurrent evaluation
- Dependencies pinned to specific commit hash for reproducibility
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The implementation is well-structured, follows established patterns in the codebase, includes proper error handling, comprehensive documentation, and test coverage. The code integrates cleanly with existing NeMo-Skills infrastructure without breaking changes. All components follow best practices seen in similar evaluators (IOI, ICPC, livecodebench).
- No files require special attention - all implementations follow codebase conventions and include appropriate error handling
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/compute-eval/prepare.py | 4/5 | Dataset preparation script that downloads compute-eval from HuggingFace and formats context files; relies on HF_TOKEN environment variable |
| nemo_skills/evaluation/evaluator/compute_eval.py | 4/5 | Evaluator implementation that validates CUDA toolkit version, evaluates solutions using compute-eval library with proper error handling |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | Added ComputeEvalMetrics class following same pattern as other code metrics (e.g., HumanEvalInfillingMetrics) |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | Generation task that parses LLM output into FileSolution format using compute-eval's _parse_solution; handles parsing errors gracefully |
| nemo_skills/prompt/config/compute-eval/baseline.yaml | 5/5 | Prompt configuration for CUDA code generation with clear instructions on build commands, context files, and output format |
Sequence Diagram
sequenceDiagram
participant User
participant CLI as ns CLI
participant Prepare as prepare.py
participant HF as HuggingFace
participant GenTask as ComputeEvalGenerationTask
participant LLM as LLM Server
participant ParseSolution as _parse_solution
participant Evaluator as ComputeEvalEvaluator
participant ComputeEval as compute-eval library
participant Metrics as ComputeEvalMetrics
Note over User,CLI: Data Preparation Phase
User->>CLI: ns prepare_data compute-eval --release 2025-1
CLI->>Prepare: Execute prepare.py
Prepare->>Prepare: Validate authentication token
Prepare->>HF: load_dataset with authentication
HF-->>Prepare: Return dataset with eval split
Prepare->>Prepare: Format context files & build commands
Prepare->>Prepare: Write eval.jsonl
Prepare-->>User: Dataset ready
Note over User,CLI: Inference Phase
User->>CLI: ns eval --benchmarks=compute-eval
CLI->>GenTask: Initialize with prompt config
GenTask->>GenTask: Load eval.jsonl
loop For each problem
GenTask->>LLM: Generate CUDA solution
LLM-->>GenTask: Return generated code
GenTask->>ParseSolution: Parse solution into FileSolution
ParseSolution-->>GenTask: Return structured solution
GenTask->>GenTask: Write to output file
end
Note over User,CLI: Evaluation Phase
CLI->>Evaluator: Initialize ComputeEvalEvaluator
Evaluator->>Evaluator: Check nvcc version (CUDA Toolkit 12+)
loop For each generated solution
Evaluator->>Evaluator: Validate problem & solution models
Evaluator->>ComputeEval: evaluate_solution(problem, solution)
ComputeEval->>ComputeEval: Compile CUDA code
ComputeEval->>ComputeEval: Run test cases
ComputeEval-->>Evaluator: Return graded result (passed/failed)
Evaluator->>Evaluator: Update output with evaluation results
end
Note over User,CLI: Metrics Calculation Phase
CLI->>Metrics: Calculate metrics from evaluation results
Metrics->>Metrics: Compute pass@k metrics
Metrics->>Metrics: Calculate accuracy
Metrics-->>User: Display metrics.json
Signed-off-by: George Armstrong <georgea@nvidia.com>
Greptile Overview
Greptile Summary
Overview
This PR successfully integrates ComputeEval - a CUDA code generation benchmark - into NeMo-Skills with comprehensive support for dataset preparation, generation, evaluation, and metrics tracking.
Key Changes
- Dataset Preparation (`prepare.py`): Downloads ComputeEval problems from HuggingFace (gated dataset) and creates `eval.jsonl` with problem metadata, build commands, and context files.
- Generation Task (`inference/eval/compute_eval.py`): Implements `ComputeEvalGenerationTask`, which extends `GenerationTask` to parse LLM-generated CUDA solutions using compute-eval's parsing utility.
- Evaluator (`evaluation/evaluator/compute_eval.py`): Implements `ComputeEvalEvaluator`, which validates solutions and executes them against test suites, with NVCC version checking and comprehensive error handling.
- Metrics (`evaluation/metrics/code_metrics.py`): Implements `ComputeEvalMetrics` for pass@k computation following established patterns.
- Documentation (`docs/evaluation/code.md`): Comprehensive guide including security warnings about executing machine-generated code, setup requirements (NVIDIA GPU with CUDA 12+), data preparation, and evaluation workflows.
- Configuration: Baseline prompt template providing detailed instructions for CUDA code generation, config registration in evaluator/metrics maps, and test coverage.
Architecture
The implementation follows established NeMo-Skills patterns:
- Evaluator classes inherit from `BaseEvaluator` and implement async `eval_single()`
- Metrics classes implement `_get_score_dict()` and call `_compute_pass_at_k()`
- Generation tasks override `process_single_datapoint()` for custom solution parsing
- Clean integration with existing evaluator/metrics registration systems
Issues Found
Critical Issues
- Missing `sys` import in prepare.py: Uses the builtin `exit(1)` without importing the `sys` module; should use `sys.exit(1)` instead (lines 14-70).
Minor Issues
- Error handling in generation task: When solution parsing fails, returns `solution: None`, which will be validated by the evaluator. While error handling catches this, it could be more explicit about failure states.
- Misplaced comment in evaluator: The `# noinspection PyBroadException` comment (line 46) is placed above a specific KeyError handler when it should apply to the broad Exception handler on line 76.
Strengths
- Proper NVCC version validation ensuring CUDA Toolkit 12+ compatibility
- Comprehensive exception handling at both generation and evaluation stages
- Security documentation warning about executing machine-generated code
- Follows established NeMo-Skills architectural patterns
- Test coverage included in test_datasets.py
- External dependency properly pinned to specific commit (not yet on PyPI)
Dependencies
Adds compute-eval from NVIDIA GitHub repository (public but not on PyPI). This is appropriate given the specialized nature of the CUDA benchmark.
Confidence Score: 3/5
- This PR uses the builtin `exit(1)` instead of `sys.exit(1)` in prepare.py, which should be fixed before merging. Otherwise, the integration is architecturally sound with good error handling.
- The score reflects the `exit(1)` call in prepare.py: the builtin `exit()` is injected by the `site` module and intended for interactive sessions, so scripts should import `sys` and call `sys.exit(1)`. The evaluator and metrics implementation are solid and follow established patterns, and the generation task has appropriate error handling. This issue should be fixed, and minor issues (misplaced comment, ambiguous error handling) should be addressed for code quality.
- nemo_skills/dataset/compute-eval/prepare.py requires fixing the sys import issue before merge. nemo_skills/inference/eval/compute_eval.py could benefit from clarifying error handling. nemo_skills/evaluation/evaluator/compute_eval.py has a misplaced comment that should be corrected.
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/dataset/compute-eval/prepare.py | 2/5 | Script downloads compute-eval dataset from HuggingFace and prepares JSONL file. Calls the builtin exit(1) instead of sys.exit(1) on line 70 without importing sys. |
| nemo_skills/evaluation/evaluator/compute_eval.py | 4/5 | ComputeEvalEvaluator correctly implements BaseEvaluator with proper async handling, NVCC version checking, and comprehensive exception handling. No issues found. |
| nemo_skills/inference/eval/compute_eval.py | 3/5 | Generation task extends GenerationTask and processes datapoint to parse solutions. Includes error handling but solution=None could cause issues downstream during evaluation. |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | ComputeEvalMetrics correctly implements metrics computation with pass@k support following established patterns. No issues found. |
Sequence Diagram
sequenceDiagram
participant User
participant prepare.py as prepare.py<br/>(Dataset)
participant generate as compute_eval.py<br/>(Generation)
participant LLM as LLM Server
participant evaluator as ComputeEvalEvaluator
participant solver as execute_solution<br/>(compute-eval)
User->>prepare.py: ns prepare_data compute-eval
prepare.py->>prepare.py: Load from HF (gated)
prepare.py->>prepare.py: Format context files
prepare.py->>prepare.py: Write eval.jsonl
User->>generate: ns eval --benchmarks=compute-eval
generate->>generate: Read eval.jsonl entries
generate->>generate: Fill prompt template
generate->>LLM: Send CUDA problem
LLM->>generate: Return generated code
generate->>generate: Parse solution files
generate->>generate: Write to output.jsonl
User->>evaluator: Run batch evaluation
evaluator->>evaluator: Check NVCC version
evaluator->>evaluator: Validate problem/solution
evaluator->>solver: execute_solution()
solver->>solver: Compile code
solver->>solver: Run tests
solver->>evaluator: Return results
evaluator->>evaluator: Compute pass@k metrics
evaluator->>evaluator: Write results
```python
import argparse
import json
import os
from pathlib import Path

from datasets import load_dataset
```
The sys module needs to be imported to use sys.exit(). Currently, line 70 calls exit(1) which is a builtin but not recommended. The script should import sys and use sys.exit(1) for consistency with Python best practices.
Suggested change:
```python
import argparse
import json
import os
import sys
from pathlib import Path
```
```python
token = os.getenv("HF_TOKEN", None)
if not token:
    print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
    exit(1)
```
Missing import for sys module. The code calls exit(1) on line 70, but the standard Python idiom is to use sys.exit(). Add import sys to the imports at the top of the file (after line 14).
Suggested change (add to the imports at the top of the file):
```python
import sys
```
```python
token = os.getenv("HF_TOKEN", None)
if not token:
    print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
    exit(1)
```
Replace exit(1) with sys.exit(1) after importing sys module at the top of the file. The builtin exit() function is intended for interactive mode, while sys.exit() is the standard for scripts.
Suggested change:
```python
sys.exit(1)
```
```python
except KeyError as e:
    _LOG.error(f"Missing required field: {e}")
    return {
        "solution": None,
        "generation": res.get("generation", ""),
        "error": f"Missing required field: {e}",
    }
except Exception as e:
    _LOG.error(f"Failed to parse solution: {e}")
    return {
        "solution": None,
        "generation": res.get("generation", ""),
        "error": f"Failed to parse solution: {e}",
    }
```
When solution parsing fails, the code returns solution: None. However, the evaluator will receive this None value and attempt to validate it with _SOLUTION_ADAPTER.validate_python(data_point["solution"]). While the evaluator's exception handling will catch this, it would be clearer to prevent None solutions from reaching the evaluator. Consider either: (1) not including the solution key when it's None so evaluator can skip evaluation, or (2) including an explicit error indicator that the evaluator can check before attempting validation. This improves clarity of failure modes.
```python
# noinspection PyBroadException
try:
    problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
```
The # noinspection PyBroadException comment on line 46 is placed above the specific KeyError handler, but the broad exception handler is actually on line 76. The comment should be moved to line 76 where the broad Exception catch occurs, or removed if not needed here.
Suggested change:
```python
async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
    try:
        problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
        solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
```
Greptile Overview
Greptile Summary
This PR integrates ComputeEval, a benchmark for evaluating LLMs on CUDA code generation, into the NeMo-Skills framework. The implementation includes:
Components Added:
- Data Preparation (`nemo_skills/dataset/compute-eval/prepare.py`): Downloads ComputeEval dataset from HuggingFace (gated, requires HF_TOKEN), formats problem prompts, build commands, and context files into `eval.jsonl`
- Generation Task (`nemo_skills/inference/eval/compute_eval.py`): Extends GenerationTask to generate CUDA solutions, parses generated code using compute-eval's `_parse_solution`, and creates FileSolution objects
- Evaluator (`nemo_skills/evaluation/evaluator/compute_eval.py`): Extends BaseEvaluator to evaluate solutions by compiling CUDA code and running test suites via the compute-eval library
- Metrics (`nemo_skills/evaluation/metrics/code_metrics.py`): `ComputeEvalMetrics` class computes pass@k and accuracy metrics
- Prompt Template (`nemo_skills/prompt/config/compute-eval/baseline.yaml`): Comprehensive system prompt for CUDA/C++ code generation with instructions on output format, build commands, and context files
- Documentation (`docs/evaluation/code.md`): Complete setup and usage instructions with prerequisites (CUDA Toolkit 12+) and security warnings about executing machine-generated code
Architecture:
The implementation follows NeMo-Skills patterns with proper registration in EVALUATOR_CLASS_MAP and METRICS_MAP. Data flows from HuggingFace → prepare.py → eval.jsonl → GenerationTask → LLM → Evaluator → compute-eval library → Metrics.
Critical Issue Found:
`ComputeEvalEvaluator` has a bug where it doesn't convert the config dict to a typed config object. `BaseEvaluator.eval_full()` expects to access `self.config.input_file` as an attribute, but `ComputeEvalEvaluator` stores config as a plain dict, which will cause an AttributeError at runtime. This needs to be fixed by creating a `ComputeEvalEvaluatorConfig` class (like other evaluators do) and converting the dict to this object in `__init__`.
Strengths:
- Comprehensive error handling in both generation and evaluation
- Proper async/await patterns with asyncio.to_thread for blocking operations
- Clear documentation with security warnings
- Follows existing codebase patterns for metrics and dataset structure
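The `asyncio.to_thread` pattern called out above — running a blocking compile-and-test call from an async evaluator — can be sketched as follows. The `evaluate_solution` function here is a hypothetical stand-in for compute-eval's blocking API, not the real call:

```python
import asyncio
import time

def evaluate_solution(problem: str, solution: str) -> dict:
    # Hypothetical stand-in for compute-eval's blocking compile/run step.
    time.sleep(0.01)
    return {"passed": solution.strip() != ""}

async def eval_single(data_point: dict) -> dict:
    # Offload the blocking call to a worker thread so the event loop can
    # keep dispatching other evaluations concurrently.
    return await asyncio.to_thread(
        evaluate_solution, data_point["problem"], data_point["solution"]
    )

async def eval_batch(points: list[dict]) -> list[dict]:
    # gather preserves input order even though evaluations overlap in time.
    return await asyncio.gather(*(eval_single(p) for p in points))
```

Without `to_thread`, each blocking `evaluate_solution` call would stall the event loop and serialize the whole batch.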
Confidence Score: 2/5
- This PR contains a critical bug that will cause runtime failures when the evaluator runs in batch mode
- Score of 2 reflects a critical logic error in ComputeEvalEvaluator that will cause AttributeError when eval_full() is called. The evaluator doesn't follow the pattern used by other evaluators (MathEvaluator, BirdEvaluator, CodeExecEvaluator, etc.) of converting the config dict to a typed config object. BaseEvaluator.eval_full() attempts to access self.config.input_file as an attribute, but since self.config is stored as a plain dict, this will fail. This is a blocking issue that prevents the batch evaluation functionality from working. The rest of the implementation is solid with proper error handling, async patterns, and comprehensive documentation.
- Pay close attention to nemo_skills/evaluation/evaluator/compute_eval.py - it needs a config class and proper conversion in `__init__` to match other evaluators' patterns
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| nemo_skills/evaluation/evaluator/compute_eval.py | 2/5 | Evaluator implementation has critical bug: missing config class conversion that will cause AttributeError in eval_full() |
| nemo_skills/inference/eval/compute_eval.py | 4/5 | Generation task properly inherits from GenerationTask, handles solution parsing with error handling |
| nemo_skills/dataset/compute-eval/prepare.py | 4/5 | Data preparation script downloads dataset and formats it correctly for evaluation pipeline |
| nemo_skills/evaluation/metrics/code_metrics.py | 5/5 | Added ComputeEvalMetrics class following existing pattern, correctly implements pass@k computation |
| docs/evaluation/code.md | 5/5 | Comprehensive documentation added with clear setup, preparation, and evaluation instructions |
Sequence Diagram
sequenceDiagram
participant User
participant PrepareData as prepare.py
participant HuggingFace as HuggingFace Dataset
participant GenerationTask as ComputeEvalGenerationTask
participant LLM as LLM Server
participant Evaluator as ComputeEvalEvaluator
participant ComputeEval as compute-eval Library
participant Metrics as ComputeEvalMetrics
User->>PrepareData: ns prepare_data compute-eval
PrepareData->>HuggingFace: load_dataset("nvidia/compute-eval")
HuggingFace-->>PrepareData: Return dataset items
PrepareData->>PrepareData: Format context_files_block
PrepareData->>PrepareData: Write eval.jsonl with problem, task_id, etc.
User->>GenerationTask: ns eval --benchmarks=compute-eval
GenerationTask->>GenerationTask: Load eval.jsonl
loop For each problem
GenerationTask->>LLM: Generate CUDA code with prompt template
LLM-->>GenerationTask: Return generated code
GenerationTask->>GenerationTask: Parse solution using _parse_solution
GenerationTask->>GenerationTask: Create FileSolution object
opt If eval during generation
GenerationTask->>Evaluator: eval_single(data_point)
Evaluator->>ComputeEval: evaluate_solution(problem, solution)
ComputeEval->>ComputeEval: Compile CUDA code
ComputeEval->>ComputeEval: Run test suite
ComputeEval-->>Evaluator: Return passed, skipped, build/test output
Evaluator-->>GenerationTask: Return evaluation results
end
GenerationTask->>GenerationTask: Write results to output file
end
opt If batch evaluation
User->>Evaluator: Run evaluation
Evaluator->>Evaluator: Load output file
loop For each result
Evaluator->>ComputeEval: evaluate_solution(problem, solution)
ComputeEval-->>Evaluator: Return evaluation results
end
Evaluator->>Evaluator: Update file with evaluation results
end
User->>Metrics: Calculate metrics
Metrics->>Metrics: Load evaluation results
Metrics->>Metrics: Compute pass@k, accuracy
Metrics-->>User: Display metrics.json
```python
def __init__(self, config: dict, num_parallel_requests=10):
    super().__init__(config, num_parallel_requests)
    nvcc_version = get_nvcc_version()
    if not nvcc_version:
        raise RuntimeError(
            "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
        )

    self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)
```
ComputeEvalEvaluator is missing a config class and doesn't convert the dict config to an object, which will cause AttributeError when BaseEvaluator.eval_full() tries to access self.config.input_file.
Following the pattern from other evaluators (MathEvaluator, BirdEvaluator, CodeExecEvaluator, etc.), you should:
- Create a `ComputeEvalEvaluatorConfig` class that extends `BaseEvaluatorConfig`
- In `__init__`, convert the config dict to this object: `self.eval_config = ComputeEvalEvaluatorConfig(**self.config)`
- Update `self.config` to be the eval_config object so `BaseEvaluator.eval_full()` can access `config.input_file`
Suggested change:
```python
def __init__(self, config: dict, num_parallel_requests=10):
    super().__init__(config, num_parallel_requests)

    # Convert dict config to typed config object for attribute access
    from nemo_skills.evaluation.evaluator.base import BaseEvaluatorConfig
    from nemo_skills.utils import nested_dataclass

    @nested_dataclass(kw_only=True)
    class ComputeEvalEvaluatorConfig(BaseEvaluatorConfig):
        pass

    self.config = ComputeEvalEvaluatorConfig(**self.config)

    nvcc_version = get_nvcc_version()
    if not nvcc_version:
        raise RuntimeError(
            "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
        )
```
Signed-off-by: dlord <dlord@nvidia.com>
Greptile Overview
Greptile Summary
This PR successfully integrates the ComputeEval benchmark for evaluating LLMs on CUDA code generation tasks. The implementation follows NeMo-Skills architectural patterns and adds all necessary components.
What Changed
Integration Quality
The integration is well-structured and follows existing patterns in the codebase. All registration points are correctly updated (EVALUATOR_CLASS_MAP, METRICS_MAP, test datasets). The documentation is thorough and includes important security notes about executing machine-generated code.
Key Observations
Confidence Score: 4/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant User
participant PrepareData as ns prepare_data
participant HuggingFace as HuggingFace Dataset
participant Eval as ns eval
participant GenTask as ComputeEvalGenerationTask
participant LLM as LLM Server
participant Parser as _parse_solution
participant Evaluator as ComputeEvalEvaluator
participant ComputeEval as compute_eval library
participant NVCC as CUDA Toolkit
participant Metrics as ComputeEvalMetrics
User->>PrepareData: Run ns prepare_data compute-eval
PrepareData->>HuggingFace: load_dataset(nvidia/compute-eval)
HuggingFace-->>PrepareData: Return problem data
PrepareData->>PrepareData: Format context files & prompts
PrepareData-->>User: Generate eval.jsonl
User->>Eval: Run ns eval --benchmarks=compute-eval
Eval->>GenTask: Initialize generation task
GenTask->>LLM: Send prompts with CUDA problems
LLM-->>GenTask: Return generated CUDA code
GenTask->>Parser: Parse solution with _parse_solution
alt Parsing succeeds
Parser-->>GenTask: Return FileSolution object
GenTask->>GenTask: Create solution dict
else Parsing fails
Parser-->>GenTask: Raise exception
GenTask->>GenTask: Set solution = None, log error
end
GenTask->>Evaluator: Pass data_point with solution
Evaluator->>Evaluator: Check nvcc version
alt Solution is not None
Evaluator->>ComputeEval: evaluate_solution(problem, solution)
ComputeEval->>NVCC: Compile CUDA code
alt Compilation succeeds
NVCC-->>ComputeEval: Binary compiled
ComputeEval->>ComputeEval: Execute tests
ComputeEval-->>Evaluator: Return graded results
else Compilation fails
NVCC-->>ComputeEval: Build errors
ComputeEval-->>Evaluator: Return failed result
end
else Solution is None
Evaluator->>Evaluator: Return failed evaluation
end
Evaluator-->>Eval: Return evaluation results
Eval->>Metrics: Compute pass@k metrics
Metrics-->>User: Display accuracy and pass rates
```python
async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
    # noinspection PyBroadException
    try:
        problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
        solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
```
The evaluator does not check if data_point["solution"] is None before attempting to validate it. When the generation task fails to parse a solution (in nemo_skills/inference/eval/compute_eval.py lines 51-54, 58-61), it sets solution to None. Attempting to validate None with the TypeAdapter will raise a validation error.
While this is caught by the exception handler, it would be cleaner and more efficient to explicitly check for None first:
Suggested change:
```python
async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
    # noinspection PyBroadException
    try:
        # Check if solution is None (parsing failed during generation)
        if data_point.get("solution") is None:
            return {
                "passed": False,
                "skipped": False,
                "elapsed_time": 0.0,
                "build_output": "",
                "test_output": "",
                "error": "Solution parsing failed during generation",
            }
        problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
        solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
```
```python
# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
```
This imports a protected member _parse_solution from the compute_eval library. Protected members (prefixed with underscore) are internal APIs that may change without notice in future versions of compute-eval, which could break this integration.
Consider:
- Checking if `compute_eval` provides a public API for parsing solutions
- If not, documenting this dependency on an internal API and pinning to a specific commit (which is already done in requirements/main.txt at line 17)
- Requesting the compute-eval maintainers to expose this as a public API
gwarmstrong left a comment:
LGTM. Comments primarily addressed in the previous version of PR #1124
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Overview
This PR integrates ComputeEval - a benchmark for evaluating LLMs on CUDA code generation - into NeMo-Skills.
ComputeEval features:
Changes
- Extends `GenerationTask` to generate CUDA solutions using ComputeEval's reference generation logic
- Implements a `BaseEvaluator` subclass using ComputeEval's evaluation logic
Adds compute-eval package (
compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770)Note: compute-eval is a public repository but has not been published to PyPI yet
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests