
Add compute eval #1158

Merged
gwarmstrong merged 38 commits into main from add-compute-eval
Jan 10, 2026

Conversation

@blahblahasdf (Collaborator) commented Jan 8, 2026

Overview

This PR integrates ComputeEval, a benchmark for evaluating LLMs on CUDA code generation, into NeMo-Skills.

ComputeEval features:

  1. 406 (and growing) CUDA programming challenges
  2. Functional correctness evaluation via compilation and test execution
  3. Multiple release versions (2025-1, 2025-2, 2025-3)

Changes

  1. Dataset Preparation - download ComputeEval problems from HuggingFace (gated dataset)
  2. Generation Task - implement GenerationTask to generate CUDA solutions using ComputeEval's reference generation logic
  3. Evaluator - implement BaseEvaluator using ComputeEval's evaluation logic
  4. Metrics - compute pass@k metrics
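The pass@k metric mentioned above is presumably the standard unbiased estimator popularized by HumanEval-style benchmarks; a minimal sketch (not ComputeEval's actual implementation), where `n` is the number of samples generated per problem and `c` the number that passed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n total (c correct), passes."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k must
        # include at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```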

Dependencies

Adds compute-eval package (compute-eval @ git+https://github.com/NVIDIA/compute-eval.git@2d14770)
Note: compute-eval is a public repository but has not been published to PyPI yet

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for the compute-eval benchmark, enabling evaluation of CUDA code generation with compilation and execution against test suites.
    • Included data preparation and metrics tracking for compute-eval workflows.
  • Documentation

    • Added comprehensive compute-eval benchmark documentation with setup, preparation, and evaluation instructions.
  • Tests

    • Extended test coverage to include the compute-eval dataset.


blahblahasdf and others added 30 commits January 8, 2026 12:52
coderabbitai bot commented Jan 8, 2026

📝 Walkthrough


This pull request introduces compute-eval, an NVIDIA CUDA code generation benchmark, to the NeMo-Skills evaluation framework. It adds dataset preparation, evaluation components, prompt configuration, and test coverage to support end-to-end evaluation workflows.

Changes

  • Dataset Module (nemo_skills/dataset/compute-eval/__init__.py, nemo_skills/dataset/compute-eval/prepare.py): Introduces the dataset module with constants (EVAL_SPLIT, DATASET_GROUP, METRICS_TYPE, etc.) and a prepare script that downloads the NVIDIA compute-eval dataset from HuggingFace, validates HF_TOKEN, formats context files as fenced code blocks, and writes eval.jsonl.
  • Evaluation Framework (nemo_skills/evaluation/evaluator/compute_eval.py, nemo_skills/evaluation/evaluator/__init__.py): Adds a ComputeEvalEvaluator class that loads the NVCC version, validates input via a pydantic TypeAdapter, runs evaluate_solution in a worker thread, and returns structured results with passed/skipped/elapsed_time/build_output/test_output. Registers the evaluator in EVALUATOR_CLASS_MAP.
  • Metrics Framework (nemo_skills/evaluation/metrics/code_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py): Implements a ComputeEvalMetrics class that derives accuracy scores from prediction["passed"] and registers it in METRICS_MAP under the "compute-eval" key.
  • Generation Task (nemo_skills/inference/eval/compute_eval.py): Introduces a ComputeEvalGenerationTask subclass that parses generated solutions into FileSolution objects, includes error handling for missing fields, and provides a Hydra CLI with help-message support.
  • Configuration & Documentation (nemo_skills/prompt/config/compute-eval/baseline.yaml, docs/evaluation/code.md): Adds a system/user prompt configuration defining a CUDA/C/C++ engineer persona, build-command handling, and output-formatting constraints; updates the evaluation docs with compute-eval benchmark setup and usage instructions.
  • Dependencies & Tests (requirements/main.txt, tests/test_datasets.py): Adds the compute-eval Git dependency to requirements and extends dataset test coverage with a ("compute-eval", ["eval"]) entry.

Sequence Diagram

sequenceDiagram
    actor User
    participant PrepareScript as prepare.py
    participant HFDatasets as HuggingFace Datasets
    participant FileSystem as Local Storage
    participant Generator as ComputeEvalGenerationTask
    participant Evaluator as ComputeEvalEvaluator
    participant Metrics as ComputeEvalMetrics

    User->>PrepareScript: Run with --release
    PrepareScript->>HFDatasets: Download compute-eval dataset
    HFDatasets-->>PrepareScript: Dataset records
    PrepareScript->>FileSystem: Write eval.jsonl with context blocks
    
    User->>Generator: Generate solution (prompt + context)
    Generator->>Generator: Parse generated files
    Generator-->>User: Return FileSolution + generation metadata
    
    User->>Evaluator: eval_single(data_point)
    Evaluator->>Evaluator: Load NVCC, validate inputs
    Evaluator->>Evaluator: Run evaluate_solution in thread
    Evaluator-->>User: Return {passed, skipped, elapsed_time, build_output, test_output}
    
    User->>Metrics: update(predictions)
    Metrics->>Metrics: Extract passed from predictions
    Metrics->>Metrics: Compute accuracy & pass@k
    Metrics-->>User: Return scores

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

run GPU tests

Suggested reviewers

  • gwarmstrong
  • ludwig-n
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title Check (❓ Inconclusive): The PR title 'Add compute eval' is vague and generic. While technically related to the changeset, it lacks specificity about what 'compute eval' refers to and does not clearly convey that this integrates a CUDA code generation benchmark into NeMo-Skills. Consider a more descriptive title such as 'Integrate ComputeEval benchmark for CUDA code generation' or 'Add ComputeEval dataset, generation task, and evaluator'.
✅ Passed checks (1 passed)
  • Description Check: check skipped because CodeRabbit's high-level summary is enabled.


coderabbitai bot (Contributor) left a comment
Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @nemo_skills/dataset/compute-eval/prepare.py:
- Around line 56-87: Add a new CLI arg via parser.add_argument("--output-dir",
type=str, default=".", help="Directory to write eval.jsonl") and replace
data_dir = Path(__file__).absolute().parent with data_dir =
Path(args.output_dir).resolve(); ensure data_dir.mkdir(parents=True,
exist_ok=True). Replace exit(1) with sys.exit(1) when HF_TOKEN is missing. For
backwards-compatible HF auth, call load_dataset("nvidia/compute-eval",
args.release, token=token) inside a try/except that catches TypeError and
retries with load_dataset(..., use_auth_token=token). Keep writing the file to
data_dir / "eval.jsonl" and keep using _format_context_files_block for records.

In @nemo_skills/evaluation/metrics/code_metrics.py:
- Around line 126-135: The linter flags an unused parameter in
get_incorrect_sample; rename the parameter prediction to _prediction (i.e., def
get_incorrect_sample(self, _prediction: dict) -> dict:) or alternatively append
an inline noqa (e.g., prediction: dict  # noqa: ARG002) to silence Ruff, and
keep the returned dict as {"passed": False}; update the function signature in
class ComputeEvalMetrics accordingly so references remain valid.
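Applied in isolation, the suggested rename looks roughly like this (a sketch; the real class in code_metrics.py has additional methods):

```python
class ComputeEvalMetrics:
    # Ruff's ARG002 flags unused method arguments; a leading underscore
    # marks the parameter as intentionally unused while keeping the
    # signature expected by callers.
    def get_incorrect_sample(self, _prediction: dict) -> dict:
        return {"passed": False}
```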
🧹 Nitpick comments (5)
nemo_skills/inference/eval/compute_eval.py (1)

38-54: Improve exception handling specificity.

The exception handling has a couple of issues:

  1. Line 49: The KeyError handler doesn't provide context about which field is missing. The error could be from task_id not being in data_point, or from issues within _parse_solution.
  2. Lines 52-54: The broad Exception catch might hide programming errors or unexpected issues that should fail fast rather than being caught and logged.
♻️ Suggested improvements
     async def process_single_datapoint(self, data_point, data):
         res = await super().process_single_datapoint(data_point, data)
         try:
+            # Validate required fields first
+            if "task_id" not in data_point:
+                raise ValueError("Missing required field 'task_id' in data_point")
+            
             solution = FileSolution(
                 task_id=data_point["task_id"],
                 files=_parse_solution(res["generation"]),
             )
             return {
                 "solution": solution.model_dump(),
                 "generation": res["generation"],
             }
-        except KeyError as e:
-            _LOG.error(f"Missing required field: {e}")
+        except (KeyError, ValueError) as e:
+            _LOG.error(f"Failed to create solution for task {data_point.get('task_id', 'unknown')}: {e}")
             raise
-        except Exception as e:
-            _LOG.error(f"Failed to parse solution: {e}")
+        except (TypeError, AttributeError) as e:
+            # Catch specific parsing-related errors only
+            _LOG.error(f"Failed to parse solution for task {data_point.get('task_id', 'unknown')}: {e}")
             raise
docs/evaluation/code.md (2)

184-184: Use descriptive link text instead of "here".

The link text "here" is not descriptive for accessibility and clarity. Consider using more descriptive text that indicates what the link points to.

📝 Suggested improvement
-- Original benchmark source is [here](https://github.com/NVIDIA/compute-eval).
+- Original benchmark source: [compute-eval on GitHub](https://github.com/NVIDIA/compute-eval)

Based on learnings from static analysis tools.


230-235: Add language identifier to code fence for proper syntax highlighting.

The code block showing the metrics output should specify a language identifier (e.g., json or text) for proper rendering and syntax highlighting.

📝 Suggested improvement
-```
+```text
 ---------------------------- compute-eval -----------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
 pass@1          | 50          | 8432       | 1245        | 64.00%
-```
+```

Based on learnings from static analysis tools.

nemo_skills/dataset/compute-eval/prepare.py (1)

21-54: **Make context-file fencing robust (avoid accidental termination) and avoid the leading newline.** Today `_CONTEXT_FILES_BLOCK_TEMPLATE` opens with a newline and uses a plain triple-backtick fence unconditionally; if a context file itself contains triple backticks (rare, but possible), the prompt block will break.

Proposed diff
-_CONTEXT_FILES_BLOCK_TEMPLATE = """
---- file: {path}
-```{fence}
-{content}
-```
-"""
+_CONTEXT_FILES_BLOCK_TEMPLATE = """--- file: {path}
+````{fence}
+{content}
+````
+"""
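A more general fix than a fixed four-backtick fence is to widen the fence based on the content; sketched here as a hypothetical helper (not part of the PR):

```python
def choose_fence(content: str, min_len: int = 3) -> str:
    """Return a backtick fence strictly longer than any backtick run
    in `content`, so the fenced block cannot terminate early."""
    longest = 0
    run = 0
    for ch in content:
        run = run + 1 if ch == "`" else 0
        longest = max(longest, run)
    return "`" * max(min_len, longest + 1)

def format_context_file(path: str, content: str) -> str:
    # Mirrors the "--- file:" block layout without a leading newline.
    fence = choose_fence(content)
    return f"--- file: {path}\n{fence}\n{content}\n{fence}\n"
```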
nemo_skills/evaluation/evaluator/compute_eval.py (1)

35-44: Initialization logic is sound.

The NVCC version check appropriately prevents instantiation when the CUDA Toolkit is unavailable. The version parsing and storage are correct.

Static analysis suggests extracting the error message to reduce complexity, but the current inline message is clear and acceptable:

♻️ Optional: Extract error message to exception class or constant

If you prefer to address the static analysis hint:

+_NVCC_NOT_FOUND_MSG = "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
+
 class ComputeEvalEvaluator(BaseEvaluator):
     _installed_ctk_major: int
     _installed_ctk_minor: int

     def __init__(self, config: dict, num_parallel_requests=10):
         super().__init__(config, num_parallel_requests)
         nvcc_version = get_nvcc_version()
         if not nvcc_version:
-            raise RuntimeError(
-                "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
-            )
+            raise RuntimeError(_NVCC_NOT_FOUND_MSG)

         self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)

Based on static analysis hint TRY003.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52dbc32 and 9a4f608.

📒 Files selected for processing (11)
  • docs/evaluation/code.md
  • nemo_skills/dataset/compute-eval/__init__.py
  • nemo_skills/dataset/compute-eval/prepare.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/compute_eval.py
  • nemo_skills/evaluation/metrics/code_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/inference/eval/compute_eval.py
  • nemo_skills/prompt/config/compute-eval/baseline.yaml
  • requirements/main.txt
  • tests/test_datasets.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.

Applied to files:

  • requirements/main.txt
🧬 Code graph analysis (4)
nemo_skills/inference/eval/compute_eval.py (3)
nemo_skills/inference/generate.py (2)
  • GenerateSolutionsConfig (87-253)
  • GenerationTask (260-791)
nemo_skills/inference/model/__init__.py (1)
  • server_params (141-145)
nemo_skills/utils.py (3)
  • get_help_message (341-361)
  • get_logger_name (39-43)
  • setup_logging (105-140)
nemo_skills/evaluation/evaluator/compute_eval.py (2)
nemo_skills/evaluation/evaluator/base.py (1)
  • BaseEvaluator (34-91)
nemo_skills/utils.py (1)
  • get_logger_name (39-43)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/code_metrics.py (1)
  • ComputeEvalMetrics (126-135)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/compute_eval.py (1)
  • ComputeEvalEvaluator (31-85)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/code.md

184-184: Link text should be descriptive

(MD059, descriptive-link-text)


230-230: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
nemo_skills/evaluation/evaluator/compute_eval.py

39-41: Avoid specifying long messages outside the exception class

(TRY003)


59-65: Consider moving this statement to an else block

(TRY300)


76-76: Do not catch blind exception: Exception

(BLE001)

nemo_skills/evaluation/metrics/code_metrics.py

130-130: Unused method argument: prediction

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (13)
nemo_skills/inference/eval/compute_eval.py (1)

60-73: LGTM! Clean Hydra integration.

The CLI setup properly handles help messages before Hydra initialization and sets up logging appropriately. The pattern follows best practices for Hydra-based command-line tools.

tests/test_datasets.py (1)

62-62: LGTM! Dataset entry follows the established pattern.

The new compute-eval dataset entry is correctly formatted and positioned in the list.

nemo_skills/evaluation/evaluator/__init__.py (1)

30-30: LGTM! Evaluator registration follows the established pattern.

The ComputeEvalEvaluator is correctly imported and registered in EVALUATOR_CLASS_MAP, following the same pattern as other class-based evaluators. The validation logic at lines 76-84 ensures no conflicts between class and function maps.

Also applies to: 73-73

docs/evaluation/code.md (1)

181-240: Excellent documentation coverage!

The compute-eval section provides comprehensive guidance including prerequisites, data preparation, evaluation steps, result verification, and important security considerations. The structure is clear and follows the pattern established for other benchmarks in this document.

requirements/main.txt (1)

17-17: Git dependency commit exists and package structure is valid.

The commit 2d14770 is verified in the NVIDIA/compute-eval repository, pinned to the main branch as a stable reference. The package includes a proper pyproject.toml configuration and is accessible for installation.

nemo_skills/evaluation/metrics/map_metrics.py (2)

25-33: ComputeEvalMetrics import wiring looks correct.
No concerns with the added import and it follows the existing pattern for code-related metrics.


45-76: Metric key "compute-eval" is correctly registered in METRICS_MAP.
This should make get_metrics("compute-eval") work as expected.

nemo_skills/dataset/compute-eval/__init__.py (1)

15-19: Constants are properly defined and consumed by the pipeline. The module nemo_skills.inference.eval.compute_eval exists and is dynamically loaded via get_arg_from_module_or_dict() in nemo_skills/pipeline/utils/eval.py. The GENERATION_ARGS string is passed through without transformation.

nemo_skills/prompt/config/compute-eval/baseline.yaml (1)

1-65: The prompt specifies // file: format, but the parser implementation cannot be verified in this environment.

The baseline.yaml prompt (lines 22–27) clearly specifies the output format: "Each file must be in its own fenced code block, with the first line indicating its path as a comment" with the example // file: geodistance.cu.

However, the actual parsing is delegated to an external library function (compute_eval.generate_completions._parse_solution()) imported in nemo_skills/inference/eval/compute_eval.py (line 43). Since the compute_eval library is not available for inspection in this environment, I cannot definitively confirm that _parse_solution() actually accepts the // file: format as written in the prompt.

To resolve this, verify that compute_eval._parse_solution() accepts the // file: comment convention by checking the library's source code or running a test with the exact format specified in the prompt.

nemo_skills/evaluation/evaluator/compute_eval.py (4)

26-28: LGTM: Logger and type adapter setup.

The logger setup follows the nemo_skills pattern, and the TypeAdapter configuration with discriminated unions is the correct approach for validating problem and solution types dynamically.
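The discriminated-union validation pattern described here can be sketched with toy stand-in models (the real problem/solution classes come from the compute-eval package and their discriminator field is not shown in this PR page):

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class FileSolution(BaseModel):  # toy stand-in, not compute-eval's model
    kind: Literal["file"] = "file"
    task_id: str

class PatchSolution(BaseModel):  # toy stand-in, not compute-eval's model
    kind: Literal["patch"] = "patch"
    task_id: str

# The discriminator lets the adapter pick the concrete type dynamically.
Solution = Annotated[Union[FileSolution, PatchSolution], Field(discriminator="kind")]
adapter = TypeAdapter(Solution)

sol = adapter.validate_python({"kind": "patch", "task_id": "CUDA/7"})
```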


45-65: Evaluation logic correctly implemented.

The async evaluation pattern using asyncio.to_thread properly delegates the synchronous evaluate_solution call to a worker thread, preventing blocking of the async event loop. The pydantic validation ensures type safety for problem and solution inputs.
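The offloading pattern reduces to a small self-contained sketch (using a stand-in blocking function, since the real `evaluate_solution` comes from the compute-eval package):

```python
import asyncio
import time

def blocking_evaluate(task_id: str) -> dict:
    # Stand-in for a synchronous, heavyweight call such as compiling
    # and running a CUDA solution.
    time.sleep(0.01)
    return {"task_id": task_id, "passed": True}

async def eval_single(task_id: str) -> dict:
    # Run the blocking call in a worker thread so the event loop
    # stays free to schedule other evaluations.
    return await asyncio.to_thread(blocking_evaluate, task_id)

result = asyncio.run(eval_single("demo/1"))
```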


66-85: Exception handling provides pipeline resilience.

The dual exception handlers (KeyError for missing fields, generic Exception for all other failures) ensure the evaluation pipeline continues processing even when individual samples fail. Both handlers return consistent error payloads that maintain the expected structure.

The broad exception catching (flagged by static analysis BLE001) is intentional per the noinspection comment and appropriate for an evaluation pipeline where continuing to process other samples is more valuable than failing fast.

Note: The KeyError handler catches missing "problem" or "solution" keys, while pydantic ValidationError exceptions (from invalid data structure) are caught by the generic Exception handler. This distinction allows different error messages but both produce the same error payload structure.


18-21: All imported APIs (CudaCppProblem, CudaPythonProblem, FileSolution, PatchSolution, evaluate_solution, get_nvcc_version, parse_semver) are present and stable at the pinned commit 2d14770 of the compute-eval package. No compatibility issues detected.

Comment on lines +56 to +87
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download and prepare nvidia/compute-eval dataset")
    parser.add_argument(
        "--release",
        type=str,
        default=None,
        help="Release to download (e.g., '2025-1', '2025-2'). If not specified, downloads default release.",
    )

    args = parser.parse_args()

    token = os.getenv("HF_TOKEN", None)
    if not token:
        print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
        exit(1)

    dataset = load_dataset("nvidia/compute-eval", args.release, token=token)
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)

    with open(data_dir / "eval.jsonl", "wt", encoding="utf-8") as f:
        for item in dataset["eval"]:
            record = {
                "problem": item,
                "task_id": item["task_id"],
                "problem_prompt": item["prompt"],
                "build_command": item["build_command"],
                "context_files_block": _format_context_files_block(item["context_files"]),
            }

            # Dumping using default=str to handle datetime serialization from the problem records
            f.write(json.dumps(record, default=str) + "\n")
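The record each eval.jsonl line carries can be mimicked in isolation (with a toy item standing in for a real HuggingFace dataset row, and the context-files block elided):

```python
import json

item = {  # toy stand-in for one dataset row
    "task_id": "CUDA/0",
    "prompt": "Write a kernel that adds two vectors.",
    "build_command": "nvcc -o solution solution.cu",
    "context_files": {},
}

record = {
    "problem": item,
    "task_id": item["task_id"],
    "problem_prompt": item["prompt"],
    "build_command": item["build_command"],
    # The real script fills this via _format_context_files_block(...)
    "context_files_block": "",
}

# default=str handles datetime fields present in real problem records
line = json.dumps(record, default=str)
```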
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# 1) Find datasets version constraints
rg -n '(?:^|\b)datasets\b' -S requirements.txt pyproject.toml setup.cfg setup.py nemo_skills | head -n 200



🏁 Script executed:

# 2) Search for eval.jsonl usage in the codebase
rg -n --fixed-strings 'eval.jsonl' nemo_skills tests 2>/dev/null | head -n 200



🏁 Script executed:

# 3) Find the actual prepare.py file to confirm current implementation
fd -e py -type f -name "prepare.py" | xargs cat -n 2>/dev/null



🏁 Script executed:

# Read the prepare.py file in compute-eval
cat -n nemo_skills/dataset/compute-eval/prepare.py



🏁 Script executed:

# Find pyproject.toml or other version files
find . -name "pyproject.toml" -o -name "requirements*.txt" -o -name "setup.py" -o -name "setup.cfg" | head -20



🏁 Script executed:

# Look for where eval.jsonl is actually used/imported
rg -n 'eval\.jsonl' nemo_skills tests 2>/dev/null



🏁 Script executed:

# Check pyproject.toml for datasets version
cat pyproject.toml | grep -A 5 -B 5 datasets



🏁 Script executed:

# Also check if there's a requirements file at root
ls -la requirements*.txt 2>/dev/null || echo "No requirements files at root"



🌐 Web query:

datasets library token= vs use_auth_token= API compatibility versions

💡 Result:

Short answer

  • The datasets library deprecated use_auth_token in favor of token in version 2.14.0; use_auth_token is scheduled for removal in 3.0.0. Use token=<str|bool> for authentication going forward. [1][2]

Compatibility notes and advice

  • Older code that passes use_auth_token may still work on older datasets releases, but can cause errors in newer releases because load_dataset forwards unknown kwargs into builder/configs (there have been bugs and fixes to strip the deprecated argument). If you maintain code that must run across multiple datasets versions, switch to token and/or detect both names and pass token to the call. [2][3]
  • If you call functions that accept DownloadConfig/BuilderConfig/ push_to_hub, ensure you pass token (or set HF token via environment/HF CLI) because some internals expect token now. [2][4]

Example (portable pattern)

  • Preferred (new):
    load_dataset(..., token="hf_XXXX")
  • Backward-compatible wrapper:

    def load_ds(*args, **kwargs):
        if "use_auth_token" in kwargs and "token" not in kwargs:
            kwargs["token"] = kwargs.pop("use_auth_token")
        return load_dataset(*args, **kwargs)

References
[1] Hugging Face datasets docs — loading methods (deprecation note: use_auth_token deprecated in 2.14.0).
[2] Hugging Face datasets docs (package reference / load_dataset parameters).
[3] GH issue/PR and fixes discussing forwarding/stripping deprecated use_auth_token and related compatibility bugs.


🏁 Script executed:

# Read the full pyproject.toml to see datasets constraints
cat pyproject.toml



🏁 Script executed:

# Check the main requirements file
cat requirements/main.txt | grep -i datasets



🏁 Script executed:

# Also check if there are any other requirements files that might specify datasets
find requirements -type f -name "*.txt" | xargs grep -l datasets 2>/dev/null



🏁 Script executed:

# Check stem.txt to see if it has more specific datasets constraints
cat requirements/stem.txt | grep -i datasets



🏁 Script executed:

# Search for any try/except or version detection around load_dataset calls
rg -n "load_dataset" nemo_skills/dataset/compute-eval/prepare.py -A 3 -B 3



🏁 Script executed:

# Check other prepare.py files to see how they handle load_dataset and token/use_auth_token
rg -n "load_dataset" nemo_skills/dataset -l | head -5 | xargs -I {} bash -c 'echo "=== {} ===" && cat {} | grep -A 2 "load_dataset"' 2>/dev/null | head -60



🏁 Script executed:

# Check if datasets versions support token= parameter
# Search for any version-specific handling in the codebase
rg -n "use_auth_token" nemo_skills



🏁 Script executed:

# Check if the compute-eval dataset is actually used elsewhere or if eval.jsonl is read somewhere
rg -n "compute-eval" nemo_skills --type py



🏁 Script executed:

# Check if there's a minimum datasets version constraint elsewhere
grep -r "datasets" requirements/ nemo_skills --include="*.txt" --include="*.py" | grep -E "(>=|==|<=)"



🏁 Script executed:

# Verify if eval.jsonl is read from the package directory somewhere
rg -n "compute-eval.*eval\.jsonl" nemo_skills



🏁 Script executed:

# Check how the dataset is actually used after preparation
rg -n "eval\.jsonl" -B 3 -A 3 nemo_skills



Add --output-dir argument and handle backwards-compatible HF authentication.

Writing to Path(__file__).parent fails in installed environments (package directories are typically read-only). Additionally, while the code currently uses token=, the datasets<4 constraint lacks a lower bound, so users with older versions may need use_auth_token= for compatibility. Add --output-dir argument (defaulting to current working directory), use sys.exit() instead of exit(), and use try/except to handle both authentication parameter names across datasets versions.

Proposed diff
 import argparse
 import json
 import os
+import sys
 from pathlib import Path

 from datasets import load_dataset
@@
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Download and prepare nvidia/compute-eval dataset")
     parser.add_argument(
         "--release",
         type=str,
         default=None,
         help="Release to download (e.g., '2025-1', '2025-2'). If not specified, downloads default release.",
     )
+    parser.add_argument(
+        "--output-dir",
+        type=Path,
+        default=Path.cwd(),
+        help="Directory to write eval.jsonl into (default: current working directory).",
+    )

     args = parser.parse_args()

     token = os.getenv("HF_TOKEN", None)
     if not token:
         print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
-        exit(1)
+        sys.exit(1)

-    dataset = load_dataset("nvidia/compute-eval", args.release, token=token)
-    data_dir = Path(__file__).absolute().parent
-    data_dir.mkdir(exist_ok=True)
+    try:
+        dataset = load_dataset("nvidia/compute-eval", args.release, token=token)
+    except TypeError:
+        # Backward compatibility with older `datasets` versions.
+        dataset = load_dataset("nvidia/compute-eval", args.release, use_auth_token=token)
+
+    data_dir: Path = args.output_dir
+    data_dir.mkdir(parents=True, exist_ok=True)

     with open(data_dir / "eval.jsonl", "wt", encoding="utf-8") as f:
         for item in dataset["eval"]:
             record = {
                 "problem": item,
                 "task_id": item["task_id"],
                 "problem_prompt": item["prompt"],
                 "build_command": item["build_command"],
                 "context_files_block": _format_context_files_block(item["context_files"]),
             }
@@
             f.write(json.dumps(record, default=str) + "\n")
🤖 Prompt for AI Agents
In @nemo_skills/dataset/compute-eval/prepare.py around lines 56 - 87, Add a new
CLI arg via parser.add_argument("--output-dir", type=str, default=".",
help="Directory to write eval.jsonl") and replace data_dir =
Path(__file__).absolute().parent with data_dir =
Path(args.output_dir).resolve(); ensure data_dir.mkdir(parents=True,
exist_ok=True). Replace exit(1) with sys.exit(1) when HF_TOKEN is missing. For
backwards-compatible HF auth, call load_dataset("nvidia/compute-eval",
args.release, token=token) inside a try/except that catches TypeError and
retries with load_dataset(..., use_auth_token=token). Keep writing the file to
data_dir / "eval.jsonl" and keep using _format_context_files_block for records.

Comment on lines +126 to +135
class ComputeEvalMetrics(BaseMetrics):
    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
        return {"accuracy": prediction["passed"]}

    def get_incorrect_sample(self, prediction: dict) -> dict:
        return {"passed": False}

    def update(self, predictions):
        super().update(predictions)
        self._compute_pass_at_k(predictions=predictions)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix Ruff ARG002: unused prediction arg in get_incorrect_sample.
Rename to _prediction (or add an inline # noqa: ARG002) to keep lint clean.

Proposed diff
 class ComputeEvalMetrics(BaseMetrics):
     def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
         return {"accuracy": prediction["passed"]}

-    def get_incorrect_sample(self, prediction: dict) -> dict:
+    def get_incorrect_sample(self, _prediction: dict) -> dict:
         return {"passed": False}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class ComputeEvalMetrics(BaseMetrics):
    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
        return {"accuracy": prediction["passed"]}

    def get_incorrect_sample(self, prediction: dict) -> dict:
        return {"passed": False}

    def update(self, predictions):
        super().update(predictions)
        self._compute_pass_at_k(predictions=predictions)
class ComputeEvalMetrics(BaseMetrics):
    def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
        return {"accuracy": prediction["passed"]}

    def get_incorrect_sample(self, _prediction: dict) -> dict:
        return {"passed": False}

    def update(self, predictions):
        super().update(predictions)
        self._compute_pass_at_k(predictions=predictions)
🧰 Tools
🪛 Ruff (0.14.10)

130-130: Unused method argument: prediction

(ARG002)

🤖 Prompt for AI Agents
In @nemo_skills/evaluation/metrics/code_metrics.py around lines 126 - 135, The
linter flags an unused parameter in get_incorrect_sample; rename the parameter
prediction to _prediction (i.e., def get_incorrect_sample(self, _prediction:
dict) -> dict:) or alternatively append an inline noqa (e.g., prediction: dict 
# noqa: ARG002) to silence Ruff, and keep the returned dict as {"passed":
False}; update the function signature in class ComputeEvalMetrics accordingly so
references remain valid.

Comment on lines +20 to +21
# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution
Contributor

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if compute_eval is an external package or part of the repo
find . -name "compute_eval" -type d 2>/dev/null | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 44


🏁 Script executed:

# Also search for any requirements or dependencies files
git ls-files | grep -E "requirements|setup.py|pyproject.toml|poetry.lock|Pipfile" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 368


🏁 Script executed:

# Look at the specific file mentioned in the review
cat -n nemo_skills/inference/eval/compute_eval.py | head -30

Repository: NVIDIA-NeMo/Skills

Length of output: 1306


🏁 Script executed:

# Search for _parse_solution usage in the codebase
rg "_parse_solution" -A 2 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 851


🏁 Script executed:

# Check if there are other imports from compute_eval in the codebase
rg "from compute_eval|import compute_eval" -A 2 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1305


🏁 Script executed:

# Check requirements files for compute_eval dependency and version
rg "compute.eval|compute-eval" requirements/ pyproject.toml --no-filename

Repository: NVIDIA-NeMo/Skills

Length of output: 131


🏁 Script executed:

# Check if there's any documentation or comments about _parse_solution
rg "_parse_solution|parse_solution" -A 3 -B 3

Repository: NVIDIA-NeMo/Skills

Length of output: 1195


🏁 Script executed:

# Look for any imports from generate_completions module
rg "from compute_eval.generate_completions|import.*generate_completions" -A 5 -B 2

Repository: NVIDIA-NeMo/Skills

Length of output: 693


🏁 Script executed:

# Search for any documentation or comments about what _parse_solution does
rg "FileSolution|files=" nemo_skills/inference/eval/compute_eval.py -A 5 -B 5

Repository: NVIDIA-NeMo/Skills

Length of output: 841


🏁 Script executed:

# Check git history for when this import was added
git log --oneline --all -S "_parse_solution" 2>/dev/null | head -10

Repository: NVIDIA-NeMo/Skills

Length of output: 148


🏁 Script executed:

# Look for any test files or examples that might show FileSolution usage
fd -e py . | xargs rg "FileSolution" -l

Repository: NVIDIA-NeMo/Skills

Length of output: 153


🏁 Script executed:

# Check the compute_eval.py evaluator file to see if it has similar needs
cat -n nemo_skills/evaluation/evaluator/compute_eval.py | grep -A 20 -B 5 "FileSolution\|PatchSolution" | head -40

Repository: NVIDIA-NeMo/Skills

Length of output: 1806


Avoid importing protected members from external libraries.

Importing _parse_solution (a protected member indicated by the leading underscore) from compute_eval.generate_completions creates a fragile dependency on internal implementation details. If the compute-eval library refactors or renames this function, your code will break. Note that the codebase is pinned to a specific commit (git+https://github.com/NVIDIA/compute-eval.git@2d14770), which provides some stability, but this protected API could still change in future versions.

Consider one of these approaches:

  1. Request a public API from the compute-eval maintainers to expose this parsing functionality
  2. Vendor the parsing logic if the library doesn't provide a stable public API
  3. Check for public alternatives in the compute-eval package (note: the evaluator uses only public compute_eval APIs, suggesting this parsing function may not have a public equivalent)

Improve exception handling during inference.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR successfully integrates ComputeEval, a benchmark for evaluating LLMs on CUDA code generation, into the NeMo-Skills framework. The implementation follows established patterns in the codebase and provides a complete end-to-end workflow.

What Was Added

Dataset Integration:

  • Dataset preparation script that downloads problems from HuggingFace (gated dataset requiring authentication)
  • Formats context files, build commands, and problem prompts into the NeMo-Skills data format
  • Creates eval.jsonl in the standard dataset directory structure

Generation Pipeline:

  • ComputeEvalGenerationTask extends GenerationTask to handle CUDA solution generation
  • Uses compute-eval's _parse_solution function to parse LLM outputs into structured FileSolution objects
  • Properly handles parsing errors by setting solution to None and logging errors

Evaluation System:

  • ComputeEvalEvaluator implements BaseEvaluator for async evaluation
  • Validates CUDA Toolkit version (requires CUDA 12+) at initialization
  • Uses compute-eval's evaluate_solution function to compile and test CUDA code
  • Comprehensive error handling for missing fields, validation errors, and execution failures

Metrics & Configuration:

  • ComputeEvalMetrics follows the same pattern as other code benchmarks
  • Computes pass@k metrics and accuracy
  • Detailed prompt configuration with instructions for CUDA/C/C++ code generation
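For reference, pass@k is conventionally computed with the unbiased estimator from the HumanEval paper, pass@k = 1 - C(n-c, k)/C(n, k), for n generated samples of which c passed. A standalone sketch, independent of the _compute_pass_at_k helper used in the metrics class:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n-c, k) / C(n, k) is the probability that a
    # random size-k subset of the n generations contains at least one of the
    # c passing ones.
    if n - c < k:
        return 1.0  # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 generations of which 2 pass, pass@2 is 1 - C(2,2)/C(4,2) = 5/6.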

Architecture & Integration

The implementation properly integrates with NeMo-Skills architecture:

  • Registered in EVALUATOR_CLASS_MAP for class-based evaluation
  • Registered in METRICS_MAP for metrics computation
  • Added to test coverage in test_datasets.py
  • Comprehensive documentation with prerequisites, setup, and usage examples

Code Quality

  • Follows existing codebase patterns (similar to IOI, ICPC evaluators)
  • Proper error handling with try-except blocks and error logging
  • Uses Pydantic TypeAdapters for robust data validation
  • Async/await pattern for concurrent evaluation
  • Dependencies pinned to specific commit hash for reproducibility

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-structured, follows established patterns in the codebase, includes proper error handling, comprehensive documentation, and test coverage. The code integrates cleanly with existing NeMo-Skills infrastructure without breaking changes. All components follow best practices seen in similar evaluators (IOI, ICPC, livecodebench).
  • No files require special attention - all implementations follow codebase conventions and include appropriate error handling

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/compute-eval/prepare.py 4/5 Dataset preparation script that downloads compute-eval from HuggingFace and formats context files; relies on HF_TOKEN environment variable
nemo_skills/evaluation/evaluator/compute_eval.py 4/5 Evaluator implementation that validates CUDA toolkit version, evaluates solutions using compute-eval library with proper error handling
nemo_skills/evaluation/metrics/code_metrics.py 5/5 Added ComputeEvalMetrics class following same pattern as other code metrics (e.g., HumanEvalInfillingMetrics)
nemo_skills/inference/eval/compute_eval.py 4/5 Generation task that parses LLM output into FileSolution format using compute-eval's _parse_solution; handles parsing errors gracefully
nemo_skills/prompt/config/compute-eval/baseline.yaml 5/5 Prompt configuration for CUDA code generation with clear instructions on build commands, context files, and output format

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as ns CLI
    participant Prepare as prepare.py
    participant HF as HuggingFace
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant ParseSolution as _parse_solution
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval library
    participant Metrics as ComputeEvalMetrics

    Note over User,CLI: Data Preparation Phase
    User->>CLI: ns prepare_data compute-eval --release 2025-1
    CLI->>Prepare: Execute prepare.py
    Prepare->>Prepare: Validate authentication token
    Prepare->>HF: load_dataset with authentication
    HF-->>Prepare: Return dataset with eval split
    Prepare->>Prepare: Format context files & build commands
    Prepare->>Prepare: Write eval.jsonl
    Prepare-->>User: Dataset ready

    Note over User,CLI: Inference Phase
    User->>CLI: ns eval --benchmarks=compute-eval
    CLI->>GenTask: Initialize with prompt config
    GenTask->>GenTask: Load eval.jsonl
    loop For each problem
        GenTask->>LLM: Generate CUDA solution
        LLM-->>GenTask: Return generated code
        GenTask->>ParseSolution: Parse solution into FileSolution
        ParseSolution-->>GenTask: Return structured solution
        GenTask->>GenTask: Write to output file
    end

    Note over User,CLI: Evaluation Phase
    CLI->>Evaluator: Initialize ComputeEvalEvaluator
    Evaluator->>Evaluator: Check nvcc version (CUDA Toolkit 12+)
    loop For each generated solution
        Evaluator->>Evaluator: Validate problem & solution models
        Evaluator->>ComputeEval: evaluate_solution(problem, solution)
        ComputeEval->>ComputeEval: Compile CUDA code
        ComputeEval->>ComputeEval: Run test cases
        ComputeEval-->>Evaluator: Return graded result (passed/failed)
        Evaluator->>Evaluator: Update output with evaluation results
    end

    Note over User,CLI: Metrics Calculation Phase
    CLI->>Metrics: Calculate metrics from evaluation results
    Metrics->>Metrics: Compute pass@k metrics
    Metrics->>Metrics: Calculate accuracy
    Metrics-->>User: Display metrics.json

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Overview

This PR successfully integrates ComputeEval - a CUDA code generation benchmark - into NeMo-Skills with comprehensive support for dataset preparation, generation, evaluation, and metrics tracking.

Key Changes

  1. Dataset Preparation (prepare.py): Downloads ComputeEval problems from HuggingFace (gated dataset) and creates eval.jsonl with problem metadata, build commands, and context files.

  2. Generation Task (inference/eval/compute_eval.py): Implements ComputeEvalGenerationTask that extends GenerationTask to parse LLM-generated CUDA solutions using compute-eval's parsing utility.

  3. Evaluator (evaluation/evaluator/compute_eval.py): Implements ComputeEvalEvaluator that validates solutions and executes them against test suites, with NVCC version checking and comprehensive error handling.

  4. Metrics (evaluation/metrics/code_metrics.py): Implements ComputeEvalMetrics for pass@k computation following established patterns.

  5. Documentation (docs/evaluation/code.md): Comprehensive guide including security warnings about executing machine-generated code, setup requirements (NVIDIA GPU with CUDA 12+), data preparation, and evaluation workflows.

  6. Configuration: Baseline prompt template providing detailed instructions for CUDA code generation, config registration in evaluator/metrics maps, and test coverage.

Architecture

The implementation follows established NeMo-Skills patterns:

  • Evaluator classes inherit from BaseEvaluator and implement async eval_single()
  • Metrics classes implement _get_score_dict() and call _compute_pass_at_k()
  • Generation tasks override process_single_datapoint() for custom solution parsing
  • Clean integration with existing evaluator/metrics registration systems

Issues Found

Critical Issues

  1. Missing import in prepare.py: Uses exit(1) without importing sys module. Should use sys.exit(1) (lines 14-70).

Minor Issues

  1. Error handling in generation task: When solution parsing fails, returns solution: None which will be validated by the evaluator. While error handling catches this, could be more explicit about failure states.

  2. Misplaced comment in evaluator: The # noinspection PyBroadException comment (line 46) is placed above a specific KeyError handler when it should apply to the broad Exception handler on line 76.

Strengths

  • Proper NVCC version validation ensuring CUDA Toolkit 12+ compatibility
  • Comprehensive exception handling at both generation and evaluation stages
  • Security documentation warning about executing machine-generated code
  • Follows established NeMo-Skills architectural patterns
  • Test coverage included in test_datasets.py
  • External dependency properly pinned to specific commit (not yet on PyPI)
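The NVCC validation called out above reduces to extracting the release number from the `nvcc --version` banner and comparing the major version. A minimal sketch over a sample banner (the helper name is illustrative and stands in for compute-eval's get_nvcc_version/parse_semver pair; a real check would first run the `nvcc --version` subprocess):

```python
import re


def parse_nvcc_release(banner: str) -> tuple[int, int]:
    # Extract (major, minor) from an `nvcc --version` banner.
    match = re.search(r"release (\d+)\.(\d+)", banner)
    if not match:
        raise RuntimeError("NVCC not found or version banner not recognized")
    return int(match.group(1)), int(match.group(2))


# Sample banner in the format printed by recent CUDA toolkits.
SAMPLE = "Cuda compilation tools, release 12.4, V12.4.131"
major, minor = parse_nvcc_release(SAMPLE)
assert major >= 12, "compute-eval requires CUDA Toolkit 12+"
```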

Dependencies

Adds compute-eval from NVIDIA GitHub repository (public but not on PyPI). This is appropriate given the specialized nature of the CUDA benchmark.

Confidence Score: 3/5

  • This PR has a critical bug (missing sys import) that will cause runtime failure and should be fixed before merging. Otherwise, the integration is architecturally sound with good error handling.
  • The score reflects one critical syntax error that prevents the prepare.py script from running (missing sys import), which is a blocker. The evaluator and metrics implementation are solid and follow established patterns. The generation task has appropriate error handling. The critical issue must be fixed, and minor issues (misplaced comment, ambiguous error handling) should be addressed for code quality.
  • nemo_skills/dataset/compute-eval/prepare.py requires fixing the sys import issue before merge. nemo_skills/inference/eval/compute_eval.py could benefit from clarifying error handling. nemo_skills/evaluation/evaluator/compute_eval.py has a misplaced comment that should be corrected.

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/dataset/compute-eval/prepare.py 2/5 Script downloads compute-eval dataset from HuggingFace and prepares JSONL file. Contains critical bug: calls exit(1) without importing sys module on line 70.
nemo_skills/evaluation/evaluator/compute_eval.py 4/5 ComputeEvalEvaluator correctly implements BaseEvaluator with proper async handling, NVCC version checking, and comprehensive exception handling. No issues found.
nemo_skills/inference/eval/compute_eval.py 3/5 Generation task extends GenerationTask and processes datapoint to parse solutions. Includes error handling but solution=None could cause issues downstream during evaluation.
nemo_skills/evaluation/metrics/code_metrics.py 5/5 ComputeEvalMetrics correctly implements metrics computation with pass@k support following established patterns. No issues found.

Sequence Diagram

sequenceDiagram
    participant User
    participant prepare.py as prepare.py<br/>(Dataset)
    participant generate as compute_eval.py<br/>(Generation)
    participant LLM as LLM Server
    participant evaluator as ComputeEvalEvaluator
    participant solver as execute_solution<br/>(compute-eval)

    User->>prepare.py: ns prepare_data compute-eval
    prepare.py->>prepare.py: Load from HF (gated)
    prepare.py->>prepare.py: Format context files
    prepare.py->>prepare.py: Write eval.jsonl

    User->>generate: ns eval --benchmarks=compute-eval
    generate->>generate: Read eval.jsonl entries
    generate->>generate: Fill prompt template
    generate->>LLM: Send CUDA problem
    LLM->>generate: Return generated code
    generate->>generate: Parse solution files
    generate->>generate: Write to output.jsonl

    User->>evaluator: Run batch evaluation
    evaluator->>evaluator: Check NVCC version
    evaluator->>evaluator: Validate problem/solution
    evaluator->>solver: execute_solution()
    solver->>solver: Compile code
    solver->>solver: Run tests
    solver->>evaluator: Return results
    evaluator->>evaluator: Compute pass@k metrics
    evaluator->>evaluator: Write results

Comment on lines +14 to +19
import argparse
import json
import os
from pathlib import Path

from datasets import load_dataset
Contributor

The sys module needs to be imported to use sys.exit(). Currently, line 70 calls exit(1), which is a builtin intended for interactive use and not recommended in scripts. The script should import sys and use sys.exit(1) for consistency with Python best practices.

Suggested change
import argparse
import json
import os
from pathlib import Path
from datasets import load_dataset
import argparse
import json
import os
import sys
from pathlib import Path

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +67 to +70
token = os.getenv("HF_TOKEN", None)
if not token:
    print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
    exit(1)
Contributor

Missing import for sys module. The code calls exit(1) on line 70, but the standard Python idiom is to use sys.exit(). Add import sys to the imports at the top of the file (after line 14).

Suggested change
token = os.getenv("HF_TOKEN", None)
if not token:
    print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
    exit(1)
import sys


token = os.getenv("HF_TOKEN", None)
if not token:
    print("Error: HF_TOKEN environment variable not set. Please set it to access the dataset.")
    exit(1)
Contributor

Replace exit(1) with sys.exit(1) after importing sys module at the top of the file. The builtin exit() function is intended for interactive mode, while sys.exit() is the standard for scripts.

Suggested change
exit(1)
sys.exit(1)


Comment on lines +49 to +62
except KeyError as e:
    _LOG.error(f"Missing required field: {e}")
    return {
        "solution": None,
        "generation": res.get("generation", ""),
        "error": f"Missing required field: {e}",
    }
except Exception as e:
    _LOG.error(f"Failed to parse solution: {e}")
    return {
        "solution": None,
        "generation": res.get("generation", ""),
        "error": f"Failed to parse solution: {e}",
    }
Contributor

When solution parsing fails, the code returns solution: None. However, the evaluator will receive this None value and attempt to validate it with _SOLUTION_ADAPTER.validate_python(data_point["solution"]). While the evaluator's exception handling will catch this, it would be clearer to prevent None solutions from reaching the evaluator. Consider either: (1) not including the solution key when it's None so evaluator can skip evaluation, or (2) including an explicit error indicator that the evaluator can check before attempting validation. This improves clarity of failure modes.


Comment on lines +46 to +48
# noinspection PyBroadException
try:
    problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
Contributor

The # noinspection PyBroadException comment on line 46 is placed above the specific KeyError handler, but the broad exception handler is actually on line 76. The comment should be moved to line 76 where the broad Exception catch occurs, or removed if not needed here.

Suggested change
# noinspection PyBroadException
try:
    problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
    try:
        problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
        solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

This PR integrates ComputeEval, a benchmark for evaluating LLMs on CUDA code generation, into the NeMo-Skills framework. The implementation includes:

Components Added:

  • Data Preparation (nemo_skills/dataset/compute-eval/prepare.py): Downloads ComputeEval dataset from HuggingFace (gated, requires HF_TOKEN), formats problem prompts, build commands, and context files into eval.jsonl
  • Generation Task (nemo_skills/inference/eval/compute_eval.py): Extends GenerationTask to generate CUDA solutions, parses generated code using compute-eval's _parse_solution, and creates FileSolution objects
  • Evaluator (nemo_skills/evaluation/evaluator/compute_eval.py): Extends BaseEvaluator to evaluate solutions by compiling CUDA code and running test suites via compute-eval library
  • Metrics (nemo_skills/evaluation/metrics/code_metrics.py): ComputeEvalMetrics class computes pass@k and accuracy metrics
  • Prompt Template (nemo_skills/prompt/config/compute-eval/baseline.yaml): Comprehensive system prompt for CUDA/C++ code generation with instructions on output format, build commands, and context files
  • Documentation (docs/evaluation/code.md): Complete setup and usage instructions with prerequisites (CUDA Toolkit 12+), security warnings about executing machine-generated code

Architecture:
The implementation follows NeMo-Skills patterns with proper registration in EVALUATOR_CLASS_MAP and METRICS_MAP. Data flows from HuggingFace → prepare.py → eval.jsonl → GenerationTask → LLM → Evaluator → compute-eval library → Metrics.

Critical Issue Found:
ComputeEvalEvaluator has a bug where it doesn't convert the config dict to a typed config object. BaseEvaluator.eval_full() expects to access self.config.input_file as an attribute, but ComputeEvalEvaluator stores config as a plain dict, which will cause AttributeError at runtime. This needs to be fixed by creating a ComputeEvalEvaluatorConfig class (like other evaluators do) and converting the dict to this object in __init__.

Strengths:

  • Comprehensive error handling in both generation and evaluation
  • Proper async/await patterns with asyncio.to_thread for blocking operations
  • Clear documentation with security warnings
  • Follows existing codebase patterns for metrics and dataset structure
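The asyncio.to_thread pattern noted above can be sketched as follows, with a semaphore capping concurrency the way num_parallel_requests does. The evaluate stub is a stand-in for compute-eval's blocking compile-and-test call, not its real implementation:

```python
import asyncio
import time


def evaluate(task_id: str) -> dict:
    # Stand-in for a blocking compile-and-test call.
    time.sleep(0.01)
    return {"task_id": task_id, "passed": True}


async def eval_all(task_ids: list[str], num_parallel_requests: int = 10) -> list[dict]:
    sem = asyncio.Semaphore(num_parallel_requests)

    async def eval_one(task_id: str) -> dict:
        async with sem:  # cap how many blocking evaluations run at once
            # Offload the blocking call to a worker thread so the event
            # loop stays responsive.
            return await asyncio.to_thread(evaluate, task_id)

    # gather preserves input order regardless of completion order.
    return list(await asyncio.gather(*(eval_one(t) for t in task_ids)))


results = asyncio.run(eval_all([f"CUDA/{i}" for i in range(5)], num_parallel_requests=2))
```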

Confidence Score: 2/5

  • This PR contains a critical bug that will cause runtime failures when the evaluator runs in batch mode
  • Score of 2 reflects a critical logic error in ComputeEvalEvaluator that will cause AttributeError when eval_full() is called. The evaluator doesn't follow the pattern used by other evaluators (MathEvaluator, BirdEvaluator, CodeExecEvaluator, etc.) of converting the config dict to a typed config object. BaseEvaluator.eval_full() attempts to access self.config.input_file as an attribute, but since self.config is stored as a plain dict, this will fail. This is a blocking issue that prevents the batch evaluation functionality from working. The rest of the implementation is solid with proper error handling, async patterns, and comprehensive documentation.
  • Pay close attention to nemo_skills/evaluation/evaluator/compute_eval.py - it needs a config class and proper conversion in __init__ to match other evaluators' patterns

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/evaluation/evaluator/compute_eval.py 2/5 Evaluator implementation has critical bug: missing config class conversion that will cause AttributeError in eval_full()
nemo_skills/inference/eval/compute_eval.py 4/5 Generation task properly inherits from GenerationTask, handles solution parsing with error handling
nemo_skills/dataset/compute-eval/prepare.py 4/5 Data preparation script downloads dataset and formats it correctly for evaluation pipeline
nemo_skills/evaluation/metrics/code_metrics.py 5/5 Added ComputeEvalMetrics class following existing pattern, correctly implements pass@k computation
docs/evaluation/code.md 5/5 Comprehensive documentation added with clear setup, preparation, and evaluation instructions

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareData as prepare.py
    participant HuggingFace as HuggingFace Dataset
    participant GenerationTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute-eval Library
    participant Metrics as ComputeEvalMetrics

    User->>PrepareData: ns prepare_data compute-eval
    PrepareData->>HuggingFace: load_dataset("nvidia/compute-eval")
    HuggingFace-->>PrepareData: Return dataset items
    PrepareData->>PrepareData: Format context_files_block
    PrepareData->>PrepareData: Write eval.jsonl with problem, task_id, etc.

    User->>GenerationTask: ns eval --benchmarks=compute-eval
    GenerationTask->>GenerationTask: Load eval.jsonl
    loop For each problem
        GenerationTask->>LLM: Generate CUDA code with prompt template
        LLM-->>GenerationTask: Return generated code
        GenerationTask->>GenerationTask: Parse solution using _parse_solution
        GenerationTask->>GenerationTask: Create FileSolution object
        opt If eval during generation
            GenerationTask->>Evaluator: eval_single(data_point)
            Evaluator->>ComputeEval: evaluate_solution(problem, solution)
            ComputeEval->>ComputeEval: Compile CUDA code
            ComputeEval->>ComputeEval: Run test suite
            ComputeEval-->>Evaluator: Return passed, skipped, build/test output
            Evaluator-->>GenerationTask: Return evaluation results
        end
        GenerationTask->>GenerationTask: Write results to output file
    end

    opt If batch evaluation
        User->>Evaluator: Run evaluation
        Evaluator->>Evaluator: Load output file
        loop For each result
            Evaluator->>ComputeEval: evaluate_solution(problem, solution)
            ComputeEval-->>Evaluator: Return evaluation results
        end
        Evaluator->>Evaluator: Update file with evaluation results
    end

    User->>Metrics: Calculate metrics
    Metrics->>Metrics: Load evaluation results
    Metrics->>Metrics: Compute pass@k, accuracy
    Metrics-->>User: Display metrics.json

Comment on lines +35 to +43
    def __init__(self, config: dict, num_parallel_requests=10):
        super().__init__(config, num_parallel_requests)
        nvcc_version = get_nvcc_version()
        if not nvcc_version:
            raise RuntimeError(
                "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
            )

        self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)

ComputeEvalEvaluator is missing a config class and doesn't convert the dict config to an object, which will cause AttributeError when BaseEvaluator.eval_full() tries to access self.config.input_file.

Following the pattern from other evaluators (MathEvaluator, BirdEvaluator, CodeExecEvaluator, etc.), you should:

  1. Create a ComputeEvalEvaluatorConfig class that extends BaseEvaluatorConfig
  2. In __init__, convert the config dict to this object: self.eval_config = ComputeEvalEvaluatorConfig(**self.config)
  3. Update self.config to be the eval_config object so BaseEvaluator.eval_full() can access config.input_file
Suggested change
-    def __init__(self, config: dict, num_parallel_requests=10):
-        super().__init__(config, num_parallel_requests)
-        nvcc_version = get_nvcc_version()
-        if not nvcc_version:
-            raise RuntimeError(
-                "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
-            )
-        self._installed_ctk_major, self._installed_ctk_minor, _ = parse_semver(nvcc_version)
+    def __init__(self, config: dict, num_parallel_requests=10):
+        super().__init__(config, num_parallel_requests)
+        # Convert dict config to typed config object for attribute access
+        from nemo_skills.evaluation.evaluator.base import BaseEvaluatorConfig
+        from nemo_skills.utils import nested_dataclass
+
+        @nested_dataclass(kw_only=True)
+        class ComputeEvalEvaluatorConfig(BaseEvaluatorConfig):
+            pass
+
+        self.config = ComputeEvalEvaluatorConfig(**self.config)
+
+        nvcc_version = get_nvcc_version()
+        if not nvcc_version:
+            raise RuntimeError(
+                "NVCC not found. Please ensure that the CUDA Toolkit is installed and nvcc is in your PATH."
+            )

Signed-off-by: dlord <dlord@nvidia.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 9, 2026

Greptile Overview

Greptile Summary

This PR successfully integrates the ComputeEval benchmark for evaluating LLMs on CUDA code generation tasks. The implementation follows NeMo-Skills architectural patterns and adds all necessary components:

What Changed

  • Dataset Integration: Adds nemo_skills/dataset/compute-eval/ with data preparation script that downloads problems from HuggingFace and formats them with context files
  • Custom Generation Task: Implements ComputeEvalGenerationTask that parses generated CUDA code using compute-eval's solution parser
  • Evaluator: Implements ComputeEvalEvaluator as a class-based evaluator that compiles and executes CUDA code using the compute-eval library
  • Metrics: Adds ComputeEvalMetrics for computing pass@k accuracy metrics
  • Configuration: Includes prompt template with detailed CUDA coding instructions
  • Documentation: Comprehensive docs with setup, evaluation steps, and security warnings
  • Dependencies: Adds compute-eval package pinned to specific commit
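The pass@k metric listed above is conventionally computed with the unbiased estimator popularized by the HumanEval benchmark: given n generated samples per problem of which c pass, estimate the probability that at least one of k randomly drawn samples passes. A minimal sketch (function names are illustrative, not the actual ComputeEvalMetrics API):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n generations (c of which passed) is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def aggregate_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over the benchmark."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Computing 1 - C(n-c, k)/C(n, k) instead of sampling directly keeps the estimate exact and avoids the bias of naively reporting the fraction of passing samples.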

Integration Quality

The integration is well-structured and follows existing patterns in the codebase. All registration points are correctly updated (EVALUATOR_CLASS_MAP, METRICS_MAP, test datasets). The documentation is thorough and includes important security notes about executing machine-generated code.

Key Observations

  1. Security: The documentation appropriately warns about executing machine-generated CUDA code and recommends sandboxed environments. The evaluation relies on the compute-eval library's built-in security measures.

  2. Dependency on Internal API: The generation task uses _parse_solution from compute-eval (a protected member), which is acknowledged with a comment. This is mitigated by pinning to a specific git commit.

  3. Error Handling: Both generation and evaluation have comprehensive try-catch blocks for handling failures. However, the evaluator could be slightly more efficient by checking for None solutions before attempting validation.

  4. Testing: Appropriately excludes compute-eval from automated GPU tests due to CUDA Toolkit requirements, but includes it in dataset validation tests.

Confidence Score: 4/5

  • This PR is safe to merge with minor code quality improvements recommended
  • The implementation is solid and follows established patterns in the codebase. All integration points are correctly wired, documentation is comprehensive, and error handling is in place. The score is 4 rather than 5 due to two minor style issues: (1) use of a protected API from compute-eval that could be fragile, and (2) the evaluator could handle None solutions more gracefully. Both issues have workarounds in place (pinned dependency and exception handling), so they don't block merging but would benefit from cleanup.
  • Files nemo_skills/evaluation/evaluator/compute_eval.py and nemo_skills/inference/eval/compute_eval.py have minor style improvements noted in comments, but no critical issues

Important Files Changed

File Analysis

Filename Score Overview
nemo_skills/evaluation/evaluator/compute_eval.py 4/5 Implements ComputeEvalEvaluator for CUDA code evaluation. Clean implementation but could handle None solution values more gracefully before validation.
nemo_skills/inference/eval/compute_eval.py 4/5 Implements generation task for ComputeEval. Uses protected API _parse_solution from compute-eval library which may be fragile.
nemo_skills/dataset/compute-eval/prepare.py 5/5 Dataset preparation script that downloads and formats ComputeEval problems from HuggingFace. Implementation is clean and follows existing patterns.
nemo_skills/evaluation/metrics/code_metrics.py 5/5 Adds ComputeEvalMetrics class for computing pass@k metrics. Simple implementation that correctly inherits from BaseMetrics.
nemo_skills/prompt/config/compute-eval/baseline.yaml 5/5 Prompt configuration for CUDA code generation. Well-structured with clear instructions for models to generate compilable CUDA code.

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareData as ns prepare_data
    participant HuggingFace as HuggingFace Dataset
    participant Eval as ns eval
    participant GenTask as ComputeEvalGenerationTask
    participant LLM as LLM Server
    participant Parser as _parse_solution
    participant Evaluator as ComputeEvalEvaluator
    participant ComputeEval as compute_eval library
    participant NVCC as CUDA Toolkit
    participant Metrics as ComputeEvalMetrics

    User->>PrepareData: Run ns prepare_data compute-eval
    PrepareData->>HuggingFace: load_dataset(nvidia/compute-eval)
    HuggingFace-->>PrepareData: Return problem data
    PrepareData->>PrepareData: Format context files & prompts
    PrepareData-->>User: Generate eval.jsonl

    User->>Eval: Run ns eval --benchmarks=compute-eval
    Eval->>GenTask: Initialize generation task
    GenTask->>LLM: Send prompts with CUDA problems
    LLM-->>GenTask: Return generated CUDA code
    GenTask->>Parser: Parse solution with _parse_solution
    alt Parsing succeeds
        Parser-->>GenTask: Return FileSolution object
        GenTask->>GenTask: Create solution dict
    else Parsing fails
        Parser-->>GenTask: Raise exception
        GenTask->>GenTask: Set solution = None, log error
    end

    GenTask->>Evaluator: Pass data_point with solution
    Evaluator->>Evaluator: Check nvcc version
    alt Solution is not None
        Evaluator->>ComputeEval: evaluate_solution(problem, solution)
        ComputeEval->>NVCC: Compile CUDA code
        alt Compilation succeeds
            NVCC-->>ComputeEval: Binary compiled
            ComputeEval->>ComputeEval: Execute tests
            ComputeEval-->>Evaluator: Return graded results
        else Compilation fails
            NVCC-->>ComputeEval: Build errors
            ComputeEval-->>Evaluator: Return failed result
        end
    else Solution is None
        Evaluator->>Evaluator: Return failed evaluation
    end

    Evaluator-->>Eval: Return evaluation results
    Eval->>Metrics: Compute pass@k metrics
    Metrics-->>User: Display accuracy and pass rates

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments


Comment on lines +45 to +50
    async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
        # noinspection PyBroadException
        try:
            problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
            solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])


The evaluator does not check if data_point["solution"] is None before attempting to validate it. When the generation task fails to parse a solution (in nemo_skills/inference/eval/compute_eval.py lines 51-54, 58-61), it sets solution to None. Attempting to validate None with the TypeAdapter will raise a validation error.

While this is caught by the exception handler, it would be cleaner and more efficient to explicitly check for None first:

Suggested change
-    async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
-        # noinspection PyBroadException
-        try:
-            problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
-            solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])
+    async def eval_single(self, data_point: dict[str, Any]) -> dict[str, Any]:
+        # noinspection PyBroadException
+        try:
+            # Check if solution is None (parsing failed during generation)
+            if data_point.get("solution") is None:
+                return {
+                    "passed": False,
+                    "skipped": False,
+                    "elapsed_time": 0.0,
+                    "build_output": "",
+                    "test_output": "",
+                    "error": "Solution parsing failed during generation",
+                }
+            problem = _PROBLEM_ADAPTER.validate_python(data_point["problem"])
+            solution = _SOLUTION_ADAPTER.validate_python(data_point["solution"])

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
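The early-return pattern in the suggestion can be reduced to a standalone sketch (a simplified, hypothetical stand-in for eval_single; the pydantic validation and compile/test steps are stubbed out):

```python
def eval_single(data_point: dict) -> dict:
    # Return a failed result immediately when generation produced no
    # parsable solution, instead of letting a validator raise on None.
    if data_point.get("solution") is None:
        return {
            "passed": False,
            "skipped": False,
            "elapsed_time": 0.0,
            "build_output": "",
            "test_output": "",
            "error": "Solution parsing failed during generation",
        }
    # ... validate the problem/solution and run compile + tests here ...
    return {"passed": True, "skipped": False, "error": ""}
```

The explicit check also makes the failure mode observable in the output file ("error" names the cause) rather than burying it in a generic validation traceback.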

Comment on lines +20 to +21
# noinspection PyProtectedMember
from compute_eval.generate_completions import _parse_solution

This imports a protected member _parse_solution from the compute_eval library. Protected members (prefixed with underscore) are internal APIs that may change without notice in future versions of compute-eval, which could break this integration.

Consider:

  1. Checking if compute_eval provides a public API for parsing solutions
  2. If not, documenting this dependency on an internal API and pinning to a specific commit (which is already done in requirements/main.txt at line 17)
  3. Requesting the compute-eval maintainers to expose this as a public API
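One defensive option (a sketch, not part of this PR) is to resolve the internal attribute through a small helper so that an upstream rename fails loudly at import time with an actionable message, rather than deep inside a run:

```python
import importlib

def import_internal(module_name: str, attr: str):
    """Fetch an underscore-prefixed (internal) attribute, converting a
    silent drift of the upstream API into a loud, actionable error."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError) as err:
        raise ImportError(
            f"Internal API {module_name}.{attr} is unavailable; "
            "check the commit pinned in requirements/main.txt."
        ) from err

# Usage sketch:
# _parse_solution = import_internal("compute_eval.generate_completions", "_parse_solution")
```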

@gwarmstrong gwarmstrong left a comment
LGTM. Comments primarily addressed in the previous version of PR #1124

@gwarmstrong gwarmstrong merged commit 3a06378 into main Jan 10, 2026
6 checks passed
@gwarmstrong gwarmstrong deleted the add-compute-eval branch January 10, 2026 01:51
hsiehjackson pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
arnavkomaragiri pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
arnavkomaragiri pushed a commit that referenced this pull request Jan 13, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Feb 3, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>