
Added AAI-Omniscience Benchmark#1161

Merged
arnavkomaragiri merged 48 commits into main from akomaragiri/aai_omniscience
Jan 16, 2026

Conversation

Collaborator

arnavkomaragiri commented Jan 12, 2026

Draft of the AAI-Omniscience Benchmark in NeMo-Skills; will clean up before merging.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for AA-Omniscience dataset evaluation with specialized metrics
    • Introduced judge-based correctness scoring for comprehensive evaluation assessment
    • Enhanced prompt configuration with dynamic system message handling


@@ -0,0 +1,45 @@
import argparse
Collaborator


This is not a Nemo-Skills file, right?

Collaborator Author


It's not, just a basic debug file I added for testing since ns wasn't working in my environment. I'll delete it to clean things up.

self.answer_key = answer_key

# use same RM code as MathMetrics
def _compute_reward_at_k(self, predictions: list[dict]):
Collaborator


Is this function required for all datasets? It seems like math has it, but most of the datasets don't.

Collaborator Author


I don't think it's required, but I figured it might be helpful since a reward model could serve as a proxy for a judge model and may be useful for the downstream task. It's not critical to the benchmark itself, though, so we can drop it if need be.
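A toy illustration of the reward-as-proxy idea above (plain dicts with hypothetical field values; the field names mirror the metrics code in this PR):

```python
# Two sampled answers for one question; the reward model scores each,
# and rm_best@k picks the highest-reward candidate among the first k.
predictions = [
    {"answer": "4", "reward_model_score": 0.9, "judge_correct": 1},
    {"answer": "5", "reward_model_score": 0.2, "judge_correct": 0},
]

best = max(predictions, key=lambda p: p["reward_model_score"])
print(best["answer"], best["judge_correct"])  # -> 4 1
```

Here the reward model happens to agree with the judge, so rm_best@2 would count this question as correct.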

if self.config.system is not None:
messages = [
{"role": "system", "content": self.config.system},
{"role": "system", "content": self.config.system.format(**input_dict)},
Collaborator


This change touches all Nemo-Skills code, not just omniscience. If it's really required, maybe ask Igor to validate?

Collaborator Author


This is required since the system prompt is data-dependent for Omniscience (it uses the topic and domain fields); I'll ask Igor to validate that it doesn't break other code.
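A minimal sketch of why this is needed, with a hypothetical template standing in for the omni.yaml system prompt (the domain/topic/question field names come from the dataset entries described in this PR):

```python
# Hypothetical system template; the real one lives in
# nemo_skills/prompt/config/eval/aai/omni.yaml.
system_template = (
    "You are answering a {domain} question about {topic}. "
    "Answer directly, or state that you cannot."
)

input_dict = {"domain": "law", "topic": "contracts", "question": "..."}

# Data-dependent system message: filled per example, which a static
# system string (the old behavior) cannot express.
messages = [
    {"role": "system", "content": system_template.format(**input_dict)},
]
print(messages[0]["content"])
```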

RUN pip install --no-cache-dir -r /opt/NeMo-Skills/requirements/main.txt
# Fix http mismatch between lepton and ddgs by manually installing ddgs here
RUN pip install ddgs
RUN pip install func-timeout
Collaborator


This is handled in the current NeMo-Skills docker.

arnavkomaragiri force-pushed the akomaragiri/aai_omniscience branch from c5a60b8 to 5fc257b on January 13, 2026 17:43
arnavkomaragiri requested a review from Kipok on January 13, 2026 18:07
arnavkomaragiri marked this pull request as ready for review on January 13, 2026 18:07
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Summary

Adds AAI-Omniscience benchmark evaluation with judge-based correctness scoring and specialized hallucination metrics. The implementation includes dataset preparation, custom metrics computation (omni-index and hallucination rate), and comprehensive documentation with configuration examples.

Critical Issue:

  • nemo_skills/prompt/utils.py:267 - The change to call .format(**input_dict) on system messages breaks existing prompts that use {{}} for literal braces. Multiple existing prompts use \boxed{{}} (math-cot.yaml, math-tir.yaml, etc.) and {{your explanation}} (hle.yaml) which will be unescaped to single braces, causing KeyError or unintended replacements
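The breakage is reproducible with nothing but str.format; the template strings below are illustrative stand-ins for the prompt files named above:

```python
# {{ }} is format's escape for a literal brace: one .format() pass
# consumes the escape, silently changing the rendered prompt.
template = r"Put the final answer in \boxed{{}}."
print(template.format())  # -> Put the final answer in \boxed{}.

# A prompt written without format in mind fails outright: a bare
# {your explanation} placeholder is looked up as a (missing) key.
legacy = "Respond with {your explanation}"
try:
    legacy.format(domain="math", topic="algebra")
except KeyError as err:
    print("KeyError:", err)
```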

Other Issues:

  • nemo_skills/evaluation/metrics/omni_metrics.py:121 - Potential IndexError when checking predictions[0] without verifying the list is non-empty

Confidence Score: 2/5

  • Not safe to merge - contains breaking change to existing prompts
  • The .format() change in prompt/utils.py will break all existing prompts that use {{}} for literal braces (like \boxed{{}}), causing either KeyError or unintended string replacements. Additionally, there's an IndexError risk in omni_metrics.py when accessing empty predictions
  • Critical attention needed for nemo_skills/prompt/utils.py (breaking change) and nemo_skills/evaluation/metrics/omni_metrics.py (potential crash)

Important Files Changed

  • nemo_skills/prompt/utils.py: Added .format(**input_dict) to system messages; breaks existing prompts with {{}} literal braces
  • nemo_skills/evaluation/metrics/omni_metrics.py: New metrics implementation with potential IndexError on empty predictions at line 121
  • nemo_skills/dataset/omniscience/prepare.py: Dataset preparation script; clean implementation, with an unused variable at line 64

Sequence Diagram

sequenceDiagram
    participant User
    participant Eval Pipeline
    participant Prompt Utils
    participant Model
    participant Judge
    participant OmniMetrics

    User->>Eval Pipeline: eval(benchmarks="omniscience")
    Eval Pipeline->>Prompt Utils: Load omni.yaml config
    Prompt Utils->>Prompt Utils: system.format(domain, topic, question)
    Prompt Utils->>Model: Generate answer
    Model-->>Prompt Utils: generation response
    Prompt Utils-->>Eval Pipeline: predictions with generation
    
    Eval Pipeline->>Judge: Load aa-omni-judge.yaml
    Judge->>Judge: Compare generation vs expected_answer
    Judge-->>Eval Pipeline: judgement (A/B/C/D)
    
    Eval Pipeline->>OmniMetrics: update(predictions)
    OmniMetrics->>OmniMetrics: _get_score_dict(judgement)
    OmniMetrics->>OmniMetrics: _compute_pass_at_k()
    alt reward_model_score exists
        OmniMetrics->>OmniMetrics: _compute_reward_at_k()
    end
    OmniMetrics->>OmniMetrics: get_metrics()
    OmniMetrics->>OmniMetrics: Calculate omni_index & hallucination_rate
    OmniMetrics-->>Eval Pipeline: Final metrics
    Eval Pipeline-->>User: Results with accuracy, omni-index, hallucination rate

Contributor

greptile-apps bot left a comment


6 files reviewed, 6 comments


@@ -0,0 +1,26 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year is 2026 (future year)

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

@@ -0,0 +1,80 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year is 2026 (future year)

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

@@ -0,0 +1,120 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year is 2026 (future year)

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

args = parse_args()

dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]
Contributor


jsonl_data variable is unused


# If no valid answers, it's incorrect
if not valid_answers_and_results:
is_correct = False
Contributor


is_correct variable is assigned but never used

Contributor

coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

This PR introduces the "omniscience" evaluation metric system for the AA-Omniscience dataset. It adds dataset preparation logic, a new metrics class that computes evaluation scores using judge signals and reward model scoring, prompt configurations for evaluation and judging, and integration into the existing metrics framework.

Changes

  • Omniscience Dataset Module (nemo_skills/dataset/omniscience/__init__.py): Exports configuration constants for omniscience evaluation: dataset group, metrics type, generation arguments, evaluation split, judge pipeline details (using the gemini-2.5-flash model), and judge prompt configuration.
  • Omniscience Dataset Preparation (nemo_skills/dataset/omniscience/prepare.py): New dataset preparation script that loads the AA-Omniscience-Public dataset, maps topics to split names, standardizes entry format (id, domain, topic, question, expected_answer), and writes per-split JSONL outputs, including a full "text" split and per-domain filtered splits.
  • Omniscience Metrics (nemo_skills/evaluation/metrics/omni_metrics.py): New OmniMetrics class (extends BaseMetrics) that evaluates predictions using judge correctness signals and reward model scores. Includes pass-at-k computation with best/majority selection, omniscience index and hallucination metrics derived from judge flags, and configurable no_answer tracking.
  • Metrics Registration (nemo_skills/evaluation/metrics/map_metrics.py): Registers the OmniMetrics class in METRICS_MAP under the "omniscience" key for framework integration.
  • Prompt Configurations (nemo_skills/prompt/config/eval/aai/omni.yaml, nemo_skills/prompt/config/judge/aa-omni-judge.yaml): Adds an evaluation prompt that constrains answers to direct responses and explicit inability statements, and a detailed judge prompt with a grading rubric covering CORRECT, INCORRECT, PARTIAL_ANSWER, and NOT_ATTEMPTED categories with examples and edge-case handling.
  • Prompt Utilities Update (nemo_skills/prompt/utils.py): Modified the fill() function to apply Python str.format() to system message content, enabling dynamic variable substitution in system prompts.

Sequence Diagrams

sequenceDiagram
    participant User
    participant PrepareScript as Prepare Script
    participant HFDataset as HuggingFace Dataset
    participant FileSystem
    
    User->>PrepareScript: python prepare.py --splits text,math,...
    PrepareScript->>HFDataset: Load AA-Omniscience-Public
    HFDataset-->>PrepareScript: Dataset loaded
    
    loop For each split (text + per-domain)
        PrepareScript->>PrepareScript: Format entries (id, domain, topic, question, answer)
        alt text split
            PrepareScript->>PrepareScript: Use full dataset
        else domain split
            PrepareScript->>HFDataset: Filter by domain
            HFDataset-->>PrepareScript: Filtered entries
        end
        PrepareScript->>FileSystem: Write split_name.jsonl
    end
    
    FileSystem-->>User: JSONL files generated
sequenceDiagram
    participant Evaluator
    participant OmniMetrics
    participant BaseMetrics
    participant Judge
    participant RewardModel
    
    Evaluator->>OmniMetrics: update(predictions)
    OmniMetrics->>BaseMetrics: Parent update logic
    
    loop For each prediction
        OmniMetrics->>OmniMetrics: _get_score_dict (extract judge_correct, etc.)
        alt reward_model_score exists
            OmniMetrics->>RewardModel: Evaluate prediction
            RewardModel-->>OmniMetrics: Score returned
            OmniMetrics->>OmniMetrics: _compute_reward_at_k (best/majority selection)
        end
    end
    
    OmniMetrics->>OmniMetrics: get_metrics (compute judge_omni_index, hallucination)
    OmniMetrics-->>Evaluator: Augmented metrics returned

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Kipok
  • titu1994
  • ekmb
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 8.33%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: The title 'Added AAI-Omniscience Benchmark' accurately reflects the main change: introducing the AAI-Omniscience benchmark to the nemo_skills repository, as evidenced by new modules, evaluation metrics, dataset handling, and configuration files.



Contributor

coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @nemo_skills/dataset/omniscience/__init__.py:
- Around line 21-25: JUDGE_PIPELINE_ARGS currently sets the model to the preview
variant "gemini-2.5-flash-preview-09-2025"; update the "model" value in the
JUDGE_PIPELINE_ARGS dict to the stable GA name "gemini-2.5-flash" (leave other
keys like "server_type" and "server_address" unchanged) so the code uses the
supported GA model.

In @nemo_skills/evaluation/metrics/omni_metrics.py:
- Around line 96-100: The update method accesses predictions[0] without checking
for an empty list, risking IndexError; after calling super().update(predictions)
add a guard like "if not predictions: return" to avoid further processing on an
empty list, or at minimum change the reward-model check to "if predictions and
'reward_model_score' in predictions[0]:"; ensure this guard is applied before
calling _compute_pass_at_k and _compute_reward_at_k so both methods aren't
invoked with an empty predictions list (refer to the update method and helpers
_compute_pass_at_k and _compute_reward_at_k).
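A standalone sketch of the guard being requested (a stub class, not the real OmniMetrics; only the control flow is the point):

```python
class OmniMetricsSketch:
    """Stub showing the early-return guard; not the actual implementation."""

    def __init__(self):
        self.pass_at_k_calls = 0

    def _compute_pass_at_k(self, predictions):
        _ = predictions[0]  # would raise IndexError on an empty list
        self.pass_at_k_calls += 1

    def update(self, predictions):
        if not predictions:  # the guard: nothing to score
            return
        self._compute_pass_at_k(predictions)
        if "reward_model_score" in predictions[0]:
            pass  # _compute_reward_at_k(predictions) would run here


m = OmniMetricsSketch()
m.update([])  # safe: guard short-circuits before any indexing
m.update([{"reward_model_score": 0.5}])
print(m.pass_at_k_calls)  # -> 1
```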
🧹 Nitpick comments (4)
nemo_skills/dataset/omniscience/prepare.py (2)

64-64: Remove unused variable.

jsonl_data is computed but never used. This appears to be dead code from earlier iterations.

🧹 Suggested fix
     dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
-    jsonl_data = [format_entry(d) for d in dataset]
     output_dir = Path(__file__).absolute().parent

68-74: Lambda closure captures loop variable by reference.

The lambda lambda x: x["domain"] == t captures t by reference. In a dict comprehension, this is typically problematic because all lambdas would reference the final value of t. While dataset.filter() likely evaluates immediately (avoiding the bug), this is fragile and flagged by static analysis (B023).

♻️ Recommended fix using default argument capture
     splits = {
         "text": dataset,
         **{
-            TOPIC_TO_SPLIT_MAP.get(t, str(t).lower()): dataset.filter(lambda x: x["domain"] == t)
+            TOPIC_TO_SPLIT_MAP.get(t, str(t).lower()): dataset.filter(lambda x, domain=t: x["domain"] == domain)
             for t in dataset.unique("domain")
         },
     }
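The hazard is easy to see outside of Dataset.filter with plain Python; every late-bound lambda reads the loop variable after the loop has finished:

```python
domains = ["math", "law", "health"]

# Late binding: each lambda looks up t at call time, when t == "health".
late = {t: (lambda x: x == t) for t in domains}
print(late["math"]("health"))  # -> True (wrong predicate!)

# Default-argument capture freezes the value at definition time.
fixed = {t: (lambda x, domain=t: x == domain) for t in domains}
print(fixed["math"]("math"), fixed["math"]("health"))  # -> True False
```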
nemo_skills/evaluation/metrics/omni_metrics.py (2)

87-94: In-place mutation may cause unexpected side effects.

get_incorrect_sample mutates the input prediction dict directly. If the caller doesn't expect this, it could lead to subtle bugs. Consider returning a copy instead.

Safer approach using copy
 def get_incorrect_sample(self, prediction: dict) -> dict:
+    prediction = prediction.copy()
     if "judgement" in prediction:
         prediction["judgement"] = "B"
         prediction["judge_correct"] = 0
         prediction["judge_incorrect"] = 1
         prediction["judge_partially_correct"] = 0
         prediction["judge_abstained"] = 0
     return prediction

15-17: Import directly from base module for clarity and consistency.

BaseMetrics, as_int, and as_percentage are all defined in base.py, not math_metrics.py. While importing from math_metrics.py works because it re-exports these symbols, the codebase standard (used in all other metrics files) is to import directly from base.py.

Suggested import path
-from nemo_skills.evaluation.metrics.math_metrics import BaseMetrics, as_int, as_percentage
+from nemo_skills.evaluation.metrics.base import BaseMetrics, as_int, as_percentage
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 079b106 and b154a4a.

📒 Files selected for processing (7)
  • nemo_skills/dataset/omniscience/__init__.py
  • nemo_skills/dataset/omniscience/prepare.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/omni_metrics.py
  • nemo_skills/prompt/config/eval/aai/omni.yaml
  • nemo_skills/prompt/config/judge/aa-omni-judge.yaml
  • nemo_skills/prompt/utils.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

Applied to files:

  • nemo_skills/prompt/config/judge/aa-omni-judge.yaml
🧬 Code graph analysis (2)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/omni_metrics.py (1)
  • OmniMetrics (20-120)
nemo_skills/evaluation/metrics/omni_metrics.py (2)
nemo_skills/evaluation/metrics/base.py (4)
  • BaseMetrics (23-434)
  • as_int (443-446)
  • as_percentage (437-440)
  • _compute_pass_at_k (352-423)
nemo_skills/evaluation/metrics/map_metrics.py (1)
  • get_metrics (81-110)
🪛 Ruff (0.14.11)
nemo_skills/dataset/omniscience/prepare.py

71-71: Function definition does not bind loop variable t

(B023)

nemo_skills/evaluation/metrics/omni_metrics.py

34-34: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (8)
nemo_skills/prompt/utils.py (1)

265-268: LGTM! Dynamic system message formatting enables data-driven prompts.

This change aligns with the existing pattern used for user messages (line 193) and enables the new omni.yaml config to substitute {domain} and {topic} at runtime. The behavior is consistent—if input_dict lacks required keys, a KeyError will be raised, matching how user message formatting already behaves.

nemo_skills/prompt/config/eval/aai/omni.yaml (1)

1-7: LGTM! Well-structured evaluation prompt config.

The prompt correctly uses placeholders ({domain}, {topic}, {question}) that align with the fields produced by prepare.py's format_entry(). The system prompt appropriately instructs the model to provide direct answers and explicitly state when it lacks sufficient context.

nemo_skills/prompt/config/judge/aa-omni-judge.yaml (1)

1-99: LGTM! Comprehensive and well-documented judge prompt.

The grading rubric is thorough with clear distinctions:

  • Numeric precision rules (lines 14-16) correctly differentiate between measurement values (allow rounding) vs. identifiers/versions (require exact match).
  • Edge cases for typos, inferred context, and coding equivalence are well-covered.
  • The A/B/C/D mapping (lines 93-97) provides unambiguous output parsing.

The placeholders {question}, {expected_answer}, and {generation} align with the evaluation pipeline's data flow.

nemo_skills/dataset/omniscience/__init__.py (1)

16-16: Verify DATASET_GROUP value.

DATASET_GROUP = "math" seems inconsistent with the AA-Omniscience benchmark, which covers diverse domains (humanities, health, law, finance, SWE, STEM). Is this intentional for framework compatibility, or should it be a more generic value like "omniscience" or "knowledge"?

nemo_skills/evaluation/metrics/omni_metrics.py (2)

20-23: LGTM!

Constructor correctly initializes the parent class and stores the answer_key attribute.


102-119: LGTM!

evaluations_to_print and metrics_to_print are correctly implemented, returning appropriate evaluation labels and metric formatters consistent with the base class patterns.

nemo_skills/evaluation/metrics/map_metrics.py (2)

41-41: LGTM!

Import follows the existing alphabetical ordering convention in the file.


76-76: LGTM!

Registration of "omniscience" metric type in METRICS_MAP follows the established pattern and correctly maps to OmniMetrics.

Comment on lines +21 to +25
JUDGE_PIPELINE_ARGS = {
"model": "gemini-2.5-flash-preview-09-2025",
"server_type": "gemini",
"server_address": "https://generativelanguage.googleapis.com",
}
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Gemini 2.5 flash model versions January 2026 availability

💡 Result:

Short answer — as of January 13, 2026:

  • Gemini 2.5 Flash (gemini-2.5-flash) — GA, available in the Gemini app, Google AI Studio and Vertex AI (launched June 17, 2025; scheduled retirement listed as June 17, 2026). [1][2]
  • Gemini 2.5 Flash‑Lite (gemini-2.5-flash-lite) — GA, available to developers (released July 22, 2025; earliest retirement July 22, 2026). [2][3]
  • Gemini 2.5 Flash‑Image (gemini-2.5-flash-image) — available (released Oct 2, 2025). [2][3]
  • Live / native‑audio variants — there were preview/live variants (e.g., gemini-live-2.5-flash and native‑audio preview) that were preview/private GA; some Live preview model IDs were shut down or replaced in Dec 2025, and newer Gemini 3 Flash rollout is superseding parts of the 2.5 Flash lineup. Check the Gemini API changelog and Vertex AI model pages for exact model IDs and shutdown dates. [1][2][4]

Sources: Google blog & Gemini changelog / Vertex AI docs and recent coverage. [1][2][3][4]

References
[1] Google blog (Gemini updates).
[2] Vertex AI — Gemini 2.5 Flash model page.
[3] Vertex AI — model versions & lifecycle.
[4] Gemini API changelog / deprecations.


Replace with stable GA model version.

The model "gemini-2.5-flash-preview-09-2025" is a preview variant. As of January 2026, preview Gemini 2.5 variants were shut down or replaced in December 2025. Use the stable GA version "gemini-2.5-flash" instead (available through June 17, 2026).

🤖 Prompt for AI Agents
In @nemo_skills/dataset/omniscience/__init__.py around lines 21 - 25,
JUDGE_PIPELINE_ARGS currently sets the model to the preview variant
"gemini-2.5-flash-preview-09-2025"; update the "model" value in the
JUDGE_PIPELINE_ARGS dict to the stable GA name "gemini-2.5-flash" (leave other
keys like "server_type" and "server_address" unchanged) so the code uses the
supported GA model.

Comment on lines +26 to +59
def _compute_reward_at_k(self, predictions: list[dict]):
score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
for score_method in score_dicts[0].keys():
# Get valid answers and their results for this field
valid_answers_and_results = [
(elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
for elem, correctness_dict in zip(predictions[:k], score_dicts[:k])
if elem[self.answer_key] is not None
]

# If no valid answers, it's incorrect
if not valid_answers_and_results:
is_correct = False
else:
is_correct_best = sorted(valid_answers_and_results, key=lambda x: x[2], reverse=True)[0][1]
self.eval_dict[f"rm_best@{k}"][score_method] += is_correct_best

answer_to_score_dict = defaultdict(float)
answer_to_correctness_dict = {}
for predicted_answer, is_correct, reward_score in valid_answers_and_results:
answer_to_score_dict[predicted_answer] += reward_score
answer_to_correctness_dict[predicted_answer] = is_correct

top_cum_reward_answer = sorted(
list(answer_to_score_dict.items()), key=lambda x: x[1], reverse=True
)[0][0]
is_correct_majority = answer_to_correctness_dict[top_cum_reward_answer]
self.eval_dict[f"rm_majority@{k}"][score_method] += is_correct_majority

no_answer = all(elem[self.answer_key] is None for elem in predictions[:k])
self.eval_dict[f"rm_best@{k}"]["no_answer"] += no_answer
self.eval_dict[f"rm_majority@{k}"]["no_answer"] += no_answer
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Multiple issues in _compute_reward_at_k.

  1. Potential IndexError (Line 30): If predictions is empty, score_dicts will be empty and score_dicts[0].keys() will raise IndexError.

  2. Dead code (Line 40): is_correct = False is assigned but never used.

  3. Variable shadowing (Line 47): Loop variable is_correct shadows the outer is_correct from line 40, causing confusion.

  4. Missing strict= on zip (Line 34): Per static analysis, adding strict=True would catch length mismatches.

Suggested fix
 def _compute_reward_at_k(self, predictions: list[dict]):
+    if not predictions:
+        return
+
     score_dicts = [self._get_score_dict(pred) for pred in predictions]
+    if not score_dicts or not score_dicts[0]:
+        return

     for k in range(1, len(predictions) + 1):
         for score_method in score_dicts[0].keys():
             # Get valid answers and their results for this field
             valid_answers_and_results = [
                 (elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
-                for elem, correctness_dict in zip(predictions[:k], score_dicts[:k])
+                for elem, correctness_dict in zip(predictions[:k], score_dicts[:k], strict=True)
                 if elem[self.answer_key] is not None
             ]

             # If no valid answers, it's incorrect
             if not valid_answers_and_results:
-                is_correct = False
+                pass  # No valid answers to process
             else:
                 is_correct_best = sorted(valid_answers_and_results, key=lambda x: x[2], reverse=True)[0][1]
                 self.eval_dict[f"rm_best@{k}"][score_method] += is_correct_best

                 answer_to_score_dict = defaultdict(float)
                 answer_to_correctness_dict = {}
-                for predicted_answer, is_correct, reward_score in valid_answers_and_results:
+                for predicted_answer, correctness, reward_score in valid_answers_and_results:
                     answer_to_score_dict[predicted_answer] += reward_score
-                    answer_to_correctness_dict[predicted_answer] = is_correct
+                    answer_to_correctness_dict[predicted_answer] = correctness
🧰 Tools
🪛 Ruff (0.14.11)

34-34: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

Comment on lines +61 to +69
def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
correctness_dict = {}
if "judgement" in prediction:
judgement = prediction["judgement"]
correctness_dict["judge_correct"] = int(judgement.lower() == "a")
correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
return correctness_dict
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the file and understand its structure
fd -p omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 108


🏁 Script executed:

# Read the omni_metrics.py file to see the full context
wc -l nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 112


🏁 Script executed:

# Read relevant sections of the file
head -150 nemo_skills/evaluation/metrics/omni_metrics.py | tail -100

Repository: NVIDIA-NeMo/Skills

Length of output: 4822


🏁 Script executed:

# Look at the _compute_reward_at_k method around line 30
sed -n '20,50p' nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1678


🏁 Script executed:

# Check the full structure of the class and find base class
sed -n '1,100p' nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 4883


🏁 Script executed:

# Search for _compute_pass_at_k in the file
grep -n "_compute_pass_at_k\|_compute_reward_at_k" nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 246


🏁 Script executed:

# Find BaseMetrics class
fd -p math_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 108


🏁 Script executed:

# Check the BaseMetrics class and _compute_pass_at_k method
cat nemo_skills/evaluation/metrics/math_metrics.py | head -200

Repository: NVIDIA-NeMo/Skills

Length of output: 6732


🏁 Script executed:

# Find and read the base metrics class
fd -p base.py | grep evaluation

Repository: NVIDIA-NeMo/Skills

Length of output: 141


🏁 Script executed:

# Read the base metrics file
cat nemo_skills/evaluation/metrics/base.py | head -250

Repository: NVIDIA-NeMo/Skills

Length of output: 10587


🏁 Script executed:

# Search for _compute_pass_at_k implementation
grep -n "_compute_pass_at_k" nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 624


🏁 Script executed:

# Read more of base.py to find _compute_pass_at_k
wc -l nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 104


🏁 Script executed:

# Read the _compute_pass_at_k method from base.py
sed -n '200,350p' nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 6200


🏁 Script executed:

# Read the _compute_pass_at_k method
sed -n '352,420p' nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 3433


🏁 Script executed:

# Verify the get_metrics method and understand the issue better
sed -n '73,85p' nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 717


🏁 Script executed:

# Check if there's any initialization or default values in eval_dict
grep -A 10 "def __init__" nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 597


🏁 Script executed:

# Check the base class initialization
sed -n '25,35p' nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 525


🏁 Script executed:

# Check the reset method to see how eval_dict is initialized
sed -n '435,445p' nemo_skills/evaluation/metrics/base.py



🏁 Script executed:

# Let's trace through exactly what happens - check how metrics are accessed
# Check if get_metrics handles missing keys gracefully
sed -n '35,55p' nemo_skills/evaluation/metrics/base.py



🏁 Script executed:

# Let's verify the actual flow by checking OmniMetrics.update() completely
sed -n '94,102p' nemo_skills/evaluation/metrics/omni_metrics.py



🏁 Script executed:

# Check what happens with defaultdict when accessing missing keys
python3 << 'EOF'
from collections import defaultdict

# This is how eval_dict is structured
eval_dict = defaultdict(lambda: defaultdict(float))

# Simulate what happens when no metrics are added
score_dicts = [{}]  # Empty score dict

# This loop doesn't execute
for score_method in score_dicts[0].keys():
    eval_dict[f"pass@1"]["method"] += 1

# Now try to access keys like get_metrics does
print("Keys in eval_dict:", dict(eval_dict))
print("Keys in eval_dict['pass@1']:", dict(eval_dict["pass@1"]))

# Check if accessing missing key raises error or returns default
try:
    value = eval_dict["pass@1"]["judge_correct"]
    print(f"Value for judge_correct: {value}")
except KeyError as e:
    print(f"KeyError: {e}")
EOF



🏁 Script executed:

# Let's test what happens in the actual OmniMetrics.get_metrics() flow
python3 << 'EOF'
from collections import defaultdict

# Simulate the flow
eval_dict = defaultdict(lambda: defaultdict(float))

# When no metrics are computed (empty score dicts), nothing is added
# Simulate calling get_metrics from base class
metrics_dict = {}
for agg_mode, agg_metric_dict in eval_dict.items():
    metrics_dict[agg_mode] = {}
    for metric_key, metric_value in agg_metric_dict.items():
        if isinstance(metric_value, float):
            metrics_dict[agg_mode][metric_key] = 100.0 * metric_value / 1  # assume total=1

print("After base get_metrics:", metrics_dict)

# Now OmniMetrics.get_metrics() tries to access keys
if metrics_dict:  # Only if there are entries
    for agg_method, agg_metric_dict in metrics_dict.items():
        try:
            correct = agg_metric_dict["judge_correct"]
            print(f"Got judge_correct: {correct}")
        except KeyError as e:
            print(f"KeyError accessing judge_correct: {e}")
else:
    print("metrics_dict is empty, so the loop won't even execute")
EOF



🏁 Script executed:

# Let's check if there's always at least one agg_mode in eval_dict
grep -n "eval_dict\[" nemo_skills/evaluation/metrics/base.py | head -20



🏁 Script executed:

# Let's trace through a realistic scenario - when _compute_pass_at_k is called with None predicted_answers
python3 << 'EOF'
from collections import defaultdict

# Simulate OmniMetrics scenario
eval_dict = defaultdict(lambda: defaultdict(float))

# Simulate _compute_pass_at_k with empty score_dicts[0] and predicted_answers=None
score_dicts = [{}]  # Empty because no "judgement" field
predicted_answers = None
compute_no_answer = True

# The loop that depends on score_dicts[0]
for score_method in score_dicts[0].keys():
    # This doesn't execute
    eval_dict[f"pass@1"]["method"] += 1
    pass

# The no_answer logic
for k in range(1, 2):
    if predicted_answers is not None and compute_no_answer:
        # This condition is False because predicted_answers is None
        eval_dict[f"pass@{k}"]["no_answer"] += True
    pass

# Check what's in eval_dict
print("eval_dict after _compute_pass_at_k:", dict(eval_dict))
print("Is eval_dict empty?", len(eval_dict) == 0)

# Now simulate get_metrics flow
metrics_dict = {}
for agg_mode, agg_metric_dict in eval_dict.items():
    metrics_dict[agg_mode] = dict(agg_metric_dict)

print("metrics_dict from base get_metrics:", metrics_dict)

# Now OmniMetrics.get_metrics() tries to access
print("\nWould OmniMetrics.get_metrics() be called?")
if metrics_dict:
    print("  Yes, and it would try to access judge_correct, judge_incorrect, etc.")
    for agg_method, agg_metric_dict in metrics_dict.items():
        try:
            val = agg_metric_dict["judge_correct"]
            print(f"  Got {val}")
        except KeyError as e:
            print(f"  KeyError: {e}")
else:
    print("  No, metrics_dict is empty so the loop doesn't execute")
EOF



🏁 Script executed:

# Test the actual problematic scenario - mixed predictions with and without "judgement"
python3 << 'EOF'
from collections import defaultdict

# Simulate mixed case - some predictions have judgement, others don't
def _get_score_dict_omni(prediction):
    correctness_dict = {}
    if "judgement" in prediction:
        judgement = prediction["judgement"]
        correctness_dict["judge_correct"] = int(judgement.lower() == "a")
        correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
        correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
        correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
    return correctness_dict

# Test scenario: some predictions have judgement, others don't
predictions = [
    {"judgement": "A", "generation": "answer1"},
    {"generation": "answer2"},  # Missing judgement
]

score_dicts = [_get_score_dict_omni(pred) for pred in predictions]
print("score_dicts:", score_dicts)

# Now try _compute_pass_at_k logic
try:
    for score_method in score_dicts[0].keys():
        print(f"Processing score_method: {score_method}")
        # This is line 369 in _compute_pass_at_k
        scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]
        print(f"  scores_list: {scores_list}")
except KeyError as e:
    print(f"KeyError: {e}")
    print(f"Trying to access '{e.args[0]}' in one of the score_dicts")
EOF



🏁 Script executed:

# Check the update() method to see if there are any guards
sed -n '94,102p' nemo_skills/evaluation/metrics/omni_metrics.py



🏁 Script executed:

# Also check if OmniMetrics is used with consistent data
grep -r "OmniMetrics" nemo_skills/evaluation/ | grep -v ".pyc" | head -10



🏁 Script executed:

# Check if there's any validation or guarantee that all predictions have consistent structure
grep -B 5 -A 10 "def _compute_pass_at_k" nemo_skills/evaluation/metrics/base.py | head -25



🏁 Script executed:

# Test the exact error scenario again with the actual code
python3 << 'EOF'
# Reproduce the exact crash with line numbers from base.py
import math
from collections import defaultdict

# Simulate BaseMetrics._compute_pass_at_k
def _get_score_dict(prediction):
    """OmniMetrics version"""
    correctness_dict = {}
    if "judgement" in prediction:
        judgement = prediction["judgement"]
        correctness_dict["judge_correct"] = int(judgement.lower() == "a")
        correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
        correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
        correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
    return correctness_dict

def test_crash():
    predictions = [
        {"judgement": "A", "generation": "answer1"},
        {"generation": "answer2"},  # Missing judgement - will cause crash
    ]
    
    eval_dict = defaultdict(lambda: defaultdict(float))
    score_dicts = [_get_score_dict(pred) for pred in predictions]
    
    print(f"score_dicts = {score_dicts}")
    print(f"score_dicts[0].keys() = {list(score_dicts[0].keys())}")
    
    try:
        # This is line 368 in base.py _compute_pass_at_k
        for score_method in score_dicts[0].keys():
            # This is line 369 in base.py
            scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]
            print(f"Success: {score_method} = {scores_list}")
    except KeyError as e:
        print(f"CRASH at line 369: KeyError accessing '{e.args[0]}' in score_dicts[1]")
        return False
    return True

test_crash()
EOF



KeyError crash when predictions have inconsistent "judgement" field.

When some (but not all) predictions lack the "judgement" field, _get_score_dict returns an empty dict for those predictions. This causes a KeyError crash in _compute_pass_at_k (base.py line 369) when trying to build scores_list:

scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]

If score_dicts[0] has keys like "judge_correct" but score_dicts[i] is empty, accessing score_dicts[i]["judge_correct"] raises KeyError.

Validate that all predictions in a batch contain the "judgement" field, or ensure _get_score_dict returns consistent keys across all predictions (e.g., with default values).
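One way to satisfy the second option — consistent keys across all predictions — is to always emit every judgement key, defaulting to 0 when the field is absent. The sketch below is hypothetical (a standalone stand-in for the method under review, using the A–D judgement letters discussed in this thread), not the merged implementation:

```python
# Hypothetical _get_score_dict variant that always returns the same keys,
# so downstream code like _compute_pass_at_k never hits a KeyError when
# some predictions lack the "judgement" field.
JUDGEMENT_KEYS = {
    "a": "judge_correct",
    "b": "judge_incorrect",
    "c": "judge_partially_correct",
    "d": "judge_abstained",
}

def get_score_dict(prediction: dict) -> dict:
    judgement = prediction.get("judgement", "").lower()
    # Every key is always present; a missing judgement scores 0 everywhere.
    return {name: int(judgement == letter) for letter, name in JUDGEMENT_KEYS.items()}

predictions = [
    {"judgement": "A", "generation": "answer1"},
    {"generation": "answer2"},  # missing judgement no longer crashes
]
score_dicts = [get_score_dict(p) for p in predictions]
for score_method in score_dicts[0].keys():
    scores_list = [d[score_method] for d in score_dicts]  # safe: keys are uniform
```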

Comment on lines +71 to +85
def get_metrics(self):
    metrics = super().get_metrics()

    for agg_method, agg_metric_dict in metrics.items():
        correct, incorrect, part_correct, abstained = (
            agg_metric_dict["judge_correct"],
            agg_metric_dict["judge_incorrect"],
            agg_metric_dict["judge_partially_correct"],
            agg_metric_dict["judge_abstained"],
        )
        metrics[agg_method]["judge_omni_index"] = (
            100 * (correct - incorrect) / (correct + incorrect + part_correct + abstained)
        )
        metrics[agg_method]["judge_omni_hallucination"] = 100 * incorrect / (incorrect + part_correct + abstained)
    return metrics

⚠️ Potential issue | 🟠 Major

Potential ZeroDivisionError in metric calculations.

Two division operations can fail:

  1. Lines 82-83: (correct + incorrect + part_correct + abstained) equals zero if no judgements exist.
  2. Line 84: (incorrect + part_correct + abstained) equals zero when all responses are judge_correct (judgement "A").

This will crash metrics computation in edge cases (empty data or perfect scores).

Suggested fix with guards
     def get_metrics(self):
         metrics = super().get_metrics()

         for agg_method, agg_metric_dict in metrics.items():
             correct, incorrect, part_correct, abstained = (
                 agg_metric_dict["judge_correct"],
                 agg_metric_dict["judge_incorrect"],
                 agg_metric_dict["judge_partially_correct"],
                 agg_metric_dict["judge_abstained"],
             )
-            metrics[agg_method]["judge_omni_index"] = (
-                100 * (correct - incorrect) / (correct + incorrect + part_correct + abstained)
-            )
-            metrics[agg_method]["judge_omni_hallucination"] = 100 * incorrect / (incorrect + part_correct + abstained)
+            total = correct + incorrect + part_correct + abstained
+            non_correct_total = incorrect + part_correct + abstained
+            
+            metrics[agg_method]["judge_omni_index"] = (
+                100 * (correct - incorrect) / total if total > 0 else 0.0
+            )
+            metrics[agg_method]["judge_omni_hallucination"] = (
+                100 * incorrect / non_correct_total if non_correct_total > 0 else 0.0
+            )
         return metrics
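As a sanity check on the guarded formulas, here is a standalone arithmetic sketch mirroring the suggested fix (the counts are made up for illustration):

```python
def omni_metrics(correct, incorrect, part_correct, abstained):
    """Guarded versions of the two derived metrics from the review above."""
    total = correct + incorrect + part_correct + abstained
    non_correct_total = incorrect + part_correct + abstained
    omni_index = 100 * (correct - incorrect) / total if total > 0 else 0.0
    hallucination = 100 * incorrect / non_correct_total if non_correct_total > 0 else 0.0
    return omni_index, hallucination

# Ordinary case: 6 correct, 2 incorrect, 1 partially correct, 1 abstained.
print(omni_metrics(6, 2, 1, 1))   # (40.0, 50.0)
# Edge cases that crash the unguarded version:
print(omni_metrics(0, 0, 0, 0))   # no judgements at all -> (0.0, 0.0)
print(omni_metrics(10, 0, 0, 0))  # perfect score -> (100.0, 0.0)
```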

Comment on lines +96 to +100
def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:
        self._compute_reward_at_k(predictions=predictions)

⚠️ Potential issue | 🟡 Minor

IndexError risk with empty predictions.

Line 99 accesses predictions[0] without checking if predictions is non-empty. If update is called with an empty list, this will raise IndexError.

Suggested fix
 def update(self, predictions):
     super().update(predictions)
+    if not predictions:
+        return
     self._compute_pass_at_k(predictions, None)
     if "reward_model_score" in predictions[0]:
         self._compute_reward_at_k(predictions=predictions)
🤖 Prompt for AI Agents
In @nemo_skills/evaluation/metrics/omni_metrics.py around lines 96 - 100, The
update method accesses predictions[0] without checking for an empty list,
risking IndexError; after calling super().update(predictions) add a guard like
"if not predictions: return" to avoid further processing on an empty list, or at
minimum change the reward-model check to "if predictions and
'reward_model_score' in predictions[0]:"; ensure this guard is applied before
calling _compute_pass_at_k and _compute_reward_at_k so both methods aren't
invoked with an empty predictions list (refer to the update method and helpers
_compute_pass_at_k and _compute_reward_at_k).

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 6 comments

@@ -0,0 +1,26 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Copyright year is 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

@@ -0,0 +1,80 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Copyright year is 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

args = parse_args()

dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]

jsonl_data variable is assigned but never used - can be removed

@@ -0,0 +1,124 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Copyright year is 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.


# If no valid answers, it's incorrect
if not valid_answers_and_results:
    is_correct = False

is_correct variable is assigned but never used - can be removed

Comment on lines +85 to +88
metrics[agg_method]["judge_omni_hallucination"] = (
    100 * incorrect / (incorrect + part_correct + abstained)
    if (incorrect + part_correct + abstained) > 0 else 0
)

Consider verifying the denominator logic - if all predictions are correct, this returns 0, but the hallucination metric definition may need clarification for this edge case

Froxyy-dev and others added 17 commits January 13, 2026 10:27
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
…added data-dependent system prompt formatting

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 2 comments

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: if every prediction lacks the judgement field, score_dicts is a list of empty dicts and score_dicts[0].keys() yields nothing, so scoring is silently skipped; if the predictions list itself is empty, accessing score_dicts[0] raises an IndexError

Suggested change
for score_method in score_dicts[0].keys():
if not score_dicts or not score_dicts[0]:
continue
for score_method in score_dicts[0].keys():

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions list is empty - accessing predictions[0] without checking if list is non-empty

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: .format() on system messages will unescape {{ to { in existing prompts. Files like nemo_skills/prompt/config/generic/hle.yaml (line 5-7: {{your explanation}}, {{your succinct, final answer}}, {{your confidence score}}) and nemo_skills/prompt/config/qwen/math-cot.yaml (line 3: \boxed{{}}) use {{ to display literal braces to the model. After this change, these become {your explanation}, {your answer}, and \boxed{}, which will either cause KeyError if those keys don't exist in input_dict, or worse, silently replace them with actual values if keys happen to match.

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system if isinstance(self.config.system, str) and '{' not in self.config.system else self.config.system.format(**input_dict) if '{' in self.config.system and '{{' not in self.config.system else self.config.system},

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 3 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: calling .format(**input_dict) will convert {{ to { in existing system prompts that use double braces for literal display. For example, nemo_skills/prompt/config/qwen/math-cot.yaml:3 has \boxed{{}} which becomes \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 has {{your explanation}} which becomes {your explanation} - this will either cause KeyError if those keys don't exist in input_dict, or silently replace them with values if keys happen to match

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system.format_map(defaultdict(str, **input_dict))},

was the intention to unescape double braces in existing prompts, or should literal braces be preserved?
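The brace-unescaping behaviour both reviewers describe is plain str.format semantics and is easy to reproduce in isolation, independent of Nemo-Skills (the template string below is a made-up stand-in for the prompt configs mentioned):

```python
from collections import defaultdict

template = r"Put the final answer in \boxed{{}} with {confidence} percent confidence."

# .format() with a matching key fills {confidence} AND collapses {{}} to {}.
print(template.format(confidence=90))
# Put the final answer in \boxed{} with 90 percent confidence.

# .format() with a missing key raises KeyError:
try:
    template.format()
except KeyError as e:
    print("KeyError:", e)

# format_map over a defaultdict avoids the KeyError (missing keys become ""),
# but it still unescapes the literal {{ }} braces:
print(template.format_map(defaultdict(str)))
```

So format_map only papers over the KeyError; any prompt relying on {{ to display literal braces is still rewritten either way.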

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions is empty - accessing predictions[0] without checking length

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: KeyError if score_dicts is empty (when all predictions lack judgement field) - trying to access .keys() on empty list's first element

Suggested change
for score_method in score_dicts[0].keys():
if not score_dicts or not score_dicts[0]:
continue
for score_method in score_dicts[0].keys():

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 4 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: breaks existing prompts using {{ for literal braces. Files like nemo_skills/prompt/config/generic/hle.yaml (lines 5-7: {{your explanation}}, {{your succinct, final answer}}) and nemo_skills/prompt/config/eval/aai/math.yaml (lines 4,8: \boxed{{}}) use {{ to display literal braces. .format() converts {{ to {, causing either KeyError if keys don't exist, or unintended replacements if they do.

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system},

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions is empty - accessing predictions[0] without checking length first

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: IndexError if all predictions lack judgement field - score_dicts will be a list of empty dicts, causing score_dicts[0].keys() to fail when trying to iterate

Suggested change
for score_method in score_dicts[0].keys():
if not score_dicts or not score_dicts[0]:
continue
for score_method in score_dicts[0].keys():

Comment on lines +63 to +64
dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]

style: jsonl_data created but never used - can be removed

Suggested change
dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]
dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 2 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: .format(**input_dict) will unescape {{ to { in existing system prompts. Files like nemo_skills/prompt/config/qwen/math-cot.yaml:3 use \boxed{{}} which will become \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 use {{your explanation}}, {{your succinct, final answer}} which will become {your explanation}, {your answer} - causing KeyError or unintended replacements

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system.format_map(defaultdict(str, input_dict))},

or use a safer approach that only formats when needed. Was the intention to unescape double braces in existing prompts, or should they be preserved?

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: IndexError if predictions is empty - accessing score_dicts[0] without checking length

Suggested change
for score_method in score_dicts[0].keys():
for score_method in (score_dicts[0].keys() if score_dicts else []):

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 2 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: calling .format(**input_dict) unescapes {{ to { in existing system prompts. Prompts like nemo_skills/prompt/config/qwen/math-cot.yaml:3 (\boxed{{}}) become \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 ({{your explanation}}, {{your succinct, final answer}}) become {your explanation}, {your answer} - causing KeyError if those keys don't exist in input_dict, or unintended replacements if they do

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system.format_map(defaultdict(str, **input_dict))},

alternatively, only format when the system message contains the specific keys from input_dict, or escape existing braces before formatting. was the intention to unescape {{}} in existing prompts, or should literal braces be preserved?

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions list is empty - accessing predictions[0] without checking length first

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

@arnavkomaragiri arnavkomaragiri merged commit ba25265 into main Jan 16, 2026
6 checks passed
@arnavkomaragiri arnavkomaragiri deleted the akomaragiri/aai_omniscience branch January 16, 2026 23:25
@coderabbitai coderabbitai bot mentioned this pull request Feb 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: Arnav Komaragiri <arnav.komaragiri@gmail.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: Arnav Komaragiri <arnav.komaragiri@gmail.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>


8 participants