
Added AAI-Omniscience Benchmark#1161

Merged
arnavkomaragiri merged 48 commits into main from akomaragiri/aai_omniscience
Jan 16, 2026

Conversation

Collaborator

arnavkomaragiri commented Jan 12, 2026

Draft of the AAI-Omniscience Benchmark in NeMo-Skills; will clean up before merging.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added support for AA-Omniscience dataset evaluation with specialized metrics
    • Introduced judge-based correctness scoring for comprehensive evaluation assessment
    • Enhanced prompt configuration with dynamic system message handling


@@ -0,0 +1,45 @@
import argparse
Collaborator


This is not a Nemo-Skills file, right?

Collaborator Author


It's not, just a basic debug file I added for testing since ns wasn't working in my environment. I'll delete it to clean things up.

self.answer_key = answer_key

# use same RM code as MathMetrics
def _compute_reward_at_k(self, predictions: list[dict]):
Collaborator


Is this function required for all datasets? It seems like math has it, but most of the datasets don't.

Collaborator Author


I don't think it's required, but I figured it might be helpful since a reward model could serve as a proxy for a judge model and may be useful for the downstream task. It's not critical to the benchmark itself, though, so we can drop it if need be.
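A toy illustration of the reward-as-proxy idea above (plain dicts with hypothetical field values; the field names mirror the metrics code in this PR):

```python
# Two sampled answers for one question; the reward model scores each,
# and rm_best@k picks the highest-reward candidate among the first k.
predictions = [
    {"answer": "4", "reward_model_score": 0.9, "judge_correct": 1},
    {"answer": "5", "reward_model_score": 0.2, "judge_correct": 0},
]

best = max(predictions, key=lambda p: p["reward_model_score"])
print(best["answer"], best["judge_correct"])  # -> 4 1
```

Here the reward model happens to agree with the judge, so rm_best@2 would count this question as correct.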

if self.config.system is not None:
messages = [
{"role": "system", "content": self.config.system},
{"role": "system", "content": self.config.system.format(**input_dict)},
Collaborator


This change touches all Nemo-Skills code, not just omniscience. If it's really required, maybe ask Igor to validate?

Collaborator Author


This is required since the system prompt is data-dependent for Omniscience (it uses the topic and domain fields); I'll ask Igor to validate that it doesn't break other code.
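A minimal sketch of why this is needed, with a hypothetical template standing in for the omni.yaml system prompt (the domain/topic/question field names come from the dataset entries described in this PR):

```python
# Hypothetical system template; the real one lives in
# nemo_skills/prompt/config/eval/aai/omni.yaml.
system_template = (
    "You are answering a {domain} question about {topic}. "
    "Answer directly, or state that you cannot."
)

input_dict = {"domain": "law", "topic": "contracts", "question": "..."}

# Data-dependent system message: filled per example, which a static
# system string (the old behavior) cannot express.
messages = [
    {"role": "system", "content": system_template.format(**input_dict)},
]
print(messages[0]["content"])
```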

RUN pip install --no-cache-dir -r /opt/NeMo-Skills/requirements/main.txt
# Fix http mismatch between lepton and ddgs by manually installing ddgs here
RUN pip install ddgs
RUN pip install func-timeout
Collaborator


This is handled in the current NeMo-Skills docker.

arnavkomaragiri force-pushed the akomaragiri/aai_omniscience branch from c5a60b8 to 5fc257b on January 13, 2026 17:43
arnavkomaragiri requested a review from Kipok on January 13, 2026 18:07
arnavkomaragiri marked this pull request as ready for review on January 13, 2026 18:07
Contributor

greptile-apps bot commented Jan 13, 2026

Greptile Summary

Adds AAI-Omniscience benchmark evaluation with judge-based correctness scoring and specialized hallucination metrics. The implementation includes dataset preparation, custom metrics computation (omni-index and hallucination rate), and comprehensive documentation with configuration examples.

Critical Issue:

  • nemo_skills/prompt/utils.py:267 - The change to call .format(**input_dict) on system messages breaks existing prompts that use {{}} for literal braces. Multiple existing prompts use \boxed{{}} (math-cot.yaml, math-tir.yaml, etc.) and {{your explanation}} (hle.yaml) which will be unescaped to single braces, causing KeyError or unintended replacements
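The breakage is reproducible with nothing but str.format; the template strings below are illustrative stand-ins for the prompt files named above:

```python
# {{ }} is format's escape for a literal brace: one .format() pass
# consumes the escape, silently changing the rendered prompt.
template = r"Put the final answer in \boxed{{}}."
print(template.format())  # -> Put the final answer in \boxed{}.

# A prompt written without format in mind fails outright: a bare
# {your explanation} placeholder is looked up as a (missing) key.
legacy = "Respond with {your explanation}"
try:
    legacy.format(domain="math", topic="algebra")
except KeyError as err:
    print("KeyError:", err)
```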

Other Issues:

  • nemo_skills/evaluation/metrics/omni_metrics.py:121 - Potential IndexError when checking predictions[0] without verifying the list is non-empty

Confidence Score: 2/5

  • Not safe to merge - contains breaking change to existing prompts
  • The .format() change in prompt/utils.py will break all existing prompts that use {{}} for literal braces (like \boxed{{}}), causing either KeyError or unintended string replacements. Additionally, there's an IndexError risk in omni_metrics.py when accessing empty predictions
  • Critical attention needed for nemo_skills/prompt/utils.py (breaking change) and nemo_skills/evaluation/metrics/omni_metrics.py (potential crash)

Important Files Changed

  • nemo_skills/prompt/utils.py: Added .format(**input_dict) to system messages; breaks existing prompts with {{}} literal braces
  • nemo_skills/evaluation/metrics/omni_metrics.py: New metrics implementation with potential IndexError on empty predictions at line 121
  • nemo_skills/dataset/omniscience/prepare.py: Dataset preparation script; clean implementation, with an unused variable at line 64

Sequence Diagram

sequenceDiagram
    participant User
    participant Eval Pipeline
    participant Prompt Utils
    participant Model
    participant Judge
    participant OmniMetrics

    User->>Eval Pipeline: eval(benchmarks="omniscience")
    Eval Pipeline->>Prompt Utils: Load omni.yaml config
    Prompt Utils->>Prompt Utils: system.format(domain, topic, question)
    Prompt Utils->>Model: Generate answer
    Model-->>Prompt Utils: generation response
    Prompt Utils-->>Eval Pipeline: predictions with generation
    
    Eval Pipeline->>Judge: Load aa-omni-judge.yaml
    Judge->>Judge: Compare generation vs expected_answer
    Judge-->>Eval Pipeline: judgement (A/B/C/D)
    
    Eval Pipeline->>OmniMetrics: update(predictions)
    OmniMetrics->>OmniMetrics: _get_score_dict(judgement)
    OmniMetrics->>OmniMetrics: _compute_pass_at_k()
    alt reward_model_score exists
        OmniMetrics->>OmniMetrics: _compute_reward_at_k()
    end
    OmniMetrics->>OmniMetrics: get_metrics()
    OmniMetrics->>OmniMetrics: Calculate omni_index & hallucination_rate
    OmniMetrics-->>Eval Pipeline: Final metrics
    Eval Pipeline-->>User: Results with accuracy, omni-index, hallucination rate

Contributor

greptile-apps bot left a comment


6 files reviewed, 6 comments


@@ -0,0 +1,26 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year is 2026 (future year)

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

@@ -0,0 +1,80 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year is 2026 (future year)

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

@@ -0,0 +1,120 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
Contributor


Copyright year is 2026 (future year)

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

args = parse_args()

dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]
Contributor


jsonl_data variable is unused


# If no valid answers, it's incorrect
if not valid_answers_and_results:
is_correct = False
Contributor


is_correct variable is assigned but never used

Contributor

coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

This PR introduces the "omniscience" evaluation metric system for the AA-Omniscience dataset. It adds dataset preparation logic, a new metrics class that computes evaluation scores using judge signals and reward model scoring, prompt configurations for evaluation and judging, and integration into the existing metrics framework.

Changes

  • Omniscience Dataset Module (nemo_skills/dataset/omniscience/__init__.py): Exports configuration constants for omniscience evaluation: dataset group, metrics type, generation arguments, evaluation split, judge pipeline details (using the gemini-2.5-flash model), and judge prompt configuration.
  • Omniscience Dataset Preparation (nemo_skills/dataset/omniscience/prepare.py): New dataset preparation script that loads the AA-Omniscience-Public dataset, maps topics to split names, standardizes entry format (id, domain, topic, question, expected_answer), and writes per-split JSONL outputs, including a full "text" split and per-domain filtered splits.
  • Omniscience Metrics (nemo_skills/evaluation/metrics/omni_metrics.py): New OmniMetrics class (extends BaseMetrics) that evaluates predictions using judge correctness signals and reward model scores. Includes pass-at-k computation with best/majority selection, omniscience index and hallucination metrics derived from judge flags, and configurable no_answer tracking.
  • Metrics Registration (nemo_skills/evaluation/metrics/map_metrics.py): Registers the OmniMetrics class in METRICS_MAP under the "omniscience" key for framework integration.
  • Prompt Configurations (nemo_skills/prompt/config/eval/aai/omni.yaml, nemo_skills/prompt/config/judge/aa-omni-judge.yaml): Adds an evaluation prompt that constrains answers to direct responses and explicit inability statements, and a detailed judge prompt with a grading rubric covering CORRECT, INCORRECT, PARTIAL_ANSWER, and NOT_ATTEMPTED categories with examples and edge-case handling.
  • Prompt Utilities Update (nemo_skills/prompt/utils.py): Modified the fill() function to apply Python str.format() to system message content, enabling dynamic variable substitution in system prompts.

Sequence Diagrams

sequenceDiagram
    participant User
    participant PrepareScript as Prepare Script
    participant HFDataset as HuggingFace Dataset
    participant FileSystem
    
    User->>PrepareScript: python prepare.py --splits text,math,...
    PrepareScript->>HFDataset: Load AA-Omniscience-Public
    HFDataset-->>PrepareScript: Dataset loaded
    
    loop For each split (text + per-domain)
        PrepareScript->>PrepareScript: Format entries (id, domain, topic, question, answer)
        alt text split
            PrepareScript->>PrepareScript: Use full dataset
        else domain split
            PrepareScript->>HFDataset: Filter by domain
            HFDataset-->>PrepareScript: Filtered entries
        end
        PrepareScript->>FileSystem: Write split_name.jsonl
    end
    
    FileSystem-->>User: JSONL files generated
sequenceDiagram
    participant Evaluator
    participant OmniMetrics
    participant BaseMetrics
    participant Judge
    participant RewardModel
    
    Evaluator->>OmniMetrics: update(predictions)
    OmniMetrics->>BaseMetrics: Parent update logic
    
    loop For each prediction
        OmniMetrics->>OmniMetrics: _get_score_dict (extract judge_correct, etc.)
        alt reward_model_score exists
            OmniMetrics->>RewardModel: Evaluate prediction
            RewardModel-->>OmniMetrics: Score returned
            OmniMetrics->>OmniMetrics: _compute_reward_at_k (best/majority selection)
        end
    end
    
    OmniMetrics->>OmniMetrics: get_metrics (compute judge_omni_index, hallucination)
    OmniMetrics-->>Evaluator: Augmented metrics returned

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • Kipok
  • titu1994
  • ekmb
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 8.33%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: The title 'Added AAI-Omniscience Benchmark' accurately reflects the main change: introducing the AAI-Omniscience benchmark to the nemo_skills repository, as evidenced by new modules, evaluation metrics, dataset handling, and configuration files.



Contributor

coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @nemo_skills/dataset/omniscience/__init__.py:
- Around line 21-25: JUDGE_PIPELINE_ARGS currently sets the model to the preview
variant "gemini-2.5-flash-preview-09-2025"; update the "model" value in the
JUDGE_PIPELINE_ARGS dict to the stable GA name "gemini-2.5-flash" (leave other
keys like "server_type" and "server_address" unchanged) so the code uses the
supported GA model.

In @nemo_skills/evaluation/metrics/omni_metrics.py:
- Around line 96-100: The update method accesses predictions[0] without checking
for an empty list, risking IndexError; after calling super().update(predictions)
add a guard like "if not predictions: return" to avoid further processing on an
empty list, or at minimum change the reward-model check to "if predictions and
'reward_model_score' in predictions[0]:"; ensure this guard is applied before
calling _compute_pass_at_k and _compute_reward_at_k so both methods aren't
invoked with an empty predictions list (refer to the update method and helpers
_compute_pass_at_k and _compute_reward_at_k).
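A standalone sketch of the guard being requested (a stub class, not the real OmniMetrics; only the control flow is the point):

```python
class OmniMetricsSketch:
    """Stub showing the early-return guard; not the actual implementation."""

    def __init__(self):
        self.pass_at_k_calls = 0

    def _compute_pass_at_k(self, predictions):
        _ = predictions[0]  # would raise IndexError on an empty list
        self.pass_at_k_calls += 1

    def update(self, predictions):
        if not predictions:  # the guard: nothing to score
            return
        self._compute_pass_at_k(predictions)
        if "reward_model_score" in predictions[0]:
            pass  # _compute_reward_at_k(predictions) would run here


m = OmniMetricsSketch()
m.update([])  # safe: guard short-circuits before any indexing
m.update([{"reward_model_score": 0.5}])
print(m.pass_at_k_calls)  # -> 1
```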
🧹 Nitpick comments (4)
nemo_skills/dataset/omniscience/prepare.py (2)

64-64: Remove unused variable.

jsonl_data is computed but never used. This appears to be dead code from earlier iterations.

🧹 Suggested fix
     dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
-    jsonl_data = [format_entry(d) for d in dataset]
     output_dir = Path(__file__).absolute().parent

68-74: Lambda closure captures loop variable by reference.

The lambda lambda x: x["domain"] == t captures t by reference. In a dict comprehension, this is typically problematic because all lambdas would reference the final value of t. While dataset.filter() likely evaluates immediately (avoiding the bug), this is fragile and flagged by static analysis (B023).

♻️ Recommended fix using default argument capture
     splits = {
         "text": dataset,
         **{
-            TOPIC_TO_SPLIT_MAP.get(t, str(t).lower()): dataset.filter(lambda x: x["domain"] == t)
+            TOPIC_TO_SPLIT_MAP.get(t, str(t).lower()): dataset.filter(lambda x, domain=t: x["domain"] == domain)
             for t in dataset.unique("domain")
         },
     }
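The hazard is easy to see outside of Dataset.filter with plain Python; every late-bound lambda reads the loop variable after the loop has finished:

```python
domains = ["math", "law", "health"]

# Late binding: each lambda looks up t at call time, when t == "health".
late = {t: (lambda x: x == t) for t in domains}
print(late["math"]("health"))  # -> True (wrong predicate!)

# Default-argument capture freezes the value at definition time.
fixed = {t: (lambda x, domain=t: x == domain) for t in domains}
print(fixed["math"]("math"), fixed["math"]("health"))  # -> True False
```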
nemo_skills/evaluation/metrics/omni_metrics.py (2)

87-94: In-place mutation may cause unexpected side effects.

get_incorrect_sample mutates the input prediction dict directly. If the caller doesn't expect this, it could lead to subtle bugs. Consider returning a copy instead.

Safer approach using copy
 def get_incorrect_sample(self, prediction: dict) -> dict:
+    prediction = prediction.copy()
     if "judgement" in prediction:
         prediction["judgement"] = "B"
         prediction["judge_correct"] = 0
         prediction["judge_incorrect"] = 1
         prediction["judge_partially_correct"] = 0
         prediction["judge_abstained"] = 0
     return prediction

15-17: Import directly from base module for clarity and consistency.

BaseMetrics, as_int, and as_percentage are all defined in base.py, not math_metrics.py. While importing from math_metrics.py works because it re-exports these symbols, the codebase standard (used in all other metrics files) is to import directly from base.py.

Suggested import path
-from nemo_skills.evaluation.metrics.math_metrics import BaseMetrics, as_int, as_percentage
+from nemo_skills.evaluation.metrics.base import BaseMetrics, as_int, as_percentage
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 079b106 and b154a4a.

📒 Files selected for processing (7)
  • nemo_skills/dataset/omniscience/__init__.py
  • nemo_skills/dataset/omniscience/prepare.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/omni_metrics.py
  • nemo_skills/prompt/config/eval/aai/omni.yaml
  • nemo_skills/prompt/config/judge/aa-omni-judge.yaml
  • nemo_skills/prompt/utils.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

Applied to files:

  • nemo_skills/prompt/config/judge/aa-omni-judge.yaml
🧬 Code graph analysis (2)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/omni_metrics.py (1)
  • OmniMetrics (20-120)
nemo_skills/evaluation/metrics/omni_metrics.py (2)
nemo_skills/evaluation/metrics/base.py (4)
  • BaseMetrics (23-434)
  • as_int (443-446)
  • as_percentage (437-440)
  • _compute_pass_at_k (352-423)
nemo_skills/evaluation/metrics/map_metrics.py (1)
  • get_metrics (81-110)
🪛 Ruff (0.14.11)
nemo_skills/dataset/omniscience/prepare.py

71-71: Function definition does not bind loop variable t

(B023)

nemo_skills/evaluation/metrics/omni_metrics.py

34-34: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (8)
nemo_skills/prompt/utils.py (1)

265-268: LGTM! Dynamic system message formatting enables data-driven prompts.

This change aligns with the existing pattern used for user messages (line 193) and enables the new omni.yaml config to substitute {domain} and {topic} at runtime. The behavior is consistent—if input_dict lacks required keys, a KeyError will be raised, matching how user message formatting already behaves.

nemo_skills/prompt/config/eval/aai/omni.yaml (1)

1-7: LGTM! Well-structured evaluation prompt config.

The prompt correctly uses placeholders ({domain}, {topic}, {question}) that align with the fields produced by prepare.py's format_entry(). The system prompt appropriately instructs the model to provide direct answers and explicitly state when it lacks sufficient context.

nemo_skills/prompt/config/judge/aa-omni-judge.yaml (1)

1-99: LGTM! Comprehensive and well-documented judge prompt.

The grading rubric is thorough with clear distinctions:

  • Numeric precision rules (lines 14-16) correctly differentiate between measurement values (allow rounding) vs. identifiers/versions (require exact match).
  • Edge cases for typos, inferred context, and coding equivalence are well-covered.
  • The A/B/C/D mapping (lines 93-97) provides unambiguous output parsing.

The placeholders {question}, {expected_answer}, and {generation} align with the evaluation pipeline's data flow.

nemo_skills/dataset/omniscience/__init__.py (1)

16-16: Verify DATASET_GROUP value.

DATASET_GROUP = "math" seems inconsistent with the AA-Omniscience benchmark, which covers diverse domains (humanities, health, law, finance, SWE, STEM). Is this intentional for framework compatibility, or should it be a more generic value like "omniscience" or "knowledge"?

nemo_skills/evaluation/metrics/omni_metrics.py (2)

20-23: LGTM!

Constructor correctly initializes the parent class and stores the answer_key attribute.


102-119: LGTM!

evaluations_to_print and metrics_to_print are correctly implemented, returning appropriate evaluation labels and metric formatters consistent with the base class patterns.

nemo_skills/evaluation/metrics/map_metrics.py (2)

41-41: LGTM!

Import follows the existing alphabetical ordering convention in the file.


76-76: LGTM!

Registration of "omniscience" metric type in METRICS_MAP follows the established pattern and correctly maps to OmniMetrics.

Comment on lines +21 to +25
JUDGE_PIPELINE_ARGS = {
"model": "gemini-2.5-flash-preview-09-2025",
"server_type": "gemini",
"server_address": "https://generativelanguage.googleapis.com",
}
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Gemini 2.5 flash model versions January 2026 availability

💡 Result:

Short answer — as of January 13, 2026:

  • Gemini 2.5 Flash (gemini-2.5-flash) — GA, available in the Gemini app, Google AI Studio and Vertex AI (launched June 17, 2025; scheduled retirement listed as June 17, 2026). [1][2]
  • Gemini 2.5 Flash‑Lite (gemini-2.5-flash-lite) — GA, available to developers (released July 22, 2025; earliest retirement July 22, 2026). [2][3]
  • Gemini 2.5 Flash‑Image (gemini-2.5-flash-image) — available (released Oct 2, 2025). [2][3]
  • Live / native‑audio variants — there were preview/live variants (e.g., gemini-live-2.5-flash and native‑audio preview) that were preview/private GA; some Live preview model IDs were shut down or replaced in Dec 2025, and newer Gemini 3 Flash rollout is superseding parts of the 2.5 Flash lineup. Check the Gemini API changelog and Vertex AI model pages for exact model IDs and shutdown dates. [1][2][4]

Sources: Google blog & Gemini changelog / Vertex AI docs and recent coverage. [1][2][3][4]

References
[1] Google blog (Gemini updates).
[2] Vertex AI — Gemini 2.5 Flash model page.
[3] Vertex AI — model versions & lifecycle.
[4] Gemini API changelog / deprecations.


Replace with stable GA model version.

The model "gemini-2.5-flash-preview-09-2025" is a preview variant. As of January 2026, preview Gemini 2.5 variants were shut down or replaced in December 2025. Use the stable GA version "gemini-2.5-flash" instead (available through June 17, 2026).

🤖 Prompt for AI Agents
In @nemo_skills/dataset/omniscience/__init__.py around lines 21 - 25,
JUDGE_PIPELINE_ARGS currently sets the model to the preview variant
"gemini-2.5-flash-preview-09-2025"; update the "model" value in the
JUDGE_PIPELINE_ARGS dict to the stable GA name "gemini-2.5-flash" (leave other
keys like "server_type" and "server_address" unchanged) so the code uses the
supported GA model.

Comment on lines +26 to +59
def _compute_reward_at_k(self, predictions: list[dict]):
score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
for score_method in score_dicts[0].keys():
# Get valid answers and their results for this field
valid_answers_and_results = [
(elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
for elem, correctness_dict in zip(predictions[:k], score_dicts[:k])
if elem[self.answer_key] is not None
]

# If no valid answers, it's incorrect
if not valid_answers_and_results:
is_correct = False
else:
is_correct_best = sorted(valid_answers_and_results, key=lambda x: x[2], reverse=True)[0][1]
self.eval_dict[f"rm_best@{k}"][score_method] += is_correct_best

answer_to_score_dict = defaultdict(float)
answer_to_correctness_dict = {}
for predicted_answer, is_correct, reward_score in valid_answers_and_results:
answer_to_score_dict[predicted_answer] += reward_score
answer_to_correctness_dict[predicted_answer] = is_correct

top_cum_reward_answer = sorted(
list(answer_to_score_dict.items()), key=lambda x: x[1], reverse=True
)[0][0]
is_correct_majority = answer_to_correctness_dict[top_cum_reward_answer]
self.eval_dict[f"rm_majority@{k}"][score_method] += is_correct_majority

no_answer = all(elem[self.answer_key] is None for elem in predictions[:k])
self.eval_dict[f"rm_best@{k}"]["no_answer"] += no_answer
self.eval_dict[f"rm_majority@{k}"]["no_answer"] += no_answer
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Multiple issues in _compute_reward_at_k.

  1. Potential IndexError (Line 30): If predictions is empty, score_dicts will be empty and score_dicts[0].keys() will raise IndexError.

  2. Dead code (Line 40): is_correct = False is assigned but never used.

  3. Variable shadowing (Line 47): Loop variable is_correct shadows the outer is_correct from line 40, causing confusion.

  4. Missing strict= on zip (Line 34): Per static analysis, adding strict=True would catch length mismatches.

Suggested fix
 def _compute_reward_at_k(self, predictions: list[dict]):
+    if not predictions:
+        return
+
     score_dicts = [self._get_score_dict(pred) for pred in predictions]
+    if not score_dicts or not score_dicts[0]:
+        return

     for k in range(1, len(predictions) + 1):
         for score_method in score_dicts[0].keys():
             # Get valid answers and their results for this field
             valid_answers_and_results = [
                 (elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
-                for elem, correctness_dict in zip(predictions[:k], score_dicts[:k])
+                for elem, correctness_dict in zip(predictions[:k], score_dicts[:k], strict=True)
                 if elem[self.answer_key] is not None
             ]

             # If no valid answers, it's incorrect
             if not valid_answers_and_results:
-                is_correct = False
+                pass  # No valid answers to process
             else:
                 is_correct_best = sorted(valid_answers_and_results, key=lambda x: x[2], reverse=True)[0][1]
                 self.eval_dict[f"rm_best@{k}"][score_method] += is_correct_best

                 answer_to_score_dict = defaultdict(float)
                 answer_to_correctness_dict = {}
-                for predicted_answer, is_correct, reward_score in valid_answers_and_results:
+                for predicted_answer, correctness, reward_score in valid_answers_and_results:
                     answer_to_score_dict[predicted_answer] += reward_score
-                    answer_to_correctness_dict[predicted_answer] = is_correct
+                    answer_to_correctness_dict[predicted_answer] = correctness
🧰 Tools
🪛 Ruff (0.14.11)

34-34: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

Comment on lines +61 to +69
def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
correctness_dict = {}
if "judgement" in prediction:
judgement = prediction["judgement"]
correctness_dict["judge_correct"] = int(judgement.lower() == "a")
correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
return correctness_dict
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the file and understand its structure
fd -p omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 108


🏁 Script executed:

# Read the omni_metrics.py file to see the full context
wc -l nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 112


🏁 Script executed:

# Read relevant sections of the file
head -150 nemo_skills/evaluation/metrics/omni_metrics.py | tail -100

Repository: NVIDIA-NeMo/Skills

Length of output: 4822


🏁 Script executed:

# Look at the _compute_reward_at_k method around line 30
sed -n '20,50p' nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1678


🏁 Script executed:

# Check the full structure of the class and find base class
sed -n '1,100p' nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 4883


🏁 Script executed:

# Search for _compute_pass_at_k in the file
grep -n "_compute_pass_at_k\|_compute_reward_at_k" nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 246


🏁 Script executed:

# Find BaseMetrics class
fd -p math_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 108


🏁 Script executed:

# Check the BaseMetrics class and _compute_pass_at_k method
cat nemo_skills/evaluation/metrics/math_metrics.py | head -200

Repository: NVIDIA-NeMo/Skills

Length of output: 6732


🏁 Script executed:

# Find and read the base metrics class
fd -p base.py | grep evaluation

Repository: NVIDIA-NeMo/Skills

Length of output: 141


🏁 Script executed:

# Read the base metrics file
cat nemo_skills/evaluation/metrics/base.py | head -250

Repository: NVIDIA-NeMo/Skills

Length of output: 10587


🏁 Script executed:

# Search for _compute_pass_at_k implementation
grep -n "_compute_pass_at_k" nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 624


🏁 Script executed:

# Read more of base.py to find _compute_pass_at_k
wc -l nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 104


🏁 Script executed:

# Read the _compute_pass_at_k method from base.py
sed -n '200,350p' nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 6200


🏁 Script executed:

# Read the _compute_pass_at_k method
sed -n '352,420p' nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 3433


🏁 Script executed:

# Verify the get_metrics method and understand the issue better
sed -n '73,85p' nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 717


🏁 Script executed:

# Check if there's any initialization or default values in eval_dict
grep -A 10 "def __init__" nemo_skills/evaluation/metrics/omni_metrics.py

Repository: NVIDIA-NeMo/Skills

Length of output: 597


🏁 Script executed:

# Check the base class initialization
sed -n '25,35p' nemo_skills/evaluation/metrics/base.py

Repository: NVIDIA-NeMo/Skills

Length of output: 525


🏁 Script executed:

# Check the reset method to see how eval_dict is initialized
sed -n '435,445p' nemo_skills/evaluation/metrics/base.py



🏁 Script executed:

# Let's trace through exactly what happens - check how metrics are accessed
# Check if get_metrics handles missing keys gracefully
sed -n '35,55p' nemo_skills/evaluation/metrics/base.py



🏁 Script executed:

# Let's verify the actual flow by checking OmniMetrics.update() completely
sed -n '94,102p' nemo_skills/evaluation/metrics/omni_metrics.py



🏁 Script executed:

# Check what happens with defaultdict when accessing missing keys
python3 << 'EOF'
from collections import defaultdict

# This is how eval_dict is structured
eval_dict = defaultdict(lambda: defaultdict(float))

# Simulate what happens when no metrics are added
score_dicts = [{}]  # Empty score dict

# This loop doesn't execute
for score_method in score_dicts[0].keys():
    eval_dict[f"pass@1"]["method"] += 1

# Now try to access keys like get_metrics does
print("Keys in eval_dict:", dict(eval_dict))
print("Keys in eval_dict['pass@1']:", dict(eval_dict["pass@1"]))

# Check if accessing missing key raises error or returns default
try:
    value = eval_dict["pass@1"]["judge_correct"]
    print(f"Value for judge_correct: {value}")
except KeyError as e:
    print(f"KeyError: {e}")
EOF



🏁 Script executed:

# Let's test what happens in the actual OmniMetrics.get_metrics() flow
python3 << 'EOF'
from collections import defaultdict

# Simulate the flow
eval_dict = defaultdict(lambda: defaultdict(float))

# When no metrics are computed (empty score dicts), nothing is added
# Simulate calling get_metrics from base class
metrics_dict = {}
for agg_mode, agg_metric_dict in eval_dict.items():
    metrics_dict[agg_mode] = {}
    for metric_key, metric_value in agg_metric_dict.items():
        if isinstance(metric_value, float):
            metrics_dict[agg_mode][metric_key] = 100.0 * metric_value / 1  # assume total=1

print("After base get_metrics:", metrics_dict)

# Now OmniMetrics.get_metrics() tries to access keys
if metrics_dict:  # Only if there are entries
    for agg_method, agg_metric_dict in metrics_dict.items():
        try:
            correct = agg_metric_dict["judge_correct"]
            print(f"Got judge_correct: {correct}")
        except KeyError as e:
            print(f"KeyError accessing judge_correct: {e}")
else:
    print("metrics_dict is empty, so the loop won't even execute")
EOF



🏁 Script executed:

# Let's check if there's always at least one agg_mode in eval_dict
grep -n "eval_dict\[" nemo_skills/evaluation/metrics/base.py | head -20



🏁 Script executed:

# Let's trace through a realistic scenario - when _compute_pass_at_k is called with None predicted_answers
python3 << 'EOF'
from collections import defaultdict

# Simulate OmniMetrics scenario
eval_dict = defaultdict(lambda: defaultdict(float))

# Simulate _compute_pass_at_k with empty score_dicts[0] and predicted_answers=None
score_dicts = [{}]  # Empty because no "judgement" field
predicted_answers = None
compute_no_answer = True

# The loop that depends on score_dicts[0]
for score_method in score_dicts[0].keys():
    # This doesn't execute
    eval_dict[f"pass@1"]["method"] += 1
    pass

# The no_answer logic
for k in range(1, 2):
    if predicted_answers is not None and compute_no_answer:
        # This condition is False because predicted_answers is None
        eval_dict[f"pass@{k}"]["no_answer"] += True
    pass

# Check what's in eval_dict
print("eval_dict after _compute_pass_at_k:", dict(eval_dict))
print("Is eval_dict empty?", len(eval_dict) == 0)

# Now simulate get_metrics flow
metrics_dict = {}
for agg_mode, agg_metric_dict in eval_dict.items():
    metrics_dict[agg_mode] = dict(agg_metric_dict)

print("metrics_dict from base get_metrics:", metrics_dict)

# Now OmniMetrics.get_metrics() tries to access
print("\nWould OmniMetrics.get_metrics() be called?")
if metrics_dict:
    print("  Yes, and it would try to access judge_correct, judge_incorrect, etc.")
    for agg_method, agg_metric_dict in metrics_dict.items():
        try:
            val = agg_metric_dict["judge_correct"]
            print(f"  Got {val}")
        except KeyError as e:
            print(f"  KeyError: {e}")
else:
    print("  No, metrics_dict is empty so the loop doesn't execute")
EOF



🏁 Script executed:

# Test the actual problematic scenario - mixed predictions with and without "judgement"
python3 << 'EOF'
from collections import defaultdict

# Simulate mixed case - some predictions have judgement, others don't
def _get_score_dict_omni(prediction):
    correctness_dict = {}
    if "judgement" in prediction:
        judgement = prediction["judgement"]
        correctness_dict["judge_correct"] = int(judgement.lower() == "a")
        correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
        correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
        correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
    return correctness_dict

# Test scenario: some predictions have judgement, others don't
predictions = [
    {"judgement": "A", "generation": "answer1"},
    {"generation": "answer2"},  # Missing judgement
]

score_dicts = [_get_score_dict_omni(pred) for pred in predictions]
print("score_dicts:", score_dicts)

# Now try _compute_pass_at_k logic
try:
    for score_method in score_dicts[0].keys():
        print(f"Processing score_method: {score_method}")
        # This is line 369 in _compute_pass_at_k
        scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]
        print(f"  scores_list: {scores_list}")
except KeyError as e:
    print(f"KeyError: {e}")
    print(f"Trying to access '{e.args[0]}' in one of the score_dicts")
EOF



🏁 Script executed:

# Check the update() method to see if there are any guards
sed -n '94,102p' nemo_skills/evaluation/metrics/omni_metrics.py



🏁 Script executed:

# Also check if OmniMetrics is used with consistent data
grep -r "OmniMetrics" nemo_skills/evaluation/ | grep -v ".pyc" | head -10



🏁 Script executed:

# Check if there's any validation or guarantee that all predictions have consistent structure
grep -B 5 -A 10 "def _compute_pass_at_k" nemo_skills/evaluation/metrics/base.py | head -25



🏁 Script executed:

# Test the exact error scenario again with the actual code
python3 << 'EOF'
# Reproduce the exact crash with line numbers from base.py
import math
from collections import defaultdict

# Simulate BaseMetrics._compute_pass_at_k
def _get_score_dict(prediction):
    """OmniMetrics version"""
    correctness_dict = {}
    if "judgement" in prediction:
        judgement = prediction["judgement"]
        correctness_dict["judge_correct"] = int(judgement.lower() == "a")
        correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
        correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
        correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
    return correctness_dict

def test_crash():
    predictions = [
        {"judgement": "A", "generation": "answer1"},
        {"generation": "answer2"},  # Missing judgement - will cause crash
    ]
    
    eval_dict = defaultdict(lambda: defaultdict(float))
    score_dicts = [_get_score_dict(pred) for pred in predictions]
    
    print(f"score_dicts = {score_dicts}")
    print(f"score_dicts[0].keys() = {list(score_dicts[0].keys())}")
    
    try:
        # This is line 368 in base.py _compute_pass_at_k
        for score_method in score_dicts[0].keys():
            # This is line 369 in base.py
            scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]
            print(f"Success: {score_method} = {scores_list}")
    except KeyError as e:
        print(f"CRASH at line 369: KeyError accessing '{e.args[0]}' in score_dicts[1]")
        return False
    return True

test_crash()
EOF



KeyError crash when predictions have inconsistent "judgement" field.

When some (but not all) predictions lack the "judgement" field, _get_score_dict returns an empty dict for those predictions. This causes a KeyError crash in _compute_pass_at_k (base.py line 369) when trying to build scores_list:

scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]

If score_dicts[0] has keys like "judge_correct" but score_dicts[i] is empty, accessing score_dicts[i]["judge_correct"] raises KeyError.

Validate that all predictions in a batch contain the "judgement" field, or ensure _get_score_dict returns consistent keys across all predictions (e.g., with default values).
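One way to satisfy the second option — consistent keys across all predictions — is to always emit every judgement key, defaulting to 0 when the field is absent. The sketch below is hypothetical (a standalone stand-in for the method under review, using the A–D judgement letters discussed in this thread), not the merged implementation:

```python
# Hypothetical _get_score_dict variant that always returns the same keys,
# so downstream code like _compute_pass_at_k never hits a KeyError when
# some predictions lack the "judgement" field.
JUDGEMENT_KEYS = {
    "a": "judge_correct",
    "b": "judge_incorrect",
    "c": "judge_partially_correct",
    "d": "judge_abstained",
}

def get_score_dict(prediction: dict) -> dict:
    judgement = prediction.get("judgement", "").lower()
    # Every key is always present; a missing judgement scores 0 everywhere.
    return {name: int(judgement == letter) for letter, name in JUDGEMENT_KEYS.items()}

predictions = [
    {"judgement": "A", "generation": "answer1"},
    {"generation": "answer2"},  # missing judgement no longer crashes
]
score_dicts = [get_score_dict(p) for p in predictions]
for score_method in score_dicts[0].keys():
    scores_list = [d[score_method] for d in score_dicts]  # safe: keys are uniform
```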

Comment on lines +71 to +85
def get_metrics(self):
    metrics = super().get_metrics()

    for agg_method, agg_metric_dict in metrics.items():
        correct, incorrect, part_correct, abstained = (
            agg_metric_dict["judge_correct"],
            agg_metric_dict["judge_incorrect"],
            agg_metric_dict["judge_partially_correct"],
            agg_metric_dict["judge_abstained"],
        )
        metrics[agg_method]["judge_omni_index"] = (
            100 * (correct - incorrect) / (correct + incorrect + part_correct + abstained)
        )
        metrics[agg_method]["judge_omni_hallucination"] = 100 * incorrect / (incorrect + part_correct + abstained)
    return metrics

⚠️ Potential issue | 🟠 Major

Potential ZeroDivisionError in metric calculations.

Two division operations can fail:

  1. Lines 82-83: (correct + incorrect + part_correct + abstained) equals zero if no judgements exist.
  2. Line 84: (incorrect + part_correct + abstained) equals zero when all responses are judge_correct (judgement "A").

This will crash metrics computation in edge cases (empty data or perfect scores).

Suggested fix with guards
     def get_metrics(self):
         metrics = super().get_metrics()

         for agg_method, agg_metric_dict in metrics.items():
             correct, incorrect, part_correct, abstained = (
                 agg_metric_dict["judge_correct"],
                 agg_metric_dict["judge_incorrect"],
                 agg_metric_dict["judge_partially_correct"],
                 agg_metric_dict["judge_abstained"],
             )
-            metrics[agg_method]["judge_omni_index"] = (
-                100 * (correct - incorrect) / (correct + incorrect + part_correct + abstained)
-            )
-            metrics[agg_method]["judge_omni_hallucination"] = 100 * incorrect / (incorrect + part_correct + abstained)
+            total = correct + incorrect + part_correct + abstained
+            non_correct_total = incorrect + part_correct + abstained
+            
+            metrics[agg_method]["judge_omni_index"] = (
+                100 * (correct - incorrect) / total if total > 0 else 0.0
+            )
+            metrics[agg_method]["judge_omni_hallucination"] = (
+                100 * incorrect / non_correct_total if non_correct_total > 0 else 0.0
+            )
         return metrics
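As a sanity check on the guarded formulas, here is a standalone arithmetic sketch mirroring the suggested fix (the counts are made up for illustration):

```python
def omni_metrics(correct, incorrect, part_correct, abstained):
    """Guarded versions of the two derived metrics from the review above."""
    total = correct + incorrect + part_correct + abstained
    non_correct_total = incorrect + part_correct + abstained
    omni_index = 100 * (correct - incorrect) / total if total > 0 else 0.0
    hallucination = 100 * incorrect / non_correct_total if non_correct_total > 0 else 0.0
    return omni_index, hallucination

# Ordinary case: 6 correct, 2 incorrect, 1 partially correct, 1 abstained.
print(omni_metrics(6, 2, 1, 1))   # (40.0, 50.0)
# Edge cases that crash the unguarded version:
print(omni_metrics(0, 0, 0, 0))   # no judgements at all -> (0.0, 0.0)
print(omni_metrics(10, 0, 0, 0))  # perfect score -> (100.0, 0.0)
```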

Comment on lines +96 to +100
def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:
        self._compute_reward_at_k(predictions=predictions)

⚠️ Potential issue | 🟡 Minor

IndexError risk with empty predictions.

Line 99 accesses predictions[0] without checking if predictions is non-empty. If update is called with an empty list, this will raise IndexError.

Suggested fix
 def update(self, predictions):
     super().update(predictions)
+    if not predictions:
+        return
     self._compute_pass_at_k(predictions, None)
     if "reward_model_score" in predictions[0]:
         self._compute_reward_at_k(predictions=predictions)
🤖 Prompt for AI Agents
In @nemo_skills/evaluation/metrics/omni_metrics.py around lines 96 - 100, The
update method accesses predictions[0] without checking for an empty list,
risking IndexError; after calling super().update(predictions) add a guard like
"if not predictions: return" to avoid further processing on an empty list, or at
minimum change the reward-model check to "if predictions and
'reward_model_score' in predictions[0]:"; ensure this guard is applied before
calling _compute_pass_at_k and _compute_reward_at_k so both methods aren't
invoked with an empty predictions list (refer to the update method and helpers
_compute_pass_at_k and _compute_reward_at_k).

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 6 comments

@@ -0,0 +1,26 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Copyright year is 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

@@ -0,0 +1,80 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Copyright year is 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

args = parse_args()

dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]

jsonl_data variable is assigned but never used - can be removed

@@ -0,0 +1,124 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Copyright year is 2026

Suggested change
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.


# If no valid answers, it's incorrect
if not valid_answers_and_results:
    is_correct = False

is_correct variable is assigned but never used - can be removed

Comment on lines +85 to +88
metrics[agg_method]["judge_omni_hallucination"] = (
    100 * incorrect / (incorrect + part_correct + abstained)
    if (incorrect + part_correct + abstained) > 0 else 0
)

Consider verifying the denominator logic - if all predictions are correct, this returns 0, but the hallucination metric definition may need clarification for this edge case

Froxyy-dev and others added 17 commits January 13, 2026 10:27
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
…added data-dependent system prompt formatting

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 2 comments

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: if every prediction lacks the judgement field, score_dicts is a list of empty dicts and score_dicts[0].keys() yields nothing, so scoring is silently skipped; if the predictions list itself is empty, accessing score_dicts[0] raises an IndexError

Suggested change
for score_method in score_dicts[0].keys():
if not score_dicts or not score_dicts[0]:
continue
for score_method in score_dicts[0].keys():

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions list is empty - accessing predictions[0] without checking if list is non-empty

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 1 comment

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: .format() on system messages will unescape {{ to { in existing prompts. Files like nemo_skills/prompt/config/generic/hle.yaml (line 5-7: {{your explanation}}, {{your succinct, final answer}}, {{your confidence score}}) and nemo_skills/prompt/config/qwen/math-cot.yaml (line 3: \boxed{{}}) use {{ to display literal braces to the model. After this change, these become {your explanation}, {your answer}, and \boxed{}, which will either cause KeyError if those keys don't exist in input_dict, or worse, silently replace them with actual values if keys happen to match.

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system if isinstance(self.config.system, str) and '{' not in self.config.system else self.config.system.format(**input_dict) if '{' in self.config.system and '{{' not in self.config.system else self.config.system},

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 3 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: calling .format(**input_dict) will convert {{ to { in existing system prompts that use double braces for literal display. For example, nemo_skills/prompt/config/qwen/math-cot.yaml:3 has \boxed{{}} which becomes \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 has {{your explanation}} which becomes {your explanation} - this will either cause KeyError if those keys don't exist in input_dict, or silently replace them with values if keys happen to match

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system.format_map(defaultdict(str, **input_dict))},

was the intention to unescape double braces in existing prompts, or should literal braces be preserved?
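The brace-unescaping behaviour both reviewers describe is plain str.format semantics and is easy to reproduce in isolation, independent of Nemo-Skills (the template string below is a made-up stand-in for the prompt configs mentioned):

```python
from collections import defaultdict

template = r"Put the final answer in \boxed{{}} with {confidence} percent confidence."

# .format() with a matching key fills {confidence} AND collapses {{}} to {}.
print(template.format(confidence=90))
# Put the final answer in \boxed{} with 90 percent confidence.

# .format() with a missing key raises KeyError:
try:
    template.format()
except KeyError as e:
    print("KeyError:", e)

# format_map over a defaultdict avoids the KeyError (missing keys become ""),
# but it still unescapes the literal {{ }} braces:
print(template.format_map(defaultdict(str)))
```

So format_map only papers over the KeyError; any prompt relying on {{ to display literal braces is still rewritten either way.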

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions is empty - accessing predictions[0] without checking length

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: KeyError if score_dicts is empty (when all predictions lack judgement field) - trying to access .keys() on empty list's first element

Suggested change
for score_method in score_dicts[0].keys():
if not score_dicts or not score_dicts[0]:
continue
for score_method in score_dicts[0].keys():

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 4 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: breaks existing prompts using {{ for literal braces. Files like nemo_skills/prompt/config/generic/hle.yaml (lines 5-7: {{your explanation}}, {{your succinct, final answer}}) and nemo_skills/prompt/config/eval/aai/math.yaml (lines 4,8: \boxed{{}}) use {{ to display literal braces. .format() converts {{ to {, causing either KeyError if keys don't exist, or unintended replacements if they do.

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system},

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions is empty - accessing predictions[0] without checking length first

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: IndexError if all predictions lack judgement field - score_dicts will be a list of empty dicts, causing score_dicts[0].keys() to fail when trying to iterate

Suggested change
for score_method in score_dicts[0].keys():
if not score_dicts or not score_dicts[0]:
continue
for score_method in score_dicts[0].keys():

Comment on lines +63 to +64
dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]

style: jsonl_data created but never used - can be removed

Suggested change
dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]
dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 2 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: .format(**input_dict) will unescape {{ to { in existing system prompts. Files like nemo_skills/prompt/config/qwen/math-cot.yaml:3 use \boxed{{}} which will become \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 use {{your explanation}}, {{your succinct, final answer}} which will become {your explanation}, {your answer} - causing KeyError or unintended replacements

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system.format_map(defaultdict(str, input_dict))},

or use a safer approach that only formats when needed. Was the intention to unescape double braces in existing prompts, or should they be preserved?

score_dicts = [self._get_score_dict(pred) for pred in predictions]

for k in range(1, len(predictions) + 1):
    for score_method in score_dicts[0].keys():

logic: IndexError if predictions is empty - accessing score_dicts[0] without checking length

Suggested change
for score_method in score_dicts[0].keys():
for score_method in (score_dicts[0].keys() if score_dicts else []):

Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
@greptile-apps greptile-apps bot left a comment

8 files reviewed, 2 comments

if self.config.system is not None:
    messages = [
        {"role": "system", "content": self.config.system},
        {"role": "system", "content": self.config.system.format(**input_dict)},

logic: calling .format(**input_dict) unescapes {{ to { in existing system prompts. Prompts like nemo_skills/prompt/config/qwen/math-cot.yaml:3 (\boxed{{}}) become \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 ({{your explanation}}, {{your succinct, final answer}}) become {your explanation}, {your answer} - causing KeyError if those keys don't exist in input_dict, or unintended replacements if they do

Suggested change
{"role": "system", "content": self.config.system.format(**input_dict)},
{"role": "system", "content": self.config.system.format_map(defaultdict(str, **input_dict))},

alternatively, only format when the system message contains the specific keys from input_dict, or escape existing braces before formatting. was the intention to unescape {{}} in existing prompts, or should literal braces be preserved?

def update(self, predictions):
    super().update(predictions)
    self._compute_pass_at_k(predictions, None)
    if "reward_model_score" in predictions[0]:

logic: IndexError if predictions list is empty - accessing predictions[0] without checking length first

Suggested change
if "reward_model_score" in predictions[0]:
if predictions and "reward_model_score" in predictions[0]:

@arnavkomaragiri arnavkomaragiri merged commit ba25265 into main Jan 16, 2026
6 checks passed
@arnavkomaragiri arnavkomaragiri deleted the akomaragiri/aai_omniscience branch January 16, 2026 23:25
@coderabbitai coderabbitai bot mentioned this pull request Feb 5, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: Arnav Komaragiri <arnav.komaragiri@gmail.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: Arnav Komaragiri <arnav.komaragiri@gmail.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>


8 participants