Added AAI-Omniscience Benchmark #1161
Conversation
fiddle/omni_eval.py (outdated)

```diff
@@ -0,0 +1,45 @@
 import argparse
```

This is not a Nemo-Skills file, right?

It's not, just a basic debug file I added to test it since ns wasn't working in my env. I'll delete it to clean things up.
```python
self.answer_key = answer_key

# use same RM code as MathMetrics
def _compute_reward_at_k(self, predictions: list[dict]):
```

Is this function required for all datasets? Seems like math has it, but most of the datasets don't.

I don't think it's required, but I figured it might be helpful since a reward model could serve as a proxy for a judge model and may be useful for the downstream task. It's not critical to the benchmark itself though, so we can drop it if need be.
```diff
 if self.config.system is not None:
     messages = [
-        {"role": "system", "content": self.config.system},
+        {"role": "system", "content": self.config.system.format(**input_dict)},
```

This change touches all Nemo-Skills code, not just omniscience. If it's really required, maybe ask Igor to validate?

This is required since the system prompt is data dependent for Omniscience for the topic and domain fields; I'll ask Igor to validate to make sure this doesn't break other code.
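To make the discussed behavior concrete, here is a minimal, hypothetical sketch of data-dependent system-prompt formatting. The template text and the `build_messages` helper are illustrative assumptions, not the actual Nemo-Skills code:

```python
# Illustrative only: mirrors the idea of formatting the system prompt with
# per-example fields, as discussed above. Template text and helper name are
# assumptions, not the real Nemo-Skills implementation.
system_template = "You are an expert in {domain}, specifically {topic}."

def build_messages(system_template: str, input_dict: dict, user_content: str) -> list[dict]:
    # str.format(**input_dict) fills {domain}/{topic} from the example's fields;
    # extra keys in input_dict are ignored, missing keys raise KeyError.
    return [
        {"role": "system", "content": system_template.format(**input_dict)},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    system_template,
    {"domain": "Law", "topic": "Contract Law", "question": "What is consideration?"},
    "What is consideration?",
)
print(messages[0]["content"])  # prints "You are an expert in Law, specifically Contract Law."
```

Because `str.format(**input_dict)` ignores unused keys but raises `KeyError` on missing ones, any config that adds placeholders to the system prompt must guarantee those fields exist in every example.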
dockerfiles/Dockerfile.nemo-skills (outdated)

```dockerfile
RUN pip install --no-cache-dir -r /opt/NeMo-Skills/requirements/main.txt
# Fix http mismatch between lepton and ddgs by manually downloading ddgs here
RUN pip install ddgs
RUN pip install func-timeout
```

This is handled in the current Nemo-Skills docker image.
Force-pushed from c5a60b8 to 5fc257b
Greptile Summary

Adds AAI-Omniscience benchmark evaluation with judge-based correctness scoring and specialized hallucination metrics. The implementation includes dataset preparation, custom metrics computation (omni-index and hallucination rate), and comprehensive documentation with configuration examples.

Critical Issue:

Other Issues:

Confidence Score: 2/5

Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Eval Pipeline
    participant Prompt Utils
    participant Model
    participant Judge
    participant OmniMetrics
    User->>Eval Pipeline: eval(benchmarks="omniscience")
    Eval Pipeline->>Prompt Utils: Load omni.yaml config
    Prompt Utils->>Prompt Utils: system.format(domain, topic, question)
    Prompt Utils->>Model: Generate answer
    Model-->>Prompt Utils: generation response
    Prompt Utils-->>Eval Pipeline: predictions with generation
    Eval Pipeline->>Judge: Load aa-omni-judge.yaml
    Judge->>Judge: Compare generation vs expected_answer
    Judge-->>Eval Pipeline: judgement (A/B/C/D)
    Eval Pipeline->>OmniMetrics: update(predictions)
    OmniMetrics->>OmniMetrics: _get_score_dict(judgement)
    OmniMetrics->>OmniMetrics: _compute_pass_at_k()
    alt reward_model_score exists
        OmniMetrics->>OmniMetrics: _compute_reward_at_k()
    end
    OmniMetrics->>OmniMetrics: get_metrics()
    OmniMetrics->>OmniMetrics: Calculate omni_index & hallucination_rate
    OmniMetrics-->>Eval Pipeline: Final metrics
    Eval Pipeline-->>User: Results with accuracy, omni-index, hallucination rate
```
```diff
@@ -0,0 +1,26 @@
 # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```

Copyright year is 2026 (future year). Suggested change:

```diff
-# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
```diff
@@ -0,0 +1,80 @@
 # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```

Copyright year is 2026 (future year). Suggested change:

```diff
-# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
```diff
@@ -0,0 +1,120 @@
 # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
```

Copyright year is 2026 (future year). Suggested change:

```diff
-# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
```python
args = parse_args()

dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
jsonl_data = [format_entry(d) for d in dataset]
```

`jsonl_data` variable is unused.
```python
# If no valid answers, it's incorrect
if not valid_answers_and_results:
    is_correct = False
```

`is_correct` variable is assigned but never used.
📝 Walkthrough

This PR introduces the "omniscience" evaluation metric system for the AA-Omniscience dataset. It adds dataset preparation logic, a new metrics class that computes evaluation scores using judge signals and reward model scoring, prompt configurations for evaluation and judging, and integration into the existing metrics framework.

Changes
Sequence Diagrams

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as Prepare Script
    participant HFDataset as HuggingFace Dataset
    participant FileSystem
    User->>PrepareScript: python prepare.py --splits text,math,...
    PrepareScript->>HFDataset: Load AA-Omniscience-Public
    HFDataset-->>PrepareScript: Dataset loaded
    loop For each split (text + per-domain)
        PrepareScript->>PrepareScript: Format entries (id, domain, topic, question, answer)
        alt text split
            PrepareScript->>PrepareScript: Use full dataset
        else domain split
            PrepareScript->>HFDataset: Filter by domain
            HFDataset-->>PrepareScript: Filtered entries
        end
        PrepareScript->>FileSystem: Write split_name.jsonl
    end
    FileSystem-->>User: JSONL files generated
```
```mermaid
sequenceDiagram
    participant Evaluator
    participant OmniMetrics
    participant BaseMetrics
    participant Judge
    participant RewardModel
    Evaluator->>OmniMetrics: update(predictions)
    OmniMetrics->>BaseMetrics: Parent update logic
    loop For each prediction
        OmniMetrics->>OmniMetrics: _get_score_dict (extract judge_correct, etc.)
        alt reward_model_score exists
            OmniMetrics->>RewardModel: Evaluate prediction
            RewardModel-->>OmniMetrics: Score returned
            OmniMetrics->>OmniMetrics: _compute_reward_at_k (best/majority selection)
        end
    end
    OmniMetrics->>OmniMetrics: get_metrics (compute judge_omni_index, hallucination)
    OmniMetrics-->>Evaluator: Augmented metrics returned
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @nemo_skills/dataset/omniscience/__init__.py:
- Around line 21-25: JUDGE_PIPELINE_ARGS currently sets the model to the preview
variant "gemini-2.5-flash-preview-09-2025"; update the "model" value in the
JUDGE_PIPELINE_ARGS dict to the stable GA name "gemini-2.5-flash" (leave other
keys like "server_type" and "server_address" unchanged) so the code uses the
supported GA model.
In @nemo_skills/evaluation/metrics/omni_metrics.py:
- Around line 96-100: The update method accesses predictions[0] without checking
for an empty list, risking IndexError; after calling super().update(predictions)
add a guard like "if not predictions: return" to avoid further processing on an
empty list, or at minimum change the reward-model check to "if predictions and
'reward_model_score' in predictions[0]:"; ensure this guard is applied before
calling _compute_pass_at_k and _compute_reward_at_k so both methods aren't
invoked with an empty predictions list (refer to the update method and helpers
_compute_pass_at_k and _compute_reward_at_k).
🧹 Nitpick comments (4)
nemo_skills/dataset/omniscience/prepare.py (2)
64-64: Remove unused variable.

`jsonl_data` is computed but never used. This appears to be dead code from earlier iterations.

🧹 Suggested fix

```diff
 dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train")
-jsonl_data = [format_entry(d) for d in dataset]
 output_dir = Path(__file__).absolute().parent
```
68-74: Lambda closure captures loop variable by reference.

The lambda `lambda x: x["domain"] == t` captures `t` by reference. In a dict comprehension, this is typically problematic because all lambdas would reference the final value of `t`. While `dataset.filter()` likely evaluates immediately (avoiding the bug), this is fragile and flagged by static analysis (B023).

♻️ Recommended fix using default argument capture

```diff
 splits = {
     "text": dataset,
     **{
-        TOPIC_TO_SPLIT_MAP.get(t, str(t).lower()): dataset.filter(lambda x: x["domain"] == t)
+        TOPIC_TO_SPLIT_MAP.get(t, str(t).lower()): dataset.filter(lambda x, domain=t: x["domain"] == domain)
         for t in dataset.unique("domain")
     },
 }
```

nemo_skills/evaluation/metrics/omni_metrics.py (2)
87-94: In-place mutation may cause unexpected side effects.
`get_incorrect_sample` mutates the input `prediction` dict directly. If the caller doesn't expect this, it could lead to subtle bugs. Consider returning a copy instead.

Safer approach using copy:

```diff
 def get_incorrect_sample(self, prediction: dict) -> dict:
+    prediction = prediction.copy()
     if "judgement" in prediction:
         prediction["judgement"] = "B"
         prediction["judge_correct"] = 0
         prediction["judge_incorrect"] = 1
         prediction["judge_partially_correct"] = 0
         prediction["judge_abstained"] = 0
     return prediction
```
15-17: Import directly from `base` module for clarity and consistency.

`BaseMetrics`, `as_int`, and `as_percentage` are all defined in `base.py`, not `math_metrics.py`. While importing from `math_metrics.py` works because it re-exports these symbols, the codebase standard (used in all other metrics files) is to import directly from `base.py`.

Suggested import path:

```diff
-from nemo_skills.evaluation.metrics.math_metrics import BaseMetrics, as_int, as_percentage
+from nemo_skills.evaluation.metrics.base import BaseMetrics, as_int, as_percentage
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- nemo_skills/dataset/omniscience/__init__.py
- nemo_skills/dataset/omniscience/prepare.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/omni_metrics.py
- nemo_skills/prompt/config/eval/aai/omni.yaml
- nemo_skills/prompt/config/judge/aa-omni-judge.yaml
- nemo_skills/prompt/utils.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/prompt/config/judge/aa-omni-judge.yaml
🧬 Code graph analysis (2)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/omni_metrics.py (1)
- OmniMetrics (20-120)
nemo_skills/evaluation/metrics/omni_metrics.py (2)
nemo_skills/evaluation/metrics/base.py (4)
- BaseMetrics (23-434)
- as_int (443-446)
- as_percentage (437-440)
- _compute_pass_at_k (352-423)

nemo_skills/evaluation/metrics/map_metrics.py (1)

- get_metrics (81-110)
🪛 Ruff (0.14.11)
nemo_skills/dataset/omniscience/prepare.py
71-71: Function definition does not bind loop variable t
(B023)
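For readers unfamiliar with B023, a standalone illustration of the late-binding pitfall (using plain lists rather than `datasets`, so this is an analogy, not the flagged code):

```python
# Each lambda created in a comprehension closes over the loop variable
# itself, not its value at creation time (late binding).
topics = ["math", "law", "health"]

late = [lambda x: x == t for t in topics]
# After iteration, t holds "health", so every lambda compares against it.
late_results = [f("math") for f in late]   # [False, False, False]

# Default-argument capture freezes each value at definition time.
bound = [lambda x, t=t: x == t for t in topics]
bound_results = [f("math") for f in bound]  # [True, False, False]
```

This is why the recommended fix binds the loop variable as a default argument (`lambda x, domain=t: ...`), even if `dataset.filter()` happens to evaluate each lambda eagerly.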
nemo_skills/evaluation/metrics/omni_metrics.py
34-34: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
🔇 Additional comments (8)
nemo_skills/prompt/utils.py (1)
265-268: LGTM! Dynamic system message formatting enables data-driven prompts.

This change aligns with the existing pattern used for user messages (line 193) and enables the new omni.yaml config to substitute `{domain}` and `{topic}` at runtime. The behavior is consistent: if `input_dict` lacks required keys, a `KeyError` will be raised, matching how user message formatting already behaves.

nemo_skills/prompt/config/eval/aai/omni.yaml (1)
1-7: LGTM! Well-structured evaluation prompt config.

The prompt correctly uses placeholders (`{domain}`, `{topic}`, `{question}`) that align with the fields produced by prepare.py's `format_entry()`. The system prompt appropriately instructs the model to provide direct answers and explicitly state when it lacks sufficient context.

nemo_skills/prompt/config/judge/aa-omni-judge.yaml (1)
1-99: LGTM! Comprehensive and well-documented judge prompt.

The grading rubric is thorough with clear distinctions:
- Numeric precision rules (lines 14-16) correctly differentiate between measurement values (allow rounding) vs. identifiers/versions (require exact match).
- Edge cases for typos, inferred context, and coding equivalence are well-covered.
- The A/B/C/D mapping (lines 93-97) provides unambiguous output parsing.
The placeholders `{question}`, `{expected_answer}`, and `{generation}` align with the evaluation pipeline's data flow.

nemo_skills/dataset/omniscience/__init__.py (1)
16-16: Verify `DATASET_GROUP` value.

`DATASET_GROUP = "math"` seems inconsistent with the AA-Omniscience benchmark, which covers diverse domains (humanities, health, law, finance, SWE, STEM). Is this intentional for framework compatibility, or should it be a more generic value like `"omniscience"` or `"knowledge"`?

nemo_skills/evaluation/metrics/omni_metrics.py (2)
20-23: LGTM!

Constructor correctly initializes the parent class and stores the `answer_key` attribute.
102-119: LGTM!
`evaluations_to_print` and `metrics_to_print` are correctly implemented, returning appropriate evaluation labels and metric formatters consistent with the base class patterns.
41-41: LGTM!

Import follows the existing alphabetical ordering convention in the file.
76-76: LGTM!

Registration of the `"omniscience"` metric type in `METRICS_MAP` follows the established pattern and correctly maps to `OmniMetrics`.
```python
JUDGE_PIPELINE_ARGS = {
    "model": "gemini-2.5-flash-preview-09-2025",
    "server_type": "gemini",
    "server_address": "https://generativelanguage.googleapis.com",
}
```
🧩 Analysis chain
🌐 Web query:
Gemini 2.5 flash model versions January 2026 availability
💡 Result:
Short answer — as of January 13, 2026:
- Gemini 2.5 Flash (gemini-2.5-flash) — GA, available in the Gemini app, Google AI Studio and Vertex AI (launched June 17, 2025; scheduled retirement listed as June 17, 2026). [1][2]
- Gemini 2.5 Flash‑Lite (gemini-2.5-flash-lite) — GA, available to developers (released July 22, 2025; earliest retirement July 22, 2026). [2][3]
- Gemini 2.5 Flash‑Image (gemini-2.5-flash-image) — available (released Oct 2, 2025). [2][3]
- Live / native‑audio variants — there were preview/live variants (e.g., gemini-live-2.5-flash and native‑audio preview) that were preview/private GA; some Live preview model IDs were shut down or replaced in Dec 2025, and newer Gemini 3 Flash rollout is superseding parts of the 2.5 Flash lineup. Check the Gemini API changelog and Vertex AI model pages for exact model IDs and shutdown dates. [1][2][4]
Sources: Google blog & Gemini changelog / Vertex AI docs and recent coverage. [1][2][3][4]
References
[1] Google blog (Gemini updates).
[2] Vertex AI — Gemini 2.5 Flash model page.
[3] Vertex AI — model versions & lifecycle.
[4] Gemini API changelog / deprecations.
Replace with stable GA model version.
The model "gemini-2.5-flash-preview-09-2025" is a preview variant. As of January 2026, preview Gemini 2.5 variants were shut down or replaced in December 2025. Use the stable GA version "gemini-2.5-flash" instead (available through June 17, 2026).
🤖 Prompt for AI Agents
In @nemo_skills/dataset/omniscience/__init__.py around lines 21 - 25,
JUDGE_PIPELINE_ARGS currently sets the model to the preview variant
"gemini-2.5-flash-preview-09-2025"; update the "model" value in the
JUDGE_PIPELINE_ARGS dict to the stable GA name "gemini-2.5-flash" (leave other
keys like "server_type" and "server_address" unchanged) so the code uses the
supported GA model.
```python
def _compute_reward_at_k(self, predictions: list[dict]):
    score_dicts = [self._get_score_dict(pred) for pred in predictions]

    for k in range(1, len(predictions) + 1):
        for score_method in score_dicts[0].keys():
            # Get valid answers and their results for this field
            valid_answers_and_results = [
                (elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
                for elem, correctness_dict in zip(predictions[:k], score_dicts[:k])
                if elem[self.answer_key] is not None
            ]

            # If no valid answers, it's incorrect
            if not valid_answers_and_results:
                is_correct = False
            else:
                is_correct_best = sorted(valid_answers_and_results, key=lambda x: x[2], reverse=True)[0][1]
                self.eval_dict[f"rm_best@{k}"][score_method] += is_correct_best

                answer_to_score_dict = defaultdict(float)
                answer_to_correctness_dict = {}
                for predicted_answer, is_correct, reward_score in valid_answers_and_results:
                    answer_to_score_dict[predicted_answer] += reward_score
                    answer_to_correctness_dict[predicted_answer] = is_correct

                top_cum_reward_answer = sorted(
                    list(answer_to_score_dict.items()), key=lambda x: x[1], reverse=True
                )[0][0]
                is_correct_majority = answer_to_correctness_dict[top_cum_reward_answer]
                self.eval_dict[f"rm_majority@{k}"][score_method] += is_correct_majority

        no_answer = all(elem[self.answer_key] is None for elem in predictions[:k])
        self.eval_dict[f"rm_best@{k}"]["no_answer"] += no_answer
        self.eval_dict[f"rm_majority@{k}"]["no_answer"] += no_answer
```
Multiple issues in _compute_reward_at_k.

- Potential IndexError (line 30): if `predictions` is empty, `score_dicts` will be empty and `score_dicts[0].keys()` will raise `IndexError`.
- Dead code (line 40): `is_correct = False` is assigned but never used.
- Variable shadowing (line 47): the loop variable `is_correct` shadows the outer `is_correct` from line 40, causing confusion.
- Missing `strict=` on zip (line 34): per static analysis, adding `strict=True` would catch length mismatches.
Suggested fix
```diff
 def _compute_reward_at_k(self, predictions: list[dict]):
+    if not predictions:
+        return
+
     score_dicts = [self._get_score_dict(pred) for pred in predictions]
+    if not score_dicts or not score_dicts[0]:
+        return
     for k in range(1, len(predictions) + 1):
         for score_method in score_dicts[0].keys():
             # Get valid answers and their results for this field
             valid_answers_and_results = [
                 (elem[self.answer_key], correctness_dict[score_method], elem["reward_model_score"])
-                for elem, correctness_dict in zip(predictions[:k], score_dicts[:k])
+                for elem, correctness_dict in zip(predictions[:k], score_dicts[:k], strict=True)
                 if elem[self.answer_key] is not None
             ]
             # If no valid answers, it's incorrect
             if not valid_answers_and_results:
-                is_correct = False
+                pass  # No valid answers to process
             else:
                 is_correct_best = sorted(valid_answers_and_results, key=lambda x: x[2], reverse=True)[0][1]
                 self.eval_dict[f"rm_best@{k}"][score_method] += is_correct_best
                 answer_to_score_dict = defaultdict(float)
                 answer_to_correctness_dict = {}
-                for predicted_answer, is_correct, reward_score in valid_answers_and_results:
+                for predicted_answer, correctness, reward_score in valid_answers_and_results:
                     answer_to_score_dict[predicted_answer] += reward_score
-                    answer_to_correctness_dict[predicted_answer] = is_correct
+                    answer_to_correctness_dict[predicted_answer] = correctness
```

🧰 Tools
+ answer_to_correctness_dict[predicted_answer] = correctness🧰 Tools
🪛 Ruff (0.14.11)
34-34: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
```python
def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]:
    correctness_dict = {}
    if "judgement" in prediction:
        judgement = prediction["judgement"]
        correctness_dict["judge_correct"] = int(judgement.lower() == "a")
        correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
        correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
        correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
    return correctness_dict
```
🧩 Analysis chain
🏁 Script executed:
# First, let's find the file and understand its structure
fd -p omni_metrics.py
🏁 Script executed:
# Read the omni_metrics.py file to see the full context
wc -l nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Read relevant sections of the file
head -150 nemo_skills/evaluation/metrics/omni_metrics.py | tail -100
🏁 Script executed:
# Look at the _compute_reward_at_k method around line 30
sed -n '20,50p' nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Check the full structure of the class and find base class
sed -n '1,100p' nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Search for _compute_pass_at_k in the file
grep -n "_compute_pass_at_k\|_compute_reward_at_k" nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Find BaseMetrics class
fd -p math_metrics.py
🏁 Script executed:
# Check the BaseMetrics class and _compute_pass_at_k method
cat nemo_skills/evaluation/metrics/math_metrics.py | head -200
🏁 Script executed:
# Find and read the base metrics class
fd -p base.py | grep evaluation
🏁 Script executed:
# Read the base metrics file
cat nemo_skills/evaluation/metrics/base.py | head -250
🏁 Script executed:
# Search for _compute_pass_at_k implementation
grep -n "_compute_pass_at_k" nemo_skills/evaluation/metrics/base.py
🏁 Script executed:
# Read more of base.py to find _compute_pass_at_k
wc -l nemo_skills/evaluation/metrics/base.py
🏁 Script executed:
# Read the _compute_pass_at_k method from base.py
sed -n '200,350p' nemo_skills/evaluation/metrics/base.py
🏁 Script executed:
# Read the _compute_pass_at_k method
sed -n '352,420p' nemo_skills/evaluation/metrics/base.py
🏁 Script executed:
# Verify the get_metrics method and understand the issue better
sed -n '73,85p' nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Check if there's any initialization or default values in eval_dict
grep -A 10 "def __init__" nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Check the base class initialization
sed -n '25,35p' nemo_skills/evaluation/metrics/base.py
🏁 Script executed:
# Check the reset method to see how eval_dict is initialized
sed -n '435,445p' nemo_skills/evaluation/metrics/base.py
🏁 Script executed:
# Let's trace through exactly what happens - check how metrics are accessed
# Check if get_metrics handles missing keys gracefully
sed -n '35,55p' nemo_skills/evaluation/metrics/base.py
🏁 Script executed:
# Let's verify the actual flow by checking OmniMetrics.update() completely
sed -n '94,102p' nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Check what happens with defaultdict when accessing missing keys
python3 << 'EOF'
from collections import defaultdict

# This is how eval_dict is structured
eval_dict = defaultdict(lambda: defaultdict(float))

# Simulate what happens when no metrics are added
score_dicts = [{}]  # Empty score dict

# This loop doesn't execute
for score_method in score_dicts[0].keys():
    eval_dict[f"pass@1"]["method"] += 1

# Now try to access keys like get_metrics does
print("Keys in eval_dict:", dict(eval_dict))
print("Keys in eval_dict['pass@1']:", dict(eval_dict["pass@1"]))

# Check if accessing missing key raises error or returns default
try:
    value = eval_dict["pass@1"]["judge_correct"]
    print(f"Value for judge_correct: {value}")
except KeyError as e:
    print(f"KeyError: {e}")
EOF
🏁 Script executed:
# Let's test what happens in the actual OmniMetrics.get_metrics() flow
python3 << 'EOF'
from collections import defaultdict

# Simulate the flow
eval_dict = defaultdict(lambda: defaultdict(float))

# When no metrics are computed (empty score dicts), nothing is added
# Simulate calling get_metrics from base class
metrics_dict = {}
for agg_mode, agg_metric_dict in eval_dict.items():
    metrics_dict[agg_mode] = {}
    for metric_key, metric_value in agg_metric_dict.items():
        if isinstance(metric_value, float):
            metrics_dict[agg_mode][metric_key] = 100.0 * metric_value / 1  # assume total=1
print("After base get_metrics:", metrics_dict)

# Now OmniMetrics.get_metrics() tries to access keys
if metrics_dict:  # Only if there are entries
    for agg_method, agg_metric_dict in metrics_dict.items():
        try:
            correct = agg_metric_dict["judge_correct"]
            print(f"Got judge_correct: {correct}")
        except KeyError as e:
            print(f"KeyError accessing judge_correct: {e}")
else:
    print("metrics_dict is empty, so the loop won't even execute")
EOF
🏁 Script executed:
# Let's check if there's always at least one agg_mode in eval_dict
grep -n "eval_dict\[" nemo_skills/evaluation/metrics/base.py | head -20
🏁 Script executed:
# Let's trace through a realistic scenario - when _compute_pass_at_k is called with None predicted_answers
python3 << 'EOF'
from collections import defaultdict

# Simulate OmniMetrics scenario
eval_dict = defaultdict(lambda: defaultdict(float))

# Simulate _compute_pass_at_k with empty score_dicts[0] and predicted_answers=None
score_dicts = [{}]  # Empty because no "judgement" field
predicted_answers = None
compute_no_answer = True

# The loop that depends on score_dicts[0]
for score_method in score_dicts[0].keys():
    # This doesn't execute
    eval_dict[f"pass@1"]["method"] += 1

# The no_answer logic
for k in range(1, 2):
    if predicted_answers is not None and compute_no_answer:
        # This condition is False because predicted_answers is None
        eval_dict[f"pass@{k}"]["no_answer"] += True

# Check what's in eval_dict
print("eval_dict after _compute_pass_at_k:", dict(eval_dict))
print("Is eval_dict empty?", len(eval_dict) == 0)

# Now simulate get_metrics flow
metrics_dict = {}
for agg_mode, agg_metric_dict in eval_dict.items():
    metrics_dict[agg_mode] = dict(agg_metric_dict)
print("metrics_dict from base get_metrics:", metrics_dict)

# Now OmniMetrics.get_metrics() tries to access
print("\nWould OmniMetrics.get_metrics() be called?")
if metrics_dict:
    print("  Yes, and it would try to access judge_correct, judge_incorrect, etc.")
    for agg_method, agg_metric_dict in metrics_dict.items():
        try:
            val = agg_metric_dict["judge_correct"]
            print(f"  Got {val}")
        except KeyError as e:
            print(f"  KeyError: {e}")
else:
    print("  No, metrics_dict is empty so the loop doesn't execute")
EOF
🏁 Script executed:
# Test the actual problematic scenario - mixed predictions with and without "judgement"
python3 << 'EOF'
from collections import defaultdict

# Simulate mixed case - some predictions have judgement, others don't
def _get_score_dict_omni(prediction):
    correctness_dict = {}
    if "judgement" in prediction:
        judgement = prediction["judgement"]
        correctness_dict["judge_correct"] = int(judgement.lower() == "a")
        correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
        correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
        correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
    return correctness_dict

# Test scenario: some predictions have judgement, others don't
predictions = [
    {"judgement": "A", "generation": "answer1"},
    {"generation": "answer2"},  # Missing judgement
]
score_dicts = [_get_score_dict_omni(pred) for pred in predictions]
print("score_dicts:", score_dicts)

# Now try _compute_pass_at_k logic
try:
    for score_method in score_dicts[0].keys():
        print(f"Processing score_method: {score_method}")
        # This is line 369 in _compute_pass_at_k
        scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]
        print(f"  scores_list: {scores_list}")
except KeyError as e:
    print(f"KeyError: {e}")
    print(f"Trying to access '{e.args[0]}' in one of the score_dicts")
EOF
🏁 Script executed:
# Check the update() method to see if there are any guards
sed -n '94,102p' nemo_skills/evaluation/metrics/omni_metrics.py
🏁 Script executed:
# Also check if OmniMetrics is used with consistent data
grep -r "OmniMetrics" nemo_skills/evaluation/ | grep -v ".pyc" | head -10
🏁 Script executed:
# Check if there's any validation or guarantee that all predictions have consistent structure
grep -B 5 -A 10 "def _compute_pass_at_k" nemo_skills/evaluation/metrics/base.py | head -25
🏁 Script executed:
# Test the exact error scenario again with the actual code
python3 << 'EOF'
# Reproduce the exact crash with line numbers from base.py
import math
from collections import defaultdict

# Simulate BaseMetrics._compute_pass_at_k
def _get_score_dict(prediction):
    """OmniMetrics version"""
    correctness_dict = {}
    if "judgement" in prediction:
        judgement = prediction["judgement"]
        correctness_dict["judge_correct"] = int(judgement.lower() == "a")
        correctness_dict["judge_incorrect"] = int(judgement.lower() == "b")
        correctness_dict["judge_partially_correct"] = int(judgement.lower() == "c")
        correctness_dict["judge_abstained"] = int(judgement.lower() == "d")
    return correctness_dict

def test_crash():
    predictions = [
        {"judgement": "A", "generation": "answer1"},
        {"generation": "answer2"},  # Missing judgement - will cause crash
    ]
    eval_dict = defaultdict(lambda: defaultdict(float))
    score_dicts = [_get_score_dict(pred) for pred in predictions]
    print(f"score_dicts = {score_dicts}")
    print(f"score_dicts[0].keys() = {list(score_dicts[0].keys())}")
    try:
        # This is line 368 in base.py _compute_pass_at_k
        for score_method in score_dicts[0].keys():
            # This is line 369 in base.py
            scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]
            print(f"Success: {score_method} = {scores_list}")
    except KeyError as e:
        print(f"CRASH at line 369: KeyError accessing '{e.args[0]}' in score_dicts[1]")
        return False
    return True

test_crash()
EOF
KeyError crash when predictions have inconsistent "judgement" field.
When some (but not all) predictions lack the "judgement" field, _get_score_dict returns an empty dict for those predictions. This causes a KeyError crash in _compute_pass_at_k (base.py line 369) when trying to build scores_list:
scores_list = [correctness_dict[score_method] for correctness_dict in score_dicts]
If score_dicts[0] has keys like "judge_correct" but score_dicts[i] is empty, accessing score_dicts[i]["judge_correct"] raises KeyError.
Validate that all predictions in a batch contain the "judgement" field, or ensure _get_score_dict returns consistent keys across all predictions (e.g., with default values).
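One way to implement the consistent-keys option is to pre-seed every score dict with the full key set. A minimal sketch (key names are taken from the snippet above; treating a missing or unrecognized judgement as all-zero scores is an assumption, not the project's decision):

```python
# Judge outcome keys used by OmniMetrics (from the snippet above).
JUDGE_KEYS = ("judge_correct", "judge_incorrect", "judge_partially_correct", "judge_abstained")

# Letter-to-key mapping matching the A/B/C/D judgement scheme.
JUDGE_MAP = {
    "a": "judge_correct",
    "b": "judge_incorrect",
    "c": "judge_partially_correct",
    "d": "judge_abstained",
}

def get_score_dict(prediction: dict) -> dict:
    # Pre-seed every key with 0 so all predictions in a batch share the
    # same key set and pass@k aggregation never hits a KeyError.
    scores = dict.fromkeys(JUDGE_KEYS, 0)
    judgement = prediction.get("judgement", "").lower()
    if judgement in JUDGE_MAP:
        scores[JUDGE_MAP[judgement]] = 1
    return scores
```

With this shape, a prediction lacking the "judgement" field simply contributes zeros instead of crashing the aggregation loop.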
| def get_metrics(self): | ||
| metrics = super().get_metrics() | ||
|
|
||
| for agg_method, agg_metric_dict in metrics.items(): | ||
| correct, incorrect, part_correct, abstained = ( | ||
| agg_metric_dict["judge_correct"], | ||
| agg_metric_dict["judge_incorrect"], | ||
| agg_metric_dict["judge_partially_correct"], | ||
| agg_metric_dict["judge_abstained"], | ||
| ) | ||
| metrics[agg_method]["judge_omni_index"] = ( | ||
| 100 * (correct - incorrect) / (correct + incorrect + part_correct + abstained) | ||
| ) | ||
| metrics[agg_method]["judge_omni_hallucination"] = 100 * incorrect / (incorrect + part_correct + abstained) | ||
| return metrics |
Potential ZeroDivisionError in metric calculations.
Two division operations can fail:
- Lines 82-83: (correct + incorrect + part_correct + abstained) equals zero if no judgements exist.
- Line 84: (incorrect + part_correct + abstained) equals zero when all responses are judge_correct (judgement "A").
This will crash metrics computation in edge cases (empty data or perfect scores).
Suggested fix with guards
def get_metrics(self):
metrics = super().get_metrics()
for agg_method, agg_metric_dict in metrics.items():
correct, incorrect, part_correct, abstained = (
agg_metric_dict["judge_correct"],
agg_metric_dict["judge_incorrect"],
agg_metric_dict["judge_partially_correct"],
agg_metric_dict["judge_abstained"],
)
- metrics[agg_method]["judge_omni_index"] = (
- 100 * (correct - incorrect) / (correct + incorrect + part_correct + abstained)
- )
- metrics[agg_method]["judge_omni_hallucination"] = 100 * incorrect / (incorrect + part_correct + abstained)
+ total = correct + incorrect + part_correct + abstained
+ non_correct_total = incorrect + part_correct + abstained
+
+ metrics[agg_method]["judge_omni_index"] = (
+ 100 * (correct - incorrect) / total if total > 0 else 0.0
+ )
+ metrics[agg_method]["judge_omni_hallucination"] = (
+ 100 * incorrect / non_correct_total if non_correct_total > 0 else 0.0
+ )
return metrics| def update(self, predictions): | ||
| super().update(predictions) | ||
| self._compute_pass_at_k(predictions, None) | ||
| if "reward_model_score" in predictions[0]: | ||
| self._compute_reward_at_k(predictions=predictions) |
IndexError risk with empty predictions.
Line 99 accesses predictions[0] without checking if predictions is non-empty. If update is called with an empty list, this will raise IndexError.
Suggested fix
def update(self, predictions):
super().update(predictions)
+ if not predictions:
+ return
self._compute_pass_at_k(predictions, None)
if "reward_model_score" in predictions[0]:
        self._compute_reward_at_k(predictions=predictions)
🤖 Prompt for AI Agents
In @nemo_skills/evaluation/metrics/omni_metrics.py around lines 96 - 100, The
update method accesses predictions[0] without checking for an empty list,
risking IndexError; after calling super().update(predictions) add a guard like
"if not predictions: return" to avoid further processing on an empty list, or at
minimum change the reward-model check to "if predictions and
'reward_model_score' in predictions[0]:"; ensure this guard is applied before
calling _compute_pass_at_k and _compute_reward_at_k so both methods aren't
invoked with an empty predictions list (refer to the update method and helpers
_compute_pass_at_k and _compute_reward_at_k).
| @@ -0,0 +1,26 @@ | |||
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |||
Copyright year is 2026
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. |
| @@ -0,0 +1,80 @@ | |||
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |||
Copyright year is 2026
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. |
| args = parse_args() | ||
|
|
||
| dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train") | ||
| jsonl_data = [format_entry(d) for d in dataset] |
jsonl_data variable is assigned but never used - can be removed
| @@ -0,0 +1,124 @@ | |||
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |||
Copyright year is 2026
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | |
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. |
|
|
||
| # If no valid answers, it's incorrect | ||
| if not valid_answers_and_results: | ||
| is_correct = False |
is_correct variable is assigned but never used - can be removed
| metrics[agg_method]["judge_omni_hallucination"] = ( | ||
| 100 * incorrect / (incorrect + part_correct + abstained) | ||
| if (incorrect + part_correct + abstained) > 0 else 0 | ||
| ) |
Consider verifying the denominator logic - if all predictions are correct, this returns 0, but the hallucination metric definition may need clarification for this edge case
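For what it's worth, the guarded arithmetic can be exercised in isolation; this standalone sketch mirrors the suggestion above (the 0.0 fallbacks for empty and all-correct batches are carried over as assumptions):

```python
def omni_metrics(correct: int, incorrect: int, part_correct: int, abstained: int):
    # Omni-index: net correctness over all judged responses, in percent.
    total = correct + incorrect + part_correct + abstained
    omni_index = 100 * (correct - incorrect) / total if total else 0.0
    # Hallucination rate: incorrect answers among all non-correct responses.
    non_correct = incorrect + part_correct + abstained
    hallucination = 100 * incorrect / non_correct if non_correct else 0.0
    return omni_index, hallucination
```

An all-correct batch yields (100.0, 0.0) and an empty batch yields (0.0, 0.0) instead of raising ZeroDivisionError; whether 0.0 is the right reading of "no hallucinations observed" is the open question in the comment above.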
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
…added data-dependent system prompt formatting Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
| score_dicts = [self._get_score_dict(pred) for pred in predictions] | ||
|
|
||
| for k in range(1, len(predictions) + 1): | ||
| for score_method in score_dicts[0].keys(): |
logic: if every prediction lacks the judgement field, score_dicts is a list of empty dicts and the loop silently records no scores; if predictions is empty, accessing score_dicts[0] raises IndexError
| for score_method in score_dicts[0].keys(): | |
| if not score_dicts or not score_dicts[0]: | |
| continue | |
| for score_method in score_dicts[0].keys(): |
| def update(self, predictions): | ||
| super().update(predictions) | ||
| self._compute_pass_at_k(predictions, None) | ||
| if "reward_model_score" in predictions[0]: |
logic: IndexError if predictions list is empty - accessing predictions[0] without checking if list is non-empty
| if "reward_model_score" in predictions[0]: | |
| if predictions and "reward_model_score" in predictions[0]: |
| if self.config.system is not None: | ||
| messages = [ | ||
| {"role": "system", "content": self.config.system}, | ||
| {"role": "system", "content": self.config.system.format(**input_dict)}, |
logic: .format() on system messages will unescape {{ to { in existing prompts. Files like nemo_skills/prompt/config/generic/hle.yaml (line 5-7: {{your explanation}}, {{your succinct, final answer}}, {{your confidence score}}) and nemo_skills/prompt/config/qwen/math-cot.yaml (line 3: \boxed{{}}) use {{ to display literal braces to the model. After this change, these become {your explanation}, {your answer}, and \boxed{}, which will either cause KeyError if those keys don't exist in input_dict, or worse, silently replace them with actual values if keys happen to match.
| {"role": "system", "content": self.config.system.format(**input_dict)}, | |
| {"role": "system", "content": self.config.system if isinstance(self.config.system, str) and '{' not in self.config.system else self.config.system.format(**input_dict) if '{' in self.config.system and '{{' not in self.config.system else self.config.system}, |
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
…Skills into akomaragiri/aai_omniscience
| if self.config.system is not None: | ||
| messages = [ | ||
| {"role": "system", "content": self.config.system}, | ||
| {"role": "system", "content": self.config.system.format(**input_dict)}, |
logic: calling .format(**input_dict) will convert {{ to { in existing system prompts that use double braces for literal display. For example, nemo_skills/prompt/config/qwen/math-cot.yaml:3 has \boxed{{}} which becomes \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 has {{your explanation}} which becomes {your explanation} - this will either cause KeyError if those keys don't exist in input_dict, or silently replace them with values if keys happen to match
| {"role": "system", "content": self.config.system.format(**input_dict)}, | |
| {"role": "system", "content": self.config.system.format_map(defaultdict(str, **input_dict))}, |
was the intention to unescape double braces in existing prompts, or should literal braces be preserved?
| def update(self, predictions): | ||
| super().update(predictions) | ||
| self._compute_pass_at_k(predictions, None) | ||
| if "reward_model_score" in predictions[0]: |
logic: IndexError if predictions is empty - accessing predictions[0] without checking length
| if "reward_model_score" in predictions[0]: | |
| if predictions and "reward_model_score" in predictions[0]: |
| score_dicts = [self._get_score_dict(pred) for pred in predictions] | ||
|
|
||
| for k in range(1, len(predictions) + 1): | ||
| for score_method in score_dicts[0].keys(): |
logic: IndexError if predictions is empty - score_dicts[0] is accessed on an empty list; when all predictions merely lack the judgement field, score_dicts[0] is an empty dict and no scores are computed
| for score_method in score_dicts[0].keys(): | |
| if not score_dicts or not score_dicts[0]: | |
| continue | |
| for score_method in score_dicts[0].keys(): |
| if self.config.system is not None: | ||
| messages = [ | ||
| {"role": "system", "content": self.config.system}, | ||
| {"role": "system", "content": self.config.system.format(**input_dict)}, |
logic: breaks existing prompts using {{ for literal braces. Files like nemo_skills/prompt/config/generic/hle.yaml (lines 5-7: {{your explanation}}, {{your succinct, final answer}}) and nemo_skills/prompt/config/eval/aai/math.yaml (lines 4,8: \boxed{{}}) use {{ to display literal braces. .format() converts {{ to {, causing either KeyError if keys don't exist, or unintended replacements if they do.
| {"role": "system", "content": self.config.system.format(**input_dict)}, | |
| {"role": "system", "content": self.config.system}, |
| def update(self, predictions): | ||
| super().update(predictions) | ||
| self._compute_pass_at_k(predictions, None) | ||
| if "reward_model_score" in predictions[0]: |
logic: IndexError if predictions is empty - accessing predictions[0] without checking length first
| if "reward_model_score" in predictions[0]: | |
| if predictions and "reward_model_score" in predictions[0]: |
| score_dicts = [self._get_score_dict(pred) for pred in predictions] | ||
|
|
||
| for k in range(1, len(predictions) + 1): | ||
| for score_method in score_dicts[0].keys(): |
logic: IndexError if predictions is empty - score_dicts[0] does not exist; if all predictions merely lack the judgement field, score_dicts[0] is an empty dict and the loop silently skips scoring
| for score_method in score_dicts[0].keys(): | |
| if not score_dicts or not score_dicts[0]: | |
| continue | |
| for score_method in score_dicts[0].keys(): |
| dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train") | ||
| jsonl_data = [format_entry(d) for d in dataset] |
style: jsonl_data created but never used - can be removed
| dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train") | |
| jsonl_data = [format_entry(d) for d in dataset] | |
| dataset = load_dataset("ArtificialAnalysis/AA-Omniscience-Public", split="train") |
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
…Skills into akomaragiri/aai_omniscience
| if self.config.system is not None: | ||
| messages = [ | ||
| {"role": "system", "content": self.config.system}, | ||
| {"role": "system", "content": self.config.system.format(**input_dict)}, |
logic: .format(**input_dict) will unescape {{ to { in existing system prompts. Files like nemo_skills/prompt/config/qwen/math-cot.yaml:3 use \boxed{{}} which will become \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 use {{your explanation}}, {{your succinct, final answer}} which will become {your explanation}, {your answer} - causing KeyError or unintended replacements
| {"role": "system", "content": self.config.system.format(**input_dict)}, | |
| {"role": "system", "content": self.config.system.format_map(defaultdict(str, input_dict))}, |
or use a safer approach that only formats when needed. Was the intention to unescape double braces in existing prompts, or should they be preserved?
| score_dicts = [self._get_score_dict(pred) for pred in predictions] | ||
|
|
||
| for k in range(1, len(predictions) + 1): | ||
| for score_method in score_dicts[0].keys(): |
logic: IndexError if predictions is empty - accessing score_dicts[0] without checking length
| for score_method in score_dicts[0].keys(): | |
| for score_method in (score_dicts[0].keys() if score_dicts else []): |
Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com>
| if self.config.system is not None: | ||
| messages = [ | ||
| {"role": "system", "content": self.config.system}, | ||
| {"role": "system", "content": self.config.system.format(**input_dict)}, |
logic: calling .format(**input_dict) unescapes {{ to { in existing system prompts. Prompts like nemo_skills/prompt/config/qwen/math-cot.yaml:3 (\boxed{{}}) become \boxed{} after formatting, and nemo_skills/prompt/config/generic/hle.yaml:5-7 ({{your explanation}}, {{your succinct, final answer}}) become {your explanation}, {your answer} - causing KeyError if those keys don't exist in input_dict, or unintended replacements if they do
| {"role": "system", "content": self.config.system.format(**input_dict)}, | |
| {"role": "system", "content": self.config.system.format_map(defaultdict(str, **input_dict))}, |
alternatively, only format when the system message contains the specific keys from input_dict, or escape existing braces before formatting. was the intention to unescape {{}} in existing prompts, or should literal braces be preserved?
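The brace behavior these reviews describe is standard str.format semantics and easy to verify; note that the suggested format_map(defaultdict(str, ...)) variant only avoids the KeyError for missing fields — it still collapses {{ }} to { }. A minimal illustration (the template below is invented for demonstration, echoing the \boxed{{}} prompts cited above):

```python
from collections import defaultdict

# Invented template mixing a literal-brace escape with a real field.
template = r"Put the answer in \boxed{{}} for {domain}."

# .format() fills known fields but also unescapes the double braces.
formatted = template.format(domain="math")
print(formatted)  # Put the answer in \boxed{} for math.

# format_map with a defaulting dict tolerates missing fields (empty string),
# but the {{ }} escape is still collapsed to { }.
tolerant = template.format_map(defaultdict(str))
print(tolerant)  # Put the answer in \boxed{} for .
```

So neither variant preserves literal braces in prompts written for the no-format code path; preserving them would require escaping the braces up front or skipping formatting entirely.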
| def update(self, predictions): | ||
| super().update(predictions) | ||
| self._compute_pass_at_k(predictions, None) | ||
| if "reward_model_score" in predictions[0]: |
logic: IndexError if predictions list is empty - accessing predictions[0] without checking length first
| if "reward_model_score" in predictions[0]: | |
| if predictions and "reward_model_score" in predictions[0]: |
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com> Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: Arnav Komaragiri <arnav.komaragiri@gmail.com> Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Co-authored-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com> Signed-off-by: Arnav Komaragiri <akomaragiri@nvidia.com> Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: Arnav Komaragiri <arnav.komaragiri@gmail.com> Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Co-authored-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Draft of AAI-Omniscience Benchmark in Nemo-Skills, will clean up before merging.