diff --git a/docs/evaluation/other-benchmarks.md b/docs/evaluation/other-benchmarks.md index 33ecc199d6..b1b9434d6c 100644 --- a/docs/evaluation/other-benchmarks.md +++ b/docs/evaluation/other-benchmarks.md @@ -127,6 +127,95 @@ After all jobs are complete, you can check the results in `/eval-res } ``` +### HotpotQA + +[HotpotQA](https://hotpotqa.github.io/) is a multi-hop question-answering benchmark that requires reasoning over multiple Wikipedia paragraphs. Two variants are supported: + +| Variant | Slug | Description | +|:---|:---|:---| +| **Distractor** | `hotpotqa` | Model receives the question plus 10 context paragraphs (2 gold + 8 distractors) and must return the answer **and** identify supporting-fact sentences. | +| **Closed-book** | `hotpotqa_closedbook` | Same questions, no context provided — tests the model's parametric knowledge. | + +- Benchmark definitions: [`nemo_skills/dataset/hotpotqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/__init__.py) and [`nemo_skills/dataset/hotpotqa_closedbook/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa_closedbook/__init__.py) +- Original benchmark source is the [HotpotQA repository](https://github.com/hotpotqa/hotpot). +- Uses 7,405 distractor-setting validation examples. Both variants share the same data; preparation is unified in [`nemo_skills/dataset/hotpotqa/prepare_utils.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/prepare_utils.py). The closed-book variant copies the prepared file from the distractor dataset (no separate download). +- Metrics follow the [official evaluation script](https://github.com/hotpotqa/hotpot/blob/master/hotpot_evaluate_v1.py): Answer EM/F1, Supporting-facts EM/F1, Joint EM/F1, plus alternative-aware substring matching. +- Both unfiltered and filtered (excluding unreliable questions) metrics are reported automatically. 
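The answer-level EM/F1 columns use the official script's SQuAD-style normalization (lowercase, strip punctuation and articles, collapse whitespace) followed by token-overlap F1. A minimal sketch of that computation, mirroring `hotpot_evaluate_v1.py` (the yes/no/noanswer special-casing from the official script is omitted here for brevity):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def answer_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(answer_f1("Paris, France", "Paris"), 2))  # 0.67
```

Exact match (`answer_em`) is simply equality of the two normalized strings; the joint metrics multiply answer and supporting-fact precision/recall before recomputing F1.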
+
+#### Data Preparation
+
+Prepare the distractor validation set (single source of truth), then the closed-book variant (copies from it):
+
+```bash
+ns prepare_data hotpotqa
+ns prepare_data hotpotqa_closedbook
+```
+
+You can also run `ns prepare_data hotpotqa_closedbook` alone; it will run the shared preparation for `hotpotqa` first if that data is not yet present, then copy it.
+
+#### Running the Evaluation
+
+Distractor evaluation (with context and supporting-fact scoring). Use `hotpotqa:4` for 4 seeds (produces the example results below):
+
+```bash
+ns eval \
+    --cluster=<cluster> \
+    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+    --server_type=vllm \
+    --server_gpus=8 \
+    --benchmarks=hotpotqa:4 \
+    --output_dir=<output_dir> \
+    --server_args="--max-model-len 32768" \
+    ++inference.temperature=1.0 \
+    ++inference.top_p=1.0 \
+    ++inference.tokens_to_generate=16384
+```
+
+Closed-book evaluation (no context). Use `hotpotqa_closedbook:4` for 4 seeds (produces the example results below):
+
+```bash
+ns eval \
+    --cluster=<cluster> \
+    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+    --server_type=vllm \
+    --server_gpus=8 \
+    --benchmarks=hotpotqa_closedbook:4 \
+    --output_dir=<output_dir> \
+    --server_args="--max-model-len 32768" \
+    ++inference.temperature=1.0 \
+    ++inference.top_p=1.0 \
+    ++inference.tokens_to_generate=16384
+```
+
+#### Verifying Results
+
+After all jobs are complete, check the results in `<output_dir>/eval-results/hotpotqa/metrics.json`.
+The results table is printed to stdout and captured in the summarize-results srun log.
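The metrics file maps each evaluation mode (e.g. `pass@1[avg-of-4]`, `filtered_pass@4`) to its metric values. A small illustrative sketch of pulling one column out of such a payload — the `sample` dict below is hypothetical (values copied from the example tables that follow), and the exact nesting in your run's `metrics.json` may differ:

```python
import json

# Hypothetical payload shaped like metrics.json; check the file your job wrote.
sample = {
    "pass@1[avg-of-4]": {"num_entries": 7405, "answer_em": 62.92, "answer_f1": 78.15},
    "pass@4": {"num_entries": 7405, "answer_em": 70.28, "answer_f1": 83.86},
    "filtered_pass@4": {"num_entries": 6057, "answer_em": 74.95, "answer_f1": 85.10},
}

def column(metrics: dict, key: str) -> dict:
    """Extract one metric column across all evaluation modes."""
    return {mode: vals[key] for mode, vals in metrics.items() if key in vals}

print(json.dumps(column(sample, "answer_f1"), indent=2))
```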
+ +Example distractor results (Nemotron-3-Nano, `hotpotqa:4`): + +```text +----------------------------------------------------------------------------- hotpotqa ----------------------------------------------------------------------------- +evaluation_mode | num_entries | answer_em | answer_f1 | sp_em | sp_f1 | joint_em | joint_f1 | is_correct | is_correct_strict +pass@1[avg-of-4] | 7405 | 62.92 ± 0.25 | 78.15 ± 0.16 | 21.52 ± 0.12 | 60.75 ± 0.21 | 15.45 ± 0.14 | 49.52 ± 0.15 | 73.35 ± 0.22 | 71.68 ± 0.26 +pass@4 | 7405 | 70.28 | 83.86 | 35.29 | 74.41 | 25.75 | 62.69 | 79.23 | 77.92 +filtered_pass@1[avg-of-4] | 6057 | 67.71 | 79.30 | 22.09 | 60.95 | 17.01 | 50.56 | 78.79 | 77.12 +filtered_pass@4 | 6057 | 74.95 | 85.10 | 35.86 | 74.55 | 27.92 | 63.88 | 84.27 | 83.11 +``` + +Example closed-book results (Nemotron-3-Nano, `hotpotqa_closedbook:4`): + +```text +----------------------------------------- hotpotqa_closedbook ------------------------------------------ +evaluation_mode | num_entries | answer_em | answer_f1 | is_correct | is_correct_strict +pass@1[avg-of-4] | 7405 | 29.05 ± 0.15 | 39.35 ± 0.18 | 33.14 ± 0.32 | 32.36 ± 0.28 +pass@4 | 7405 | 37.91 | 50.40 | 42.50 | 41.30 +filtered_pass@1[avg-of-4] | 6057 | 31.85 | 39.57 | 36.48 | 35.60 +filtered_pass@4 | 6057 | 41.59 | 51.01 | 46.77 | 45.44 +``` + +The closed-book variant reports answer-level metrics only (no supporting-fact or joint metrics). + ### AA-Omniscience This is a benchmark developed by AA to measure hallucinations in LLMs and penalize confidently-false answers. diff --git a/nemo_skills/dataset/hotpotqa/__init__.py b/nemo_skills/dataset/hotpotqa/__init__.py new file mode 100644 index 0000000000..84aadc0a65 --- /dev/null +++ b/nemo_skills/dataset/hotpotqa/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +METRICS_TYPE = "hotpotqa" +GENERATION_ARGS = "++prompt_config=eval/hotpotqa" +EVAL_SPLIT = "validation" diff --git a/nemo_skills/dataset/hotpotqa/prepare.py b/nemo_skills/dataset/hotpotqa/prepare.py new file mode 100644 index 0000000000..01a991e58e --- /dev/null +++ b/nemo_skills/dataset/hotpotqa/prepare.py @@ -0,0 +1,24 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Prepare HotpotQA distractor validation set. 
Single source of truth for this data."""
+
+from pathlib import Path
+
+from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation
+
+if __name__ == "__main__":
+    data_dir = Path(__file__).absolute().parent
+    data_dir.mkdir(exist_ok=True)
+    prepare_validation(data_dir / "validation.jsonl")
diff --git a/nemo_skills/dataset/hotpotqa/prepare_utils.py b/nemo_skills/dataset/hotpotqa/prepare_utils.py
new file mode 100644
index 0000000000..f2474a2ed7
--- /dev/null
+++ b/nemo_skills/dataset/hotpotqa/prepare_utils.py
@@ -0,0 +1,82 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Shared HotpotQA data formatting and preparation.
+
+Used by both hotpotqa and hotpotqa_closedbook so there is a single source of truth
+for downloading and formatting the distractor validation set.
+"""
+
+import json
+from pathlib import Path
+
+from datasets import load_dataset
+from tqdm import tqdm
+
+
+def format_context(context: dict) -> str:
+    """Format context paragraphs with titles and indexed sentences.
+
+    Each paragraph becomes:
+        Title: <title>
+        [0] <sentence 0>
+        [1] <sentence 1>
+        ...
+
+    Paragraphs are separated by blank lines.
+ """ + paragraphs = [] + for title, sentences in zip(context["title"], context["sentences"], strict=True): + lines = [f"Title: {title}"] + for idx, sent in enumerate(sentences): + lines.append(f"[{idx}] {sent.strip()}") + paragraphs.append("\n".join(lines)) + return "\n\n".join(paragraphs) + + +def format_entry(entry: dict) -> dict: + """Format a HotpotQA entry to match NeMo-Skills format.""" + supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)) + + return { + "id": entry["id"], + "question": entry["question"], + "expected_answer": entry["answer"], + "context": format_context(entry["context"]), + "supporting_facts": supporting_facts, + "type": entry["type"], + "level": entry["level"], + } + + +def prepare_validation(output_path: Path) -> int: + """Download HotpotQA distractor validation set and write NeMo-Skills format to output_path. + + Returns the number of examples written. + """ + output_path = Path(output_path) + output_path.parent.mkdir(parents=True, exist_ok=True) + + ds = load_dataset("hotpotqa/hotpot_qa", "distractor", split="validation") + + formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_path.name}")] + tmp_output_path = output_path.with_suffix(".jsonl.tmp") + with open(tmp_output_path, "wt", encoding="utf-8") as fout: + for formatted in formatted_entries: + json.dump(formatted, fout) + fout.write("\n") + tmp_output_path.replace(output_path) + + print(f"Wrote {len(ds)} examples to {output_path}") + return len(ds) diff --git a/nemo_skills/dataset/hotpotqa_closedbook/__init__.py b/nemo_skills/dataset/hotpotqa_closedbook/__init__.py new file mode 100644 index 0000000000..c35667173e --- /dev/null +++ b/nemo_skills/dataset/hotpotqa_closedbook/__init__.py @@ -0,0 +1,21 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Closed-book variant of HotpotQA: same questions, no context provided.
+# Reuses the hotpotqa validation data (copied) with a different prompt
+# and answer-only metrics.
+
+METRICS_TYPE = "hotpotqa_closedbook"
+GENERATION_ARGS = "++prompt_config=eval/hotpotqa_closedbook"
+EVAL_SPLIT = "validation"
diff --git a/nemo_skills/dataset/hotpotqa_closedbook/prepare.py b/nemo_skills/dataset/hotpotqa_closedbook/prepare.py
new file mode 100644
index 0000000000..933a3d79b3
--- /dev/null
+++ b/nemo_skills/dataset/hotpotqa_closedbook/prepare.py
@@ -0,0 +1,42 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Closed-book variant uses the same validation data as hotpotqa (distractor setting).
+# We reuse that file so there is only one real data preparation (in hotpotqa).
+ +import shutil +import sys +from pathlib import Path + +# Reuse the shared preparation so we don't require hotpotqa to be prepared first. +from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation + +if __name__ == "__main__": + data_dir = Path(__file__).absolute().parent + data_dir.mkdir(exist_ok=True) + output_file = data_dir / "validation.jsonl" + + hotpotqa_source = data_dir.parent / "hotpotqa" / "validation.jsonl" + + if hotpotqa_source.exists(): + shutil.copy2(hotpotqa_source, output_file) + print(f"Copied {hotpotqa_source} -> {output_file}") + else: + # Same data; run shared preparation for hotpotqa then copy here. + prepare_validation(hotpotqa_source) + if not hotpotqa_source.exists(): + print("Preparation did not create the expected file.", file=sys.stderr) + sys.exit(1) + shutil.copy2(hotpotqa_source, output_file) + print(f"Copied {hotpotqa_source} -> {output_file}") diff --git a/nemo_skills/evaluation/metrics/base.py b/nemo_skills/evaluation/metrics/base.py index 6422749def..fe6c0ee881 100644 --- a/nemo_skills/evaluation/metrics/base.py +++ b/nemo_skills/evaluation/metrics/base.py @@ -447,8 +447,7 @@ def as_int(metric_key: str, metric_value: float, all_metrics: dict): def as_float(metric_key: str, metric_value: float, all_metrics: dict): - if (metric_std := all_metrics.get(f"{metric_key}_statistics", {}).get("std_dev_across_runs")) is not None: - return f"{float(metric_value):.2f} ± {metric_std:.2f}" + """Format float for display (for real floats that are not scaled as percentages).""" return f"{float(metric_value):.2f}" diff --git a/nemo_skills/evaluation/metrics/hotpotqa_filtering.py b/nemo_skills/evaluation/metrics/hotpotqa_filtering.py new file mode 100644 index 0000000000..c03483f5b8 --- /dev/null +++ b/nemo_skills/evaluation/metrics/hotpotqa_filtering.py @@ -0,0 +1,287 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# HotpotQA answer normalization and filtering. +# Adapted from hmaron/nvidia-research-tlv-nemotron-hallucination-detection, +# experiments/hotpotqa_filtering/eval_package/hotpotqa_eval.py +# +# Generates alternative surface forms of ground-truth answers and flags +# questions that are unreliable for substring-based evaluation. + +import re + +__all__ = [ + "is_correct", + "is_correct_strict", + "normalize_gt", +] + +_NUM_WORDS = { + "one": "1", + "two": "2", + "three": "3", + "four": "4", + "five": "5", + "six": "6", + "seven": "7", + "eight": "8", + "nine": "9", + "ten": "10", + "eleven": "11", + "twelve": "12", + "thirteen": "13", + "fourteen": "14", + "fifteen": "15", + "sixteen": "16", + "seventeen": "17", + "eighteen": "18", + "nineteen": "19", + "twenty": "20", + "first": "1st", + "second": "2nd", + "third": "3rd", + "fourth": "4th", + "fifth": "5th", + "nineteenth": "19th", + "twentieth": "20th", + "twenty-first": "21st", +} +_NUM_DIGITS = {v: k for k, v in _NUM_WORDS.items()} + +_MAX_GT_LENGTH = 40 +_MIN_ALT_LENGTH = 3 + +_STOPWORDS = frozenset( + [ + "the", + "a", + "an", + "of", + "in", + "on", + "at", + "for", + "and", + "or", + "to", + "by", + "is", + "was", + "are", + "were", + "be", + "been", + "with", + "from", + "that", + "this", + "it", + "its", + "his", + "her", + "my", + "our", + "their", + "no", + "not", + "but", + "if", + "as", + "into", + "about", + "than", + "then", + ] +) + + +def 
_normalize_unicode(s: str) -> str: + """Normalize unicode whitespace, hyphens, and quotes for matching.""" + for c in "\u202f\u00a0\u2009\u200a\u2002\u2003": + s = s.replace(c, " ") + for c in "\u2010\u2011\u2012\u2013\u2014\u2015": + s = s.replace(c, "-") + s = s.replace("\u2019", "'").replace("\u2018", "'") + s = s.replace("\u201c", '"').replace("\u201d", '"') + while " " in s: + s = s.replace(" ", " ") + return s.strip() + + +def _gt_alternatives(gt: str) -> tuple[list[str], list[str]]: + """Generate valid surface-form alternatives for a ground-truth answer. + + Returns (sorted_alternatives, list_of_rule_tags_that_fired). + """ + alts = {gt} + rules = [] + + for prefix in ("the ", "a ", "an "): + if gt.lower().startswith(prefix): + alts.add(gt[len(prefix) :]) + rules.append("strip_article") + break + + stripped = gt.replace('"', "").replace("\u201c", "").replace("\u201d", "").strip() + if stripped and stripped != gt: + alts.add(stripped) + rules.append("strip_quotes") + + if "(" in gt: + no_parens = re.sub(r"\s*\([^)]*\)\s*", " ", gt).strip() + no_parens = re.sub(r"\s+", " ", no_parens) + if no_parens and no_parens != gt: + alts.add(no_parens) + for inner in re.findall(r"\(([^)]+)\)", gt): + inner = inner.strip() + if len(inner) > 1: + alts.add(inner) + rules.append("normalize_parens") + + gt_low = gt.lower().strip() + if gt_low in _NUM_WORDS: + alts.add(_NUM_WORDS[gt_low]) + rules.append("number_word_to_digit") + if gt_low in _NUM_DIGITS: + alts.add(_NUM_DIGITS[gt_low]) + rules.append("number_digit_to_word") + + no_commas = re.sub(r"(\d),(\d{3})", r"\1\2", gt) + while no_commas != gt and re.search(r"(\d),(\d{3})", no_commas): + no_commas = re.sub(r"(\d),(\d{3})", r"\1\2", no_commas) + if no_commas != gt: + alts.add(no_commas) + rules.append("strip_number_commas") + + if gt and gt[-1] in ".,;:!?": + alts.add(gt[:-1].rstrip()) + rules.append("strip_trailing_punct") + + if "." 
in gt: + no_dots = re.sub(r"(?<!\d)\.(?!\d)", "", gt) + if no_dots and len(no_dots) > 1 and no_dots != gt: + alts.add(no_dots) + rules.append("strip_abbrev_dots") + + if "-" in gt and not gt.startswith("-"): + no_hyphen = re.sub(r"\s+", " ", gt.replace("-", " ")).strip() + if no_hyphen != gt: + alts.add(no_hyphen) + rules.append("hyphen_to_space") + + if " & " in gt: + alts.add(gt.replace(" & ", " and ")) + rules.append("ampersand_to_and") + if " and " in gt.lower(): + idx = gt.lower().index(" and ") + alts.add(gt[:idx] + " & " + gt[idx + 5 :]) + rules.append("and_to_ampersand") + + extra = set() + for alt in list(alts): + for prefix in ("the ", "a ", "an "): + if alt.lower().startswith(prefix): + extra.add(alt[len(prefix) :]) + alts |= extra + + normed = set() + for a in alts: + a = re.sub(r"\s+", " ", a.strip()) + if a and (len(a) >= _MIN_ALT_LENGTH or a == gt.strip() or a.isdigit()): + normed.add(a) + + return sorted(normed), rules + + +def _is_multi_word_name(gt: str) -> bool: + """True if GT looks like a multi-word proper name unreliable for substring matching.""" + parts = gt.strip().rstrip(".").split() + n = len(parts) + if n in (3, 4): + return all(p[0].isupper() for p in parts) and all(p.lower() not in _STOPWORDS for p in parts) + if n in (5, 6): + caps = [p for p in parts if p[0].isupper() and p.lower() not in _STOPWORDS] + return len(caps) >= 3 + return False + + +def _should_remove(gt: str) -> tuple[bool, str]: + """Return (flag, reason). Reason is '' if not removed.""" + if len(gt) > _MAX_GT_LENGTH: + return True, "gt_too_long" + if _is_multi_word_name(gt): + return True, "multi_word_name" + return False, "" + + +def normalize_gt(gt_answer: str) -> dict: + """Normalize a single ground-truth answer on-the-fly. + + Returns dict with keys: + alternatives (list[str]): Valid surface forms (always includes original). + should_remove (bool): True if unreliable for substring eval. + remove_reason (str): '' | 'gt_too_long' | 'multi_word_name'. 
+ edited (bool): True if any rule fired. + edit_reasons (list[str]): Tags of rules that fired. + """ + alts, alt_rules = _gt_alternatives(gt_answer) + remove, remove_reason = _should_remove(gt_answer) + edit_reasons = list(alt_rules) + if remove_reason: + edit_reasons.append(remove_reason) + return { + "alternatives": alts, + "should_remove": remove, + "remove_reason": remove_reason, + "edited": bool(edit_reasons), + "edit_reasons": edit_reasons, + } + + +def is_correct(alternatives: list[str], model_answer: str) -> bool: + """Check if any alternative is a substring of the model answer. + + Args: + alternatives: List from normalize_gt()['alternatives']. + model_answer: The model's predicted answer string. + """ + ans = _normalize_unicode(model_answer.lower()) + return any(_normalize_unicode(alt.lower()) in ans for alt in alternatives) + + +def is_correct_strict(alternatives: list[str], model_answer: str) -> bool: + """Stricter matching that reduces false positives. + + Additional gates over is_correct(): + - Short alternatives (<=4 chars): require word-boundary match + - Long model answers (>80 chars): reject if match starts after position 40 + """ + ans = _normalize_unicode(model_answer.lower()) + ans_len = len(ans) + + for alt in alternatives: + alt_norm = _normalize_unicode(alt.lower()) + if not alt_norm: + continue + if alt_norm not in ans: + continue + if len(alt_norm) <= 4: + if not re.search(r"(?<!\w)" + re.escape(alt_norm) + r"(?!\w)", ans): + continue + if ans_len > 80: + pos = ans.find(alt_norm) + if pos > 40: + continue + return True + return False diff --git a/nemo_skills/evaluation/metrics/hotpotqa_metrics.py b/nemo_skills/evaluation/metrics/hotpotqa_metrics.py new file mode 100644 index 0000000000..5fedcee030 --- /dev/null +++ b/nemo_skills/evaluation/metrics/hotpotqa_metrics.py @@ -0,0 +1,320 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Metrics for HotpotQA multi-hop question answering. +# Answer normalization and scoring logic faithfully adapted from the official +# evaluation script: https://github.com/hotpotqa/hotpot/blob/master/hotpot_evaluate_v1.py +# +# On-the-fly filtering adapted from hmaron/nvidia-research-tlv-nemotron-hallucination-detection. +# Reports both unfiltered (all questions) and filtered (excluding unreliable questions) +# metrics in the output. + +import json +import re +import string +from collections import Counter, defaultdict + +from nemo_skills.evaluation.metrics.base import BaseMetrics, as_percentage +from nemo_skills.evaluation.metrics.hotpotqa_filtering import ( + is_correct, + is_correct_strict, + normalize_gt, +) + + +def normalize_answer(s: str) -> str: + """Normalize answer string (official HotpotQA / SQuAD normalization).""" + + def remove_articles(text): + return re.sub(r"\b(a|an|the)\b", " ", text) + + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(string.punctuation) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + return white_space_fix(remove_articles(remove_punc(lower(s)))) + + +def answer_f1_score(prediction: str, ground_truth: str) -> tuple[float, float, float]: + """Compute token-overlap F1, precision, and recall between prediction and ground truth. + + Returns (f1, precision, recall). 
Special-cases yes/no/noanswer tokens. + """ + normalized_prediction = normalize_answer(prediction) + normalized_ground_truth = normalize_answer(ground_truth) + + ZERO_METRIC = (0.0, 0.0, 0.0) + + if normalized_prediction in ["yes", "no", "noanswer"] and normalized_prediction != normalized_ground_truth: + return ZERO_METRIC + if normalized_ground_truth in ["yes", "no", "noanswer"] and normalized_prediction != normalized_ground_truth: + return ZERO_METRIC + + prediction_tokens = normalized_prediction.split() + ground_truth_tokens = normalized_ground_truth.split() + common = Counter(prediction_tokens) & Counter(ground_truth_tokens) + num_same = sum(common.values()) + if num_same == 0: + return ZERO_METRIC + precision = 1.0 * num_same / len(prediction_tokens) + recall = 1.0 * num_same / len(ground_truth_tokens) + f1 = (2 * precision * recall) / (precision + recall) + return f1, precision, recall + + +def answer_exact_match(prediction: str, ground_truth: str) -> float: + """Return 1.0 if normalized prediction matches normalized ground truth, else 0.0.""" + return float(normalize_answer(prediction) == normalize_answer(ground_truth)) + + +def sp_scores(prediction: list, gold: list) -> tuple[float, float, float, float]: + """Compute supporting facts EM, F1, precision, recall. + + Both prediction and gold are lists of [title, sent_id] pairs. + Returns (em, f1, precision, recall). 
+ """ + cur_sp_pred = set(map(tuple, prediction)) + gold_sp_set = set(map(tuple, gold)) + + tp, fp, fn = 0, 0, 0 + for e in cur_sp_pred: + if e in gold_sp_set: + tp += 1 + else: + fp += 1 + for e in gold_sp_set: + if e not in cur_sp_pred: + fn += 1 + + precision = 1.0 * tp / (tp + fp) if tp + fp > 0 else 0.0 + recall = 1.0 * tp / (tp + fn) if tp + fn > 0 else 0.0 + f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0 + em = 1.0 if fp + fn == 0 else 0.0 + return em, f1, precision, recall + + +def _try_parse_answer_json(text: str) -> tuple[str, list] | None: + """Try to parse a JSON string as a HotpotQA answer object. Returns (answer, sp) or None.""" + try: + parsed = json.loads(text) + if not isinstance(parsed, dict) or "answer" not in parsed: + return None + answer = str(parsed["answer"]) + sp = parsed.get("supporting_facts", []) + if isinstance(sp, list): + valid_sp = [] + for item in sp: + if isinstance(item, (list, tuple)) and len(item) == 2: + try: + valid_sp.append([str(item[0]), int(item[1])]) + except (ValueError, TypeError): + continue + return answer, valid_sp + return answer, [] + except (json.JSONDecodeError, ValueError, TypeError): + return None + + +def _extract_json_candidates(text: str) -> list[str]: + """Extract all brace-delimited JSON candidate strings from text, ordered by position.""" + candidates = [] + i = 0 + while i < len(text): + if text[i] == "{": + depth = 0 + for j in range(i, len(text)): + if text[j] == "{": + depth += 1 + elif text[j] == "}": + depth -= 1 + if depth == 0: + candidates.append(text[i : j + 1]) + i = j + 1 + break + else: + break + else: + i += 1 + return candidates + + +def parse_generation(generation: str) -> tuple[str, list]: + """Parse the model generation to extract the predicted answer and supporting facts. + + Searches for JSON objects containing an "answer" key. 
When reasoning precedes + the JSON output, the *last* valid JSON object is used (the model is prompted + to end its response with the JSON). + + Returns (answer_string, supporting_facts_list). + """ + if not generation: + return "", [] + + text = generation.strip() + + md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)) + for md_match in reversed(md_matches): + result = _try_parse_answer_json(md_match.group(1)) + if result is not None: + return result + + candidates = _extract_json_candidates(text) + for candidate in reversed(candidates): + result = _try_parse_answer_json(candidate) + if result is not None: + return result + + return text, [] + + +class HotpotQAMetrics(BaseMetrics): + """Metrics for HotpotQA multi-hop question answering. + + Computes three groups of metrics following the official evaluation script: + - Answer: EM and F1 (token-overlap with SQuAD-style normalization) + - Supporting facts (SP): EM and F1 (set-based over (title, sent_id) tuples) + - Joint: product of answer and SP precision/recall, with derived EM and F1 + + Also computes alternative-aware substring matching (is_correct / is_correct_strict) + and reports both unfiltered (all questions) and filtered (excluding unreliable + questions flagged by normalize_gt) metrics. + + When ``closed_book=True``, only answer-level metrics are computed (no SP or + joint metrics), since no supporting context is provided to the model. 
+ """ + + def __init__(self, compute_no_answer: bool = False, closed_book: bool = False): + self.closed_book = closed_book + super().__init__(compute_no_answer=compute_no_answer) + + def reset(self): + """Reset all counters including filtered-metric accumulators.""" + super().reset() + self.filtered_total = 0 + self.filtered_eval_dict = defaultdict(lambda: defaultdict(float)) + self._current_should_remove = False + + def _get_score_dict(self, prediction: dict) -> dict[str, float]: + """Compute answer, SP, joint, and alternative-match scores for one prediction.""" + generation = prediction["generation"] + expected_answer = prediction["expected_answer"] + + pred_answer, pred_sp = parse_generation(generation) + + ans_em = answer_exact_match(pred_answer, expected_answer) + ans_f1, ans_prec, ans_recall = answer_f1_score(pred_answer, expected_answer) + + gt_info = normalize_gt(expected_answer) + alternatives = gt_info["alternatives"] + alt_correct = float(is_correct(alternatives, pred_answer)) + alt_correct_strict = float(is_correct_strict(alternatives, pred_answer)) + + scores = { + "answer_em": ans_em, + "answer_f1": ans_f1, + "is_correct": alt_correct, + "is_correct_strict": alt_correct_strict, + } + + if not self.closed_book: + gold_sp = prediction["supporting_facts"] + sp_em, sp_f1, sp_prec, sp_recall = sp_scores(pred_sp, gold_sp) + + joint_prec = ans_prec * sp_prec + joint_recall = ans_recall * sp_recall + joint_f1 = ( + 2 * joint_prec * joint_recall / (joint_prec + joint_recall) if joint_prec + joint_recall > 0 else 0.0 + ) + joint_em = ans_em * sp_em + + scores["sp_em"] = sp_em + scores["sp_f1"] = sp_f1 + scores["joint_em"] = joint_em + scores["joint_f1"] = joint_f1 + + return scores + + def _update_score_metrics_for_pass( + self, + eval_dict, + k, + score_method, + score_dicts, + pass_score, + predictions, + predicted_answers, + ): + """Accumulate filtered metrics alongside the standard pass@k computation.""" + if self._current_should_remove: + return + + 
scores_list = [d[score_method] for d in score_dicts] + self.filtered_eval_dict[f"pass@{k}"][score_method] += pass_score + self.filtered_eval_dict[f"pass@1[avg-of-{k}]"][score_method] += sum(scores_list[:k]) / k + + def update(self, predictions): + """Update metrics with a batch of predictions for one question.""" + expected_answer = predictions[0]["expected_answer"] + gt_info = normalize_gt(expected_answer) + self._current_should_remove = gt_info["should_remove"] + + if not gt_info["should_remove"]: + self.filtered_total += 1 + + super().update(predictions) + self._compute_pass_at_k(predictions=predictions) + + def get_metrics(self): + """Return metrics dict with both unfiltered and filtered evaluation modes.""" + metrics = super().get_metrics() + + if self.filtered_total > 0: + for agg_mode, agg_dict in self.filtered_eval_dict.items(): + filtered_key = f"filtered_{agg_mode}" + metrics[filtered_key] = {"num_entries": self.filtered_total} + for metric_key, metric_value in agg_dict.items(): + if isinstance(metric_value, float): + metrics[filtered_key][metric_key] = 100.0 * metric_value / self.filtered_total + else: + metrics[filtered_key][metric_key] = metric_value + + return metrics + + def evaluations_to_print(self): + """Include filtered evaluation modes alongside the standard ones.""" + base = super().evaluations_to_print() + filtered = [f"filtered_{mode}" for mode in base] + return base + filtered + + def metrics_to_print(self): + """Return the ordered metric columns for the results table.""" + m = { + "num_entries": lambda key, val, metrics: str(val), + "answer_em": as_percentage, + "answer_f1": as_percentage, + } + if not self.closed_book: + m["sp_em"] = as_percentage + m["sp_f1"] = as_percentage + m["joint_em"] = as_percentage + m["joint_f1"] = as_percentage + m["is_correct"] = as_percentage + m["is_correct_strict"] = as_percentage + return m diff --git a/nemo_skills/evaluation/metrics/map_metrics.py b/nemo_skills/evaluation/metrics/map_metrics.py index 
92f9f3282c..a10108ac4e 100644 --- a/nemo_skills/evaluation/metrics/map_metrics.py +++ b/nemo_skills/evaluation/metrics/map_metrics.py @@ -32,6 +32,7 @@ from nemo_skills.evaluation.metrics.critpt_metrics import CritPtMetrics from nemo_skills.evaluation.metrics.gradingbench_metrics import GradingBenchMetrics from nemo_skills.evaluation.metrics.hleaa_metrics import HLEAAMetrics +from nemo_skills.evaluation.metrics.hotpotqa_metrics import HotpotQAMetrics from nemo_skills.evaluation.metrics.icpc_metrics import ICPCMetrics from nemo_skills.evaluation.metrics.if_metrics import IFMetrics from nemo_skills.evaluation.metrics.ioi_metrics import IOIMetrics @@ -88,6 +89,8 @@ "gradingbench": GradingBenchMetrics, "critpt": CritPtMetrics, "specdec": SpecdecMetrics, + "hotpotqa": HotpotQAMetrics, + "hotpotqa_closedbook": functools.partial(HotpotQAMetrics, closed_book=True), } diff --git a/nemo_skills/prompt/config/eval/hotpotqa.yaml b/nemo_skills/prompt/config/eval/hotpotqa.yaml new file mode 100644 index 0000000000..03aa87f69f --- /dev/null +++ b/nemo_skills/prompt/config/eval/hotpotqa.yaml @@ -0,0 +1,12 @@ +user: |- + Answer the following question based on the provided context. You must also identify the supporting facts: the specific sentences that are necessary to answer the question. + + Context: + {context} + + Question: {question} + + Provide your response in the following JSON format: + {{"answer": "<your answer>", "supporting_facts": [["<paragraph title>", <sentence index>], ...]}} + + Output only the JSON object, nothing else. diff --git a/nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml b/nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml new file mode 100644 index 0000000000..59b24ed3b7 --- /dev/null +++ b/nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml @@ -0,0 +1,9 @@ +user: |- + Answer the following question using your own knowledge. No external context is provided. 
+ + Question: {question} + + Provide your response in the following JSON format: + {{"answer": "<your answer>"}} + + Output only the JSON object, nothing else.