89 changes: 89 additions & 0 deletions docs/evaluation/other-benchmarks.md
@@ -127,6 +127,95 @@ After all jobs are complete, you can check the results in `<OUTPUT_DIR>/eval-res
}
```

### HotpotQA

[HotpotQA](https://hotpotqa.github.io/) is a multi-hop question-answering benchmark that requires reasoning over multiple Wikipedia paragraphs. Two variants are supported:

| Variant | Slug | Description |
|:---|:---|:---|
| **Distractor** | `hotpotqa` | Model receives the question plus 10 context paragraphs (2 gold + 8 distractors) and must return the answer **and** identify supporting-fact sentences. |
| **Closed-book** | `hotpotqa_closedbook` | Same questions, no context provided — tests the model's parametric knowledge. |

- Benchmark definitions: [`nemo_skills/dataset/hotpotqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/__init__.py) and [`nemo_skills/dataset/hotpotqa_closedbook/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa_closedbook/__init__.py)
- The original benchmark source is the [HotpotQA repository](https://github.com/hotpotqa/hotpot).
- Uses 7,405 distractor-setting validation examples. Both variants share the same data; preparation is unified in [`nemo_skills/dataset/hotpotqa/prepare_utils.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/prepare_utils.py). The closed-book variant copies the prepared file from the distractor dataset (no separate download).
- Metrics follow the [official evaluation script](https://github.com/hotpotqa/hotpot/blob/master/hotpot_evaluate_v1.py): Answer EM/F1, Supporting-facts EM/F1, Joint EM/F1, plus alternative-aware substring matching.
- Both unfiltered and filtered (excluding unreliable questions) metrics are reported automatically.
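The answer-level EM/F1 above build on the SQuAD-style answer normalization and token-overlap F1 used by the official script. A minimal sketch of those two metrics (an illustration, not the NeMo-Skills implementation):

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Supporting-fact EM/F1 is computed the same way over predicted vs. gold `(title, sentence_index)` pairs, and the joint metrics multiply answer and supporting-fact precision/recall together.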

#### Data Preparation

Prepare the distractor validation set first (it is the single source of truth), then the closed-book variant, which copies from it:

```bash
ns prepare_data hotpotqa
ns prepare_data hotpotqa_closedbook
```

You can also run `ns prepare_data hotpotqa_closedbook` alone; it will run the shared preparation for `hotpotqa` first if that data is not yet present, then copy it.

#### Running the Evaluation

Distractor evaluation (with context and supporting-fact scoring). Use `hotpotqa:4` for 4 seeds (produces the example results below):

```bash
ns eval \
--cluster=<CLUSTER_NAME> \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=8 \
--benchmarks=hotpotqa:4 \
--output_dir=<OUTPUT_DIR> \
--server_args="--max-model-len 32768" \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++inference.tokens_to_generate=16384
```

Closed-book evaluation (no context). Use `hotpotqa_closedbook:4` for 4 seeds (produces the example results below):

```bash
ns eval \
--cluster=<CLUSTER_NAME> \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=8 \
--benchmarks=hotpotqa_closedbook:4 \
--output_dir=<OUTPUT_DIR> \
--server_args="--max-model-len 32768" \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++inference.tokens_to_generate=16384
```

#### Verifying Results

After all jobs are complete, check the results in `<OUTPUT_DIR>/eval-results/hotpotqa/metrics.json`.
The results table is printed to stdout and captured in the summarize-results srun log.

Example distractor results (Nemotron-3-Nano, `hotpotqa:4`):

```text
----------------------------------------------------------------------------- hotpotqa -----------------------------------------------------------------------------
evaluation_mode | num_entries | answer_em | answer_f1 | sp_em | sp_f1 | joint_em | joint_f1 | is_correct | is_correct_strict
pass@1[avg-of-4] | 7405 | 62.92 ± 0.25 | 78.15 ± 0.16 | 21.52 ± 0.12 | 60.75 ± 0.21 | 15.45 ± 0.14 | 49.52 ± 0.15 | 73.35 ± 0.22 | 71.68 ± 0.26
pass@4 | 7405 | 70.28 | 83.86 | 35.29 | 74.41 | 25.75 | 62.69 | 79.23 | 77.92
filtered_pass@1[avg-of-4] | 6057 | 67.71 | 79.30 | 22.09 | 60.95 | 17.01 | 50.56 | 78.79 | 77.12
filtered_pass@4 | 6057 | 74.95 | 85.10 | 35.86 | 74.55 | 27.92 | 63.88 | 84.27 | 83.11
```

Example closed-book results (Nemotron-3-Nano, `hotpotqa_closedbook:4`):

```text
----------------------------------------- hotpotqa_closedbook ------------------------------------------
evaluation_mode | num_entries | answer_em | answer_f1 | is_correct | is_correct_strict
pass@1[avg-of-4] | 7405 | 29.05 ± 0.15 | 39.35 ± 0.18 | 33.14 ± 0.32 | 32.36 ± 0.28
pass@4 | 7405 | 37.91 | 50.40 | 42.50 | 41.30
filtered_pass@1[avg-of-4] | 6057 | 31.85 | 39.57 | 36.48 | 35.60
filtered_pass@4 | 6057 | 41.59 | 51.01 | 46.77 | 45.44
```

The closed-book variant reports answer-level metrics only (no supporting-fact or joint metrics).
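To inspect the numbers programmatically rather than from the printed table, you can load `metrics.json`. The schema below is an assumption (a mapping from evaluation mode to a dict of metric values, mirroring the table columns); confirm it against your own run before relying on specific keys:

```python
import json

# Hypothetical excerpt of metrics.json, with values taken from the
# distractor table above; in practice, read your real file instead:
#   metrics = json.load(open("<OUTPUT_DIR>/eval-results/hotpotqa/metrics.json"))
example = json.loads("""
{
  "pass@1[avg-of-4]": {"num_entries": 7405, "answer_em": 62.92, "answer_f1": 78.15}
}
""")

for mode, values in example.items():
    print(f"{mode}: answer_f1 = {values['answer_f1']}")
```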

### AA-Omniscience

This is a benchmark developed by AA to measure hallucinations in LLMs and penalize confidently-false answers.
Expand Down
17 changes: 17 additions & 0 deletions nemo_skills/dataset/hotpotqa/__init__.py
@@ -0,0 +1,17 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

METRICS_TYPE = "hotpotqa"
GENERATION_ARGS = "++prompt_config=eval/hotpotqa"
EVAL_SPLIT = "validation"
24 changes: 24 additions & 0 deletions nemo_skills/dataset/hotpotqa/prepare.py
@@ -0,0 +1,24 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Prepare HotpotQA distractor validation set. Single source of truth for this data."""

from pathlib import Path

from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation

if __name__ == "__main__":
data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
prepare_validation(data_dir / "validation.jsonl")
82 changes: 82 additions & 0 deletions nemo_skills/dataset/hotpotqa/prepare_utils.py
@@ -0,0 +1,82 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Shared HotpotQA data formatting and preparation.

Used by both hotpotqa and hotpotqa_closedbook so there is a single source of truth
for downloading and formatting the distractor validation set.
"""

import json
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm


def format_context(context: dict) -> str:
"""Format context paragraphs with titles and indexed sentences.

Each paragraph becomes:
Title: <title>
[0] <sentence 0>
[1] <sentence 1>
...

Paragraphs are separated by blank lines.
"""
paragraphs = []
for title, sentences in zip(context["title"], context["sentences"], strict=True):
lines = [f"Title: {title}"]
for idx, sent in enumerate(sentences):
lines.append(f"[{idx}] {sent.strip()}")
paragraphs.append("\n".join(lines))
return "\n\n".join(paragraphs)


def format_entry(entry: dict) -> dict:
"""Format a HotpotQA entry to match NeMo-Skills format."""
supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True))

return {
"id": entry["id"],
"question": entry["question"],
"expected_answer": entry["answer"],
"context": format_context(entry["context"]),
"supporting_facts": supporting_facts,
"type": entry["type"],
"level": entry["level"],
}


def prepare_validation(output_path: Path) -> int:
"""Download HotpotQA distractor validation set and write NeMo-Skills format to output_path.

Returns the number of examples written.
"""
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)

ds = load_dataset("hotpotqa/hotpot_qa", "distractor", split="validation")

formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_path.name}")]
tmp_output_path = output_path.with_suffix(".jsonl.tmp")
with open(tmp_output_path, "wt", encoding="utf-8") as fout:
for formatted in formatted_entries:
json.dump(formatted, fout)
fout.write("\n")
tmp_output_path.replace(output_path)

print(f"Wrote {len(ds)} examples to {output_path}")
return len(ds)
21 changes: 21 additions & 0 deletions nemo_skills/dataset/hotpotqa_closedbook/__init__.py
@@ -0,0 +1,21 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Closed-book variant of HotpotQA: same questions, no context provided.
# Reuses the hotpotqa validation data (copied during preparation) with a
# different prompt and answer-only metrics.

METRICS_TYPE = "hotpotqa_closedbook"
GENERATION_ARGS = "++prompt_config=eval/hotpotqa_closedbook"
EVAL_SPLIT = "validation"
42 changes: 42 additions & 0 deletions nemo_skills/dataset/hotpotqa_closedbook/prepare.py
@@ -0,0 +1,42 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Closed-book variant uses the same validation data as hotpotqa (distractor setting).
# We reuse that file so there is only one real data preparation (in hotpotqa).

import shutil
import sys
from pathlib import Path

# Reuse the shared preparation so we don't require hotpotqa to be prepared first.
from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation

if __name__ == "__main__":
data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / "validation.jsonl"

hotpotqa_source = data_dir.parent / "hotpotqa" / "validation.jsonl"

if hotpotqa_source.exists():
shutil.copy2(hotpotqa_source, output_file)
print(f"Copied {hotpotqa_source} -> {output_file}")
else:
# Same data; run shared preparation for hotpotqa then copy here.
prepare_validation(hotpotqa_source)
if not hotpotqa_source.exists():
print("Preparation did not create the expected file.", file=sys.stderr)
sys.exit(1)
shutil.copy2(hotpotqa_source, output_file)
print(f"Copied {hotpotqa_source} -> {output_file}")
3 changes: 1 addition & 2 deletions nemo_skills/evaluation/metrics/base.py
@@ -447,8 +447,7 @@ def as_int(metric_key: str, metric_value: float, all_metrics: dict):


 def as_float(metric_key: str, metric_value: float, all_metrics: dict):
-    if (metric_std := all_metrics.get(f"{metric_key}_statistics", {}).get("std_dev_across_runs")) is not None:
-        return f"{float(metric_value):.2f} ± {metric_std:.2f}"
+    """Format float for display (for real floats that are not scaled as percentages)."""
     return f"{float(metric_value):.2f}"

