89 changes: 89 additions & 0 deletions docs/evaluation/other-benchmarks.md
@@ -127,6 +127,95 @@ After all jobs are complete, you can check the results in `<OUTPUT_DIR>/eval-res
}
```

### HotpotQA

[HotpotQA](https://hotpotqa.github.io/) is a multi-hop question-answering benchmark that requires reasoning over multiple Wikipedia paragraphs. Two variants are supported:

| Variant | Slug | Description |
|:---|:---|:---|
| **Distractor** | `hotpotqa` | Model receives the question plus 10 context paragraphs (2 gold + 8 distractors) and must return the answer **and** identify supporting-fact sentences. |
| **Closed-book** | `hotpotqa_closedbook` | Same questions, no context provided — tests the model's parametric knowledge. |

- Benchmark definitions: [`nemo_skills/dataset/hotpotqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/__init__.py) and [`nemo_skills/dataset/hotpotqa_closedbook/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa_closedbook/__init__.py)
- The original benchmark source is the [HotpotQA repository](https://github.com/hotpotqa/hotpot).
- Uses 7,405 distractor-setting validation examples. Both variants share the same data; preparation is unified in [`nemo_skills/dataset/hotpotqa/prepare_utils.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/prepare_utils.py). The closed-book variant copies the prepared file from the distractor dataset (no separate download).
- Metrics follow the [official evaluation script](https://github.com/hotpotqa/hotpot/blob/master/hotpot_evaluate_v1.py): Answer EM/F1, Supporting-facts EM/F1, Joint EM/F1, plus alternative-aware substring matching.
- Both unfiltered and filtered (excluding unreliable questions) metrics are reported automatically.
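The answer-level EM/F1 above build on the SQuAD-style answer normalization and token-overlap F1 used by the official script. A minimal sketch of those two metrics (an illustration, not the NeMo-Skills implementation):

```python
import re
import string
from collections import Counter


def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Supporting-fact EM/F1 is computed the same way over predicted vs. gold `(title, sentence_index)` pairs, and the joint metrics multiply answer and supporting-fact precision/recall together.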

#### Data Preparation

Prepare the distractor validation set first (it is the single source of truth), then the closed-book variant, which copies from it:

```bash
ns prepare_data hotpotqa
ns prepare_data hotpotqa_closedbook
```

You can also run `ns prepare_data hotpotqa_closedbook` alone; it will run the shared preparation for `hotpotqa` first if that data is not yet present, then copy it.

#### Running the Evaluation

Distractor evaluation (with context and supporting-fact scoring). Use `hotpotqa:4` for 4 seeds (produces the example results below):

```bash
ns eval \
--cluster=<CLUSTER_NAME> \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=8 \
--benchmarks=hotpotqa:4 \
--output_dir=<OUTPUT_DIR> \
--server_args="--max-model-len 32768" \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++inference.tokens_to_generate=16384
```

Closed-book evaluation (no context). Use `hotpotqa_closedbook:4` for 4 seeds (produces the example results below):

```bash
ns eval \
--cluster=<CLUSTER_NAME> \
--model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
--server_type=vllm \
--server_gpus=8 \
--benchmarks=hotpotqa_closedbook:4 \
--output_dir=<OUTPUT_DIR> \
--server_args="--max-model-len 32768" \
++inference.temperature=1.0 \
++inference.top_p=1.0 \
++inference.tokens_to_generate=16384
```

#### Verifying Results

After all jobs are complete, check the results in `<OUTPUT_DIR>/eval-results/hotpotqa/metrics.json`.
The results table is printed to stdout and captured in the summarize-results srun log.

Example distractor results (Nemotron-3-Nano, `hotpotqa:4`):

```text
----------------------------------------------------------------------------- hotpotqa -----------------------------------------------------------------------------
evaluation_mode | num_entries | answer_em | answer_f1 | sp_em | sp_f1 | joint_em | joint_f1 | is_correct | is_correct_strict
pass@1[avg-of-4] | 7405 | 62.92 ± 0.25 | 78.15 ± 0.16 | 21.52 ± 0.12 | 60.75 ± 0.21 | 15.45 ± 0.14 | 49.52 ± 0.15 | 73.35 ± 0.22 | 71.68 ± 0.26
pass@4 | 7405 | 70.28 | 83.86 | 35.29 | 74.41 | 25.75 | 62.69 | 79.23 | 77.92
filtered_pass@1[avg-of-4] | 6057 | 67.71 | 79.30 | 22.09 | 60.95 | 17.01 | 50.56 | 78.79 | 77.12
filtered_pass@4 | 6057 | 74.95 | 85.10 | 35.86 | 74.55 | 27.92 | 63.88 | 84.27 | 83.11
```

Example closed-book results (Nemotron-3-Nano, `hotpotqa_closedbook:4`):

```text
----------------------------------------- hotpotqa_closedbook ------------------------------------------
evaluation_mode | num_entries | answer_em | answer_f1 | is_correct | is_correct_strict
pass@1[avg-of-4] | 7405 | 29.05 ± 0.15 | 39.35 ± 0.18 | 33.14 ± 0.32 | 32.36 ± 0.28
pass@4 | 7405 | 37.91 | 50.40 | 42.50 | 41.30
filtered_pass@1[avg-of-4] | 6057 | 31.85 | 39.57 | 36.48 | 35.60
filtered_pass@4 | 6057 | 41.59 | 51.01 | 46.77 | 45.44
```

The closed-book variant reports answer-level metrics only (no supporting-fact or joint metrics).
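To inspect the numbers programmatically rather than from the printed table, you can load `metrics.json`. The schema below is an assumption (a mapping from evaluation mode to a dict of metric values, mirroring the table columns); confirm it against your own run before relying on specific keys:

```python
import json

# Hypothetical excerpt of metrics.json, with values taken from the
# distractor table above; in practice, read your real file instead:
#   metrics = json.load(open("<OUTPUT_DIR>/eval-results/hotpotqa/metrics.json"))
example = json.loads("""
{
  "pass@1[avg-of-4]": {"num_entries": 7405, "answer_em": 62.92, "answer_f1": 78.15}
}
""")

for mode, values in example.items():
    print(f"{mode}: answer_f1 = {values['answer_f1']}")
```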

### AA-Omniscience

This is a benchmark developed by AA to measure hallucinations in LLMs and penalize confidently-false answers.
Expand Down
17 changes: 17 additions & 0 deletions nemo_skills/dataset/hotpotqa/__init__.py
@@ -0,0 +1,17 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

METRICS_TYPE = "hotpotqa"
GENERATION_ARGS = "++prompt_config=eval/hotpotqa"
EVAL_SPLIT = "validation"
24 changes: 24 additions & 0 deletions nemo_skills/dataset/hotpotqa/prepare.py
@@ -0,0 +1,24 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Prepare HotpotQA distractor validation set. Single source of truth for this data."""

from pathlib import Path

from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation

if __name__ == "__main__":
data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
prepare_validation(data_dir / "validation.jsonl")
82 changes: 82 additions & 0 deletions nemo_skills/dataset/hotpotqa/prepare_utils.py
@@ -0,0 +1,82 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Shared HotpotQA data formatting and preparation.

Used by both hotpotqa and hotpotqa_closedbook so there is a single source of truth
for downloading and formatting the distractor validation set.
"""

import json
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm


def format_context(context: dict) -> str:
"""Format context paragraphs with titles and indexed sentences.

Each paragraph becomes:
Title: <title>
[0] <sentence 0>
[1] <sentence 1>
...

Paragraphs are separated by blank lines.
"""
paragraphs = []
for title, sentences in zip(context["title"], context["sentences"], strict=True):
lines = [f"Title: {title}"]
for idx, sent in enumerate(sentences):
lines.append(f"[{idx}] {sent.strip()}")
paragraphs.append("\n".join(lines))
return "\n\n".join(paragraphs)


def format_entry(entry: dict) -> dict:
"""Format a HotpotQA entry to match NeMo-Skills format."""
supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True))

return {
"id": entry["id"],
"question": entry["question"],
"expected_answer": entry["answer"],
"context": format_context(entry["context"]),
"supporting_facts": supporting_facts,
"type": entry["type"],
"level": entry["level"],
}


def prepare_validation(output_path: Path) -> int:
"""Download HotpotQA distractor validation set and write NeMo-Skills format to output_path.

Returns the number of examples written.
"""
output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)

ds = load_dataset("hotpotqa/hotpot_qa", "distractor", split="validation")

formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_path.name}")]
tmp_output_path = output_path.with_suffix(".jsonl.tmp")
with open(tmp_output_path, "wt", encoding="utf-8") as fout:
for formatted in formatted_entries:
json.dump(formatted, fout)
fout.write("\n")
tmp_output_path.replace(output_path)

print(f"Wrote {len(ds)} examples to {output_path}")
return len(ds)
21 changes: 21 additions & 0 deletions nemo_skills/dataset/hotpotqa_closedbook/__init__.py
@@ -0,0 +1,21 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Closed-book variant of HotpotQA: same questions, no context provided.
# Reuses the hotpotqa validation data (copied during preparation) with a
# different prompt and answer-only metrics.

METRICS_TYPE = "hotpotqa_closedbook"
GENERATION_ARGS = "++prompt_config=eval/hotpotqa_closedbook"
EVAL_SPLIT = "validation"
42 changes: 42 additions & 0 deletions nemo_skills/dataset/hotpotqa_closedbook/prepare.py
@@ -0,0 +1,42 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Closed-book variant uses the same validation data as hotpotqa (distractor setting).
# We reuse that file so there is only one real data preparation (in hotpotqa).

import shutil
import sys
from pathlib import Path

# Reuse the shared preparation so we don't require hotpotqa to be prepared first.
from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation

if __name__ == "__main__":
data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / "validation.jsonl"

hotpotqa_source = data_dir.parent / "hotpotqa" / "validation.jsonl"

if hotpotqa_source.exists():
shutil.copy2(hotpotqa_source, output_file)
print(f"Copied {hotpotqa_source} -> {output_file}")
else:
# Same data; run shared preparation for hotpotqa then copy here.
prepare_validation(hotpotqa_source)
if not hotpotqa_source.exists():
print("Preparation did not create the expected file.", file=sys.stderr)
sys.exit(1)
shutil.copy2(hotpotqa_source, output_file)
print(f"Copied {hotpotqa_source} -> {output_file}")
3 changes: 1 addition & 2 deletions nemo_skills/evaluation/metrics/base.py
@@ -447,8 +447,7 @@ def as_int(metric_key: str, metric_value: float, all_metrics: dict):


 def as_float(metric_key: str, metric_value: float, all_metrics: dict):
-    if (metric_std := all_metrics.get(f"{metric_key}_statistics", {}).get("std_dev_across_runs")) is not None:
-        return f"{float(metric_value):.2f} ± {metric_std:.2f}"
+    """Format float for display (for real floats that are not scaled as percentages)."""
     return f"{float(metric_value):.2f}"

