diff --git a/docs/evaluation/other-benchmarks.md b/docs/evaluation/other-benchmarks.md index 33ecc199d6..b1b9434d6c 100644 --- a/docs/evaluation/other-benchmarks.md +++ b/docs/evaluation/other-benchmarks.md @@ -127,6 +127,95 @@ After all jobs are complete, you can check the results in `/eval-res } ``` +### HotpotQA + +[HotpotQA](https://hotpotqa.github.io/) is a multi-hop question-answering benchmark that requires reasoning over multiple Wikipedia paragraphs. Two variants are supported: + +| Variant | Slug | Description | +|:---|:---|:---| +| **Distractor** | `hotpotqa` | Model receives the question plus 10 context paragraphs (2 gold + 8 distractors) and must return the answer **and** identify supporting-fact sentences. | +| **Closed-book** | `hotpotqa_closedbook` | Same questions, no context provided — tests the model's parametric knowledge. | + +- Benchmark definitions: [`nemo_skills/dataset/hotpotqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/__init__.py) and [`nemo_skills/dataset/hotpotqa_closedbook/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa_closedbook/__init__.py) +- Original benchmark source is the [HotpotQA repository](https://github.com/hotpotqa/hotpot). +- Uses 7,405 distractor-setting validation examples. Both variants share the same data; preparation is unified in [`nemo_skills/dataset/hotpotqa/prepare_utils.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hotpotqa/prepare_utils.py). The closed-book variant copies the prepared file from the distractor dataset (no separate download). +- Metrics follow the [official evaluation script](https://github.com/hotpotqa/hotpot/blob/master/hotpot_evaluate_v1.py): Answer EM/F1, Supporting-facts EM/F1, Joint EM/F1, plus alternative-aware substring matching. +- Both unfiltered and filtered (excluding unreliable questions) metrics are reported automatically. 
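The answer-level EM/F1 columns use the official script's SQuAD-style normalization (lowercase, strip punctuation and articles, collapse whitespace) followed by token-overlap F1. A minimal sketch of that computation, mirroring `hotpot_evaluate_v1.py` (the yes/no/noanswer special-casing from the official script is omitted here for brevity):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def answer_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(answer_f1("Paris, France", "Paris"), 2))  # 0.67
```

Exact match (`answer_em`) is simply equality of the two normalized strings; the joint metrics multiply answer and supporting-fact precision/recall before recomputing F1.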
+
+#### Data Preparation
+
+Prepare the distractor validation set (single source of truth), then the closed-book variant (copies from it):
+
+```bash
+ns prepare_data hotpotqa
+ns prepare_data hotpotqa_closedbook
+```
+
+You can also run `ns prepare_data hotpotqa_closedbook` alone; it will run the shared preparation for `hotpotqa` first if that data is not yet present, then copy it.
+
+#### Running the Evaluation
+
+Distractor evaluation (with context and supporting-fact scoring). Use `hotpotqa:4` for 4 seeds (produces the example results below):
+
+```bash
+ns eval \
+    --cluster=<cluster> \
+    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+    --server_type=vllm \
+    --server_gpus=8 \
+    --benchmarks=hotpotqa:4 \
+    --output_dir=<output_dir> \
+    --server_args="--max-model-len 32768" \
+    ++inference.temperature=1.0 \
+    ++inference.top_p=1.0 \
+    ++inference.tokens_to_generate=16384
+```
+
+Closed-book evaluation (no context). Use `hotpotqa_closedbook:4` for 4 seeds (produces the example results below):
+
+```bash
+ns eval \
+    --cluster=<cluster> \
+    --model=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
+    --server_type=vllm \
+    --server_gpus=8 \
+    --benchmarks=hotpotqa_closedbook:4 \
+    --output_dir=<output_dir> \
+    --server_args="--max-model-len 32768" \
+    ++inference.temperature=1.0 \
+    ++inference.top_p=1.0 \
+    ++inference.tokens_to_generate=16384
+```
+
+#### Verifying Results
+
+After all jobs are complete, check the results in `<output_dir>/eval-results/hotpotqa/metrics.json`.
+The results table is printed to stdout and captured in the summarize-results srun log.
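The metrics file maps each evaluation mode (e.g. `pass@1[avg-of-4]`, `filtered_pass@4`) to its metric values. A small illustrative sketch of pulling one column out of such a payload — the `sample` dict below is hypothetical (values copied from the example tables that follow), and the exact nesting in your run's `metrics.json` may differ:

```python
import json

# Hypothetical payload shaped like metrics.json; check the file your job wrote.
sample = {
    "pass@1[avg-of-4]": {"num_entries": 7405, "answer_em": 62.92, "answer_f1": 78.15},
    "pass@4": {"num_entries": 7405, "answer_em": 70.28, "answer_f1": 83.86},
    "filtered_pass@4": {"num_entries": 6057, "answer_em": 74.95, "answer_f1": 85.10},
}

def column(metrics: dict, key: str) -> dict:
    """Extract one metric column across all evaluation modes."""
    return {mode: vals[key] for mode, vals in metrics.items() if key in vals}

print(json.dumps(column(sample, "answer_f1"), indent=2))
```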
+ +Example distractor results (Nemotron-3-Nano, `hotpotqa:4`): + +```text +----------------------------------------------------------------------------- hotpotqa ----------------------------------------------------------------------------- +evaluation_mode | num_entries | answer_em | answer_f1 | sp_em | sp_f1 | joint_em | joint_f1 | is_correct | is_correct_strict +pass@1[avg-of-4] | 7405 | 62.92 ± 0.25 | 78.15 ± 0.16 | 21.52 ± 0.12 | 60.75 ± 0.21 | 15.45 ± 0.14 | 49.52 ± 0.15 | 73.35 ± 0.22 | 71.68 ± 0.26 +pass@4 | 7405 | 70.28 | 83.86 | 35.29 | 74.41 | 25.75 | 62.69 | 79.23 | 77.92 +filtered_pass@1[avg-of-4] | 6057 | 67.71 | 79.30 | 22.09 | 60.95 | 17.01 | 50.56 | 78.79 | 77.12 +filtered_pass@4 | 6057 | 74.95 | 85.10 | 35.86 | 74.55 | 27.92 | 63.88 | 84.27 | 83.11 +``` + +Example closed-book results (Nemotron-3-Nano, `hotpotqa_closedbook:4`): + +```text +----------------------------------------- hotpotqa_closedbook ------------------------------------------ +evaluation_mode | num_entries | answer_em | answer_f1 | is_correct | is_correct_strict +pass@1[avg-of-4] | 7405 | 29.05 ± 0.15 | 39.35 ± 0.18 | 33.14 ± 0.32 | 32.36 ± 0.28 +pass@4 | 7405 | 37.91 | 50.40 | 42.50 | 41.30 +filtered_pass@1[avg-of-4] | 6057 | 31.85 | 39.57 | 36.48 | 35.60 +filtered_pass@4 | 6057 | 41.59 | 51.01 | 46.77 | 45.44 +``` + +The closed-book variant reports answer-level metrics only (no supporting-fact or joint metrics). + ### AA-Omniscience This is a benchmark developed by AA to measure hallucinations in LLMs and penalize confidently-false answers. diff --git a/nemo_skills/dataset/hotpotqa/__init__.py b/nemo_skills/dataset/hotpotqa/__init__.py new file mode 100644 index 0000000000..84aadc0a65 --- /dev/null +++ b/nemo_skills/dataset/hotpotqa/__init__.py @@ -0,0 +1,17 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +METRICS_TYPE = "hotpotqa" +GENERATION_ARGS = "++prompt_config=eval/hotpotqa" +EVAL_SPLIT = "validation" diff --git a/nemo_skills/dataset/hotpotqa/prepare.py b/nemo_skills/dataset/hotpotqa/prepare.py new file mode 100644 index 0000000000..01a991e58e --- /dev/null +++ b/nemo_skills/dataset/hotpotqa/prepare.py @@ -0,0 +1,24 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Prepare HotpotQA distractor validation set. 
Single source of truth for this data."""
+
+from pathlib import Path
+
+from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation
+
+if __name__ == "__main__":
+    data_dir = Path(__file__).absolute().parent
+    data_dir.mkdir(exist_ok=True)
+    prepare_validation(data_dir / "validation.jsonl")
diff --git a/nemo_skills/dataset/hotpotqa/prepare_utils.py b/nemo_skills/dataset/hotpotqa/prepare_utils.py
new file mode 100644
index 0000000000..f2474a2ed7
--- /dev/null
+++ b/nemo_skills/dataset/hotpotqa/prepare_utils.py
@@ -0,0 +1,82 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Shared HotpotQA data formatting and preparation.
+
+Used by both hotpotqa and hotpotqa_closedbook so there is a single source of truth
+for downloading and formatting the distractor validation set.
+"""
+
+import json
+from pathlib import Path
+
+from datasets import load_dataset
+from tqdm import tqdm
+
+
+def format_context(context: dict) -> str:
+    """Format context paragraphs with titles and indexed sentences.
+
+    Each paragraph becomes:
+        Title: <title>
+        [0] <sentence 0>
+        [1] <sentence 1>
+        ...
+
+    Paragraphs are separated by blank lines.
+ """ + paragraphs = [] + for title, sentences in zip(context["title"], context["sentences"], strict=True): + lines = [f"Title: {title}"] + for idx, sent in enumerate(sentences): + lines.append(f"[{idx}] {sent.strip()}") + paragraphs.append("\n".join(lines)) + return "\n\n".join(paragraphs) + + +def format_entry(entry: dict) -> dict: + """Format a HotpotQA entry to match NeMo-Skills format.""" + supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)) + + return { + "id": entry["id"], + "question": entry["question"], + "expected_answer": entry["answer"], + "context": format_context(entry["context"]), + "supporting_facts": supporting_facts, + "type": entry["type"], + "level": entry["level"], + } + + +def prepare_validation(output_path: Path) -> int: + """Download HotpotQA distractor validation set and write NeMo-Skills format to output_path. + + Returns the number of examples written. + """ + output_path = Path(output_path) + output_path.parent.mkdir(parents=True, exist_ok=True) + + ds = load_dataset("hotpotqa/hotpot_qa", "distractor", split="validation") + + formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_path.name}")] + tmp_output_path = output_path.with_suffix(".jsonl.tmp") + with open(tmp_output_path, "wt", encoding="utf-8") as fout: + for formatted in formatted_entries: + json.dump(formatted, fout) + fout.write("\n") + tmp_output_path.replace(output_path) + + print(f"Wrote {len(ds)} examples to {output_path}") + return len(ds) diff --git a/nemo_skills/dataset/hotpotqa_closedbook/__init__.py b/nemo_skills/dataset/hotpotqa_closedbook/__init__.py new file mode 100644 index 0000000000..c35667173e --- /dev/null +++ b/nemo_skills/dataset/hotpotqa_closedbook/__init__.py @@ -0,0 +1,21 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. 
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Closed-book variant of HotpotQA: same questions, no context provided.
+# Reuses the hotpotqa validation data (copied) with a different prompt
+# and answer-only metrics.
+
+METRICS_TYPE = "hotpotqa_closedbook"
+GENERATION_ARGS = "++prompt_config=eval/hotpotqa_closedbook"
+EVAL_SPLIT = "validation"
diff --git a/nemo_skills/dataset/hotpotqa_closedbook/prepare.py b/nemo_skills/dataset/hotpotqa_closedbook/prepare.py
new file mode 100644
index 0000000000..933a3d79b3
--- /dev/null
+++ b/nemo_skills/dataset/hotpotqa_closedbook/prepare.py
@@ -0,0 +1,42 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Closed-book variant uses the same validation data as hotpotqa (distractor setting).
+# We reuse that file so there is only one real data preparation (in hotpotqa).
+ +import shutil +import sys +from pathlib import Path + +# Reuse the shared preparation so we don't require hotpotqa to be prepared first. +from nemo_skills.dataset.hotpotqa.prepare_utils import prepare_validation + +if __name__ == "__main__": + data_dir = Path(__file__).absolute().parent + data_dir.mkdir(exist_ok=True) + output_file = data_dir / "validation.jsonl" + + hotpotqa_source = data_dir.parent / "hotpotqa" / "validation.jsonl" + + if hotpotqa_source.exists(): + shutil.copy2(hotpotqa_source, output_file) + print(f"Copied {hotpotqa_source} -> {output_file}") + else: + # Same data; run shared preparation for hotpotqa then copy here. + prepare_validation(hotpotqa_source) + if not hotpotqa_source.exists(): + print("Preparation did not create the expected file.", file=sys.stderr) + sys.exit(1) + shutil.copy2(hotpotqa_source, output_file) + print(f"Copied {hotpotqa_source} -> {output_file}") diff --git a/nemo_skills/evaluation/metrics/base.py b/nemo_skills/evaluation/metrics/base.py index 6422749def..fe6c0ee881 100644 --- a/nemo_skills/evaluation/metrics/base.py +++ b/nemo_skills/evaluation/metrics/base.py @@ -447,8 +447,7 @@ def as_int(metric_key: str, metric_value: float, all_metrics: dict): def as_float(metric_key: str, metric_value: float, all_metrics: dict): - if (metric_std := all_metrics.get(f"{metric_key}_statistics", {}).get("std_dev_across_runs")) is not None: - return f"{float(metric_value):.2f} ± {metric_std:.2f}" + """Format float for display (for real floats that are not scaled as percentages).""" return f"{float(metric_value):.2f}" diff --git a/nemo_skills/evaluation/metrics/hotpotqa_filtering.py b/nemo_skills/evaluation/metrics/hotpotqa_filtering.py new file mode 100644 index 0000000000..c03483f5b8 --- /dev/null +++ b/nemo_skills/evaluation/metrics/hotpotqa_filtering.py @@ -0,0 +1,287 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# HotpotQA answer normalization and filtering. +# Adapted from hmaron/nvidia-research-tlv-nemotron-hallucination-detection, +# experiments/hotpotqa_filtering/eval_package/hotpotqa_eval.py +# +# Generates alternative surface forms of ground-truth answers and flags +# questions that are unreliable for substring-based evaluation. + +import re + +__all__ = [ + "is_correct", + "is_correct_strict", + "normalize_gt", +] + +_NUM_WORDS = { + "one": "1", + "two": "2", + "three": "3", + "four": "4", + "five": "5", + "six": "6", + "seven": "7", + "eight": "8", + "nine": "9", + "ten": "10", + "eleven": "11", + "twelve": "12", + "thirteen": "13", + "fourteen": "14", + "fifteen": "15", + "sixteen": "16", + "seventeen": "17", + "eighteen": "18", + "nineteen": "19", + "twenty": "20", + "first": "1st", + "second": "2nd", + "third": "3rd", + "fourth": "4th", + "fifth": "5th", + "nineteenth": "19th", + "twentieth": "20th", + "twenty-first": "21st", +} +_NUM_DIGITS = {v: k for k, v in _NUM_WORDS.items()} + +_MAX_GT_LENGTH = 40 +_MIN_ALT_LENGTH = 3 + +_STOPWORDS = frozenset( + [ + "the", + "a", + "an", + "of", + "in", + "on", + "at", + "for", + "and", + "or", + "to", + "by", + "is", + "was", + "are", + "were", + "be", + "been", + "with", + "from", + "that", + "this", + "it", + "its", + "his", + "her", + "my", + "our", + "their", + "no", + "not", + "but", + "if", + "as", + "into", + "about", + "than", + "then", + ] +) + + +def 
_normalize_unicode(s: str) -> str: + """Normalize unicode whitespace, hyphens, and quotes for matching.""" + for c in "\u202f\u00a0\u2009\u200a\u2002\u2003": + s = s.replace(c, " ") + for c in "\u2010\u2011\u2012\u2013\u2014\u2015": + s = s.replace(c, "-") + s = s.replace("\u2019", "'").replace("\u2018", "'") + s = s.replace("\u201c", '"').replace("\u201d", '"') + while " " in s: + s = s.replace(" ", " ") + return s.strip() + + +def _gt_alternatives(gt: str) -> tuple[list[str], list[str]]: + """Generate valid surface-form alternatives for a ground-truth answer. + + Returns (sorted_alternatives, list_of_rule_tags_that_fired). + """ + alts = {gt} + rules = [] + + for prefix in ("the ", "a ", "an "): + if gt.lower().startswith(prefix): + alts.add(gt[len(prefix) :]) + rules.append("strip_article") + break + + stripped = gt.replace('"', "").replace("\u201c", "").replace("\u201d", "").strip() + if stripped and stripped != gt: + alts.add(stripped) + rules.append("strip_quotes") + + if "(" in gt: + no_parens = re.sub(r"\s*\([^)]*\)\s*", " ", gt).strip() + no_parens = re.sub(r"\s+", " ", no_parens) + if no_parens and no_parens != gt: + alts.add(no_parens) + for inner in re.findall(r"\(([^)]+)\)", gt): + inner = inner.strip() + if len(inner) > 1: + alts.add(inner) + rules.append("normalize_parens") + + gt_low = gt.lower().strip() + if gt_low in _NUM_WORDS: + alts.add(_NUM_WORDS[gt_low]) + rules.append("number_word_to_digit") + if gt_low in _NUM_DIGITS: + alts.add(_NUM_DIGITS[gt_low]) + rules.append("number_digit_to_word") + + no_commas = re.sub(r"(\d),(\d{3})", r"\1\2", gt) + while no_commas != gt and re.search(r"(\d),(\d{3})", no_commas): + no_commas = re.sub(r"(\d),(\d{3})", r"\1\2", no_commas) + if no_commas != gt: + alts.add(no_commas) + rules.append("strip_number_commas") + + if gt and gt[-1] in ".,;:!?": + alts.add(gt[:-1].rstrip()) + rules.append("strip_trailing_punct") + + if "." 
in gt: + no_dots = re.sub(r"(?<!\d)\.(?!\d)", "", gt) + if no_dots and len(no_dots) > 1 and no_dots != gt: + alts.add(no_dots) + rules.append("strip_abbrev_dots") + + if "-" in gt and not gt.startswith("-"): + no_hyphen = re.sub(r"\s+", " ", gt.replace("-", " ")).strip() + if no_hyphen != gt: + alts.add(no_hyphen) + rules.append("hyphen_to_space") + + if " & " in gt: + alts.add(gt.replace(" & ", " and ")) + rules.append("ampersand_to_and") + if " and " in gt.lower(): + idx = gt.lower().index(" and ") + alts.add(gt[:idx] + " & " + gt[idx + 5 :]) + rules.append("and_to_ampersand") + + extra = set() + for alt in list(alts): + for prefix in ("the ", "a ", "an "): + if alt.lower().startswith(prefix): + extra.add(alt[len(prefix) :]) + alts |= extra + + normed = set() + for a in alts: + a = re.sub(r"\s+", " ", a.strip()) + if a and (len(a) >= _MIN_ALT_LENGTH or a == gt.strip() or a.isdigit()): + normed.add(a) + + return sorted(normed), rules + + +def _is_multi_word_name(gt: str) -> bool: + """True if GT looks like a multi-word proper name unreliable for substring matching.""" + parts = gt.strip().rstrip(".").split() + n = len(parts) + if n in (3, 4): + return all(p[0].isupper() for p in parts) and all(p.lower() not in _STOPWORDS for p in parts) + if n in (5, 6): + caps = [p for p in parts if p[0].isupper() and p.lower() not in _STOPWORDS] + return len(caps) >= 3 + return False + + +def _should_remove(gt: str) -> tuple[bool, str]: + """Return (flag, reason). Reason is '' if not removed.""" + if len(gt) > _MAX_GT_LENGTH: + return True, "gt_too_long" + if _is_multi_word_name(gt): + return True, "multi_word_name" + return False, "" + + +def normalize_gt(gt_answer: str) -> dict: + """Normalize a single ground-truth answer on-the-fly. + + Returns dict with keys: + alternatives (list[str]): Valid surface forms (always includes original). + should_remove (bool): True if unreliable for substring eval. + remove_reason (str): '' | 'gt_too_long' | 'multi_word_name'. 
+ edited (bool): True if any rule fired. + edit_reasons (list[str]): Tags of rules that fired. + """ + alts, alt_rules = _gt_alternatives(gt_answer) + remove, remove_reason = _should_remove(gt_answer) + edit_reasons = list(alt_rules) + if remove_reason: + edit_reasons.append(remove_reason) + return { + "alternatives": alts, + "should_remove": remove, + "remove_reason": remove_reason, + "edited": bool(edit_reasons), + "edit_reasons": edit_reasons, + } + + +def is_correct(alternatives: list[str], model_answer: str) -> bool: + """Check if any alternative is a substring of the model answer. + + Args: + alternatives: List from normalize_gt()['alternatives']. + model_answer: The model's predicted answer string. + """ + ans = _normalize_unicode(model_answer.lower()) + return any(_normalize_unicode(alt.lower()) in ans for alt in alternatives) + + +def is_correct_strict(alternatives: list[str], model_answer: str) -> bool: + """Stricter matching that reduces false positives. + + Additional gates over is_correct(): + - Short alternatives (<=4 chars): require word-boundary match + - Long model answers (>80 chars): reject if match starts after position 40 + """ + ans = _normalize_unicode(model_answer.lower()) + ans_len = len(ans) + + for alt in alternatives: + alt_norm = _normalize_unicode(alt.lower()) + if not alt_norm: + continue + if alt_norm not in ans: + continue + if len(alt_norm) <= 4: + if not re.search(r"(?<!\w)" + re.escape(alt_norm) + r"(?!\w)", ans): + continue + if ans_len > 80: + pos = ans.find(alt_norm) + if pos > 40: + continue + return True + return False diff --git a/nemo_skills/evaluation/metrics/hotpotqa_metrics.py b/nemo_skills/evaluation/metrics/hotpotqa_metrics.py new file mode 100644 index 0000000000..5fedcee030 --- /dev/null +++ b/nemo_skills/evaluation/metrics/hotpotqa_metrics.py @@ -0,0 +1,320 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Metrics for HotpotQA multi-hop question answering. +# Answer normalization and scoring logic faithfully adapted from the official +# evaluation script: https://github.com/hotpotqa/hotpot/blob/master/hotpot_evaluate_v1.py +# +# On-the-fly filtering adapted from hmaron/nvidia-research-tlv-nemotron-hallucination-detection. +# Reports both unfiltered (all questions) and filtered (excluding unreliable questions) +# metrics in the output. + +import json +import re +import string +from collections import Counter, defaultdict + +from nemo_skills.evaluation.metrics.base import BaseMetrics, as_percentage +from nemo_skills.evaluation.metrics.hotpotqa_filtering import ( + is_correct, + is_correct_strict, + normalize_gt, +) + + +def normalize_answer(s: str) -> str: + """Normalize answer string (official HotpotQA / SQuAD normalization).""" + + def remove_articles(text): + return re.sub(r"\b(a|an|the)\b", " ", text) + + def white_space_fix(text): + return " ".join(text.split()) + + def remove_punc(text): + exclude = set(string.punctuation) + return "".join(ch for ch in text if ch not in exclude) + + def lower(text): + return text.lower() + + return white_space_fix(remove_articles(remove_punc(lower(s)))) + + +def answer_f1_score(prediction: str, ground_truth: str) -> tuple[float, float, float]: + """Compute token-overlap F1, precision, and recall between prediction and ground truth. + + Returns (f1, precision, recall). 
Special-cases yes/no/noanswer tokens. + """ + normalized_prediction = normalize_answer(prediction) + normalized_ground_truth = normalize_answer(ground_truth) + + ZERO_METRIC = (0.0, 0.0, 0.0) + + if normalized_prediction in ["yes", "no", "noanswer"] and normalized_prediction != normalized_ground_truth: + return ZERO_METRIC + if normalized_ground_truth in ["yes", "no", "noanswer"] and normalized_prediction != normalized_ground_truth: + return ZERO_METRIC + + prediction_tokens = normalized_prediction.split() + ground_truth_tokens = normalized_ground_truth.split() + common = Counter(prediction_tokens) & Counter(ground_truth_tokens) + num_same = sum(common.values()) + if num_same == 0: + return ZERO_METRIC + precision = 1.0 * num_same / len(prediction_tokens) + recall = 1.0 * num_same / len(ground_truth_tokens) + f1 = (2 * precision * recall) / (precision + recall) + return f1, precision, recall + + +def answer_exact_match(prediction: str, ground_truth: str) -> float: + """Return 1.0 if normalized prediction matches normalized ground truth, else 0.0.""" + return float(normalize_answer(prediction) == normalize_answer(ground_truth)) + + +def sp_scores(prediction: list, gold: list) -> tuple[float, float, float, float]: + """Compute supporting facts EM, F1, precision, recall. + + Both prediction and gold are lists of [title, sent_id] pairs. + Returns (em, f1, precision, recall). 
+ """ + cur_sp_pred = set(map(tuple, prediction)) + gold_sp_set = set(map(tuple, gold)) + + tp, fp, fn = 0, 0, 0 + for e in cur_sp_pred: + if e in gold_sp_set: + tp += 1 + else: + fp += 1 + for e in gold_sp_set: + if e not in cur_sp_pred: + fn += 1 + + precision = 1.0 * tp / (tp + fp) if tp + fp > 0 else 0.0 + recall = 1.0 * tp / (tp + fn) if tp + fn > 0 else 0.0 + f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0 + em = 1.0 if fp + fn == 0 else 0.0 + return em, f1, precision, recall + + +def _try_parse_answer_json(text: str) -> tuple[str, list] | None: + """Try to parse a JSON string as a HotpotQA answer object. Returns (answer, sp) or None.""" + try: + parsed = json.loads(text) + if not isinstance(parsed, dict) or "answer" not in parsed: + return None + answer = str(parsed["answer"]) + sp = parsed.get("supporting_facts", []) + if isinstance(sp, list): + valid_sp = [] + for item in sp: + if isinstance(item, (list, tuple)) and len(item) == 2: + try: + valid_sp.append([str(item[0]), int(item[1])]) + except (ValueError, TypeError): + continue + return answer, valid_sp + return answer, [] + except (json.JSONDecodeError, ValueError, TypeError): + return None + + +def _extract_json_candidates(text: str) -> list[str]: + """Extract all brace-delimited JSON candidate strings from text, ordered by position.""" + candidates = [] + i = 0 + while i < len(text): + if text[i] == "{": + depth = 0 + for j in range(i, len(text)): + if text[j] == "{": + depth += 1 + elif text[j] == "}": + depth -= 1 + if depth == 0: + candidates.append(text[i : j + 1]) + i = j + 1 + break + else: + break + else: + i += 1 + return candidates + + +def parse_generation(generation: str) -> tuple[str, list]: + """Parse the model generation to extract the predicted answer and supporting facts. + + Searches for JSON objects containing an "answer" key. 
When reasoning precedes + the JSON output, the *last* valid JSON object is used (the model is prompted + to end its response with the JSON). + + Returns (answer_string, supporting_facts_list). + """ + if not generation: + return "", [] + + text = generation.strip() + + md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)) + for md_match in reversed(md_matches): + result = _try_parse_answer_json(md_match.group(1)) + if result is not None: + return result + + candidates = _extract_json_candidates(text) + for candidate in reversed(candidates): + result = _try_parse_answer_json(candidate) + if result is not None: + return result + + return text, [] + + +class HotpotQAMetrics(BaseMetrics): + """Metrics for HotpotQA multi-hop question answering. + + Computes three groups of metrics following the official evaluation script: + - Answer: EM and F1 (token-overlap with SQuAD-style normalization) + - Supporting facts (SP): EM and F1 (set-based over (title, sent_id) tuples) + - Joint: product of answer and SP precision/recall, with derived EM and F1 + + Also computes alternative-aware substring matching (is_correct / is_correct_strict) + and reports both unfiltered (all questions) and filtered (excluding unreliable + questions flagged by normalize_gt) metrics. + + When ``closed_book=True``, only answer-level metrics are computed (no SP or + joint metrics), since no supporting context is provided to the model. 
+ """ + + def __init__(self, compute_no_answer: bool = False, closed_book: bool = False): + self.closed_book = closed_book + super().__init__(compute_no_answer=compute_no_answer) + + def reset(self): + """Reset all counters including filtered-metric accumulators.""" + super().reset() + self.filtered_total = 0 + self.filtered_eval_dict = defaultdict(lambda: defaultdict(float)) + self._current_should_remove = False + + def _get_score_dict(self, prediction: dict) -> dict[str, float]: + """Compute answer, SP, joint, and alternative-match scores for one prediction.""" + generation = prediction["generation"] + expected_answer = prediction["expected_answer"] + + pred_answer, pred_sp = parse_generation(generation) + + ans_em = answer_exact_match(pred_answer, expected_answer) + ans_f1, ans_prec, ans_recall = answer_f1_score(pred_answer, expected_answer) + + gt_info = normalize_gt(expected_answer) + alternatives = gt_info["alternatives"] + alt_correct = float(is_correct(alternatives, pred_answer)) + alt_correct_strict = float(is_correct_strict(alternatives, pred_answer)) + + scores = { + "answer_em": ans_em, + "answer_f1": ans_f1, + "is_correct": alt_correct, + "is_correct_strict": alt_correct_strict, + } + + if not self.closed_book: + gold_sp = prediction["supporting_facts"] + sp_em, sp_f1, sp_prec, sp_recall = sp_scores(pred_sp, gold_sp) + + joint_prec = ans_prec * sp_prec + joint_recall = ans_recall * sp_recall + joint_f1 = ( + 2 * joint_prec * joint_recall / (joint_prec + joint_recall) if joint_prec + joint_recall > 0 else 0.0 + ) + joint_em = ans_em * sp_em + + scores["sp_em"] = sp_em + scores["sp_f1"] = sp_f1 + scores["joint_em"] = joint_em + scores["joint_f1"] = joint_f1 + + return scores + + def _update_score_metrics_for_pass( + self, + eval_dict, + k, + score_method, + score_dicts, + pass_score, + predictions, + predicted_answers, + ): + """Accumulate filtered metrics alongside the standard pass@k computation.""" + if self._current_should_remove: + return + + 
scores_list = [d[score_method] for d in score_dicts] + self.filtered_eval_dict[f"pass@{k}"][score_method] += pass_score + self.filtered_eval_dict[f"pass@1[avg-of-{k}]"][score_method] += sum(scores_list[:k]) / k + + def update(self, predictions): + """Update metrics with a batch of predictions for one question.""" + expected_answer = predictions[0]["expected_answer"] + gt_info = normalize_gt(expected_answer) + self._current_should_remove = gt_info["should_remove"] + + if not gt_info["should_remove"]: + self.filtered_total += 1 + + super().update(predictions) + self._compute_pass_at_k(predictions=predictions) + + def get_metrics(self): + """Return metrics dict with both unfiltered and filtered evaluation modes.""" + metrics = super().get_metrics() + + if self.filtered_total > 0: + for agg_mode, agg_dict in self.filtered_eval_dict.items(): + filtered_key = f"filtered_{agg_mode}" + metrics[filtered_key] = {"num_entries": self.filtered_total} + for metric_key, metric_value in agg_dict.items(): + if isinstance(metric_value, float): + metrics[filtered_key][metric_key] = 100.0 * metric_value / self.filtered_total + else: + metrics[filtered_key][metric_key] = metric_value + + return metrics + + def evaluations_to_print(self): + """Include filtered evaluation modes alongside the standard ones.""" + base = super().evaluations_to_print() + filtered = [f"filtered_{mode}" for mode in base] + return base + filtered + + def metrics_to_print(self): + """Return the ordered metric columns for the results table.""" + m = { + "num_entries": lambda key, val, metrics: str(val), + "answer_em": as_percentage, + "answer_f1": as_percentage, + } + if not self.closed_book: + m["sp_em"] = as_percentage + m["sp_f1"] = as_percentage + m["joint_em"] = as_percentage + m["joint_f1"] = as_percentage + m["is_correct"] = as_percentage + m["is_correct_strict"] = as_percentage + return m diff --git a/nemo_skills/evaluation/metrics/map_metrics.py b/nemo_skills/evaluation/metrics/map_metrics.py index 
92f9f3282c..a10108ac4e 100644 --- a/nemo_skills/evaluation/metrics/map_metrics.py +++ b/nemo_skills/evaluation/metrics/map_metrics.py @@ -32,6 +32,7 @@ from nemo_skills.evaluation.metrics.critpt_metrics import CritPtMetrics from nemo_skills.evaluation.metrics.gradingbench_metrics import GradingBenchMetrics from nemo_skills.evaluation.metrics.hleaa_metrics import HLEAAMetrics +from nemo_skills.evaluation.metrics.hotpotqa_metrics import HotpotQAMetrics from nemo_skills.evaluation.metrics.icpc_metrics import ICPCMetrics from nemo_skills.evaluation.metrics.if_metrics import IFMetrics from nemo_skills.evaluation.metrics.ioi_metrics import IOIMetrics @@ -88,6 +89,8 @@ "gradingbench": GradingBenchMetrics, "critpt": CritPtMetrics, "specdec": SpecdecMetrics, + "hotpotqa": HotpotQAMetrics, + "hotpotqa_closedbook": functools.partial(HotpotQAMetrics, closed_book=True), } diff --git a/nemo_skills/prompt/config/eval/hotpotqa.yaml b/nemo_skills/prompt/config/eval/hotpotqa.yaml new file mode 100644 index 0000000000..03aa87f69f --- /dev/null +++ b/nemo_skills/prompt/config/eval/hotpotqa.yaml @@ -0,0 +1,12 @@ +user: |- + Answer the following question based on the provided context. You must also identify the supporting facts: the specific sentences that are necessary to answer the question. + + Context: + {context} + + Question: {question} + + Provide your response in the following JSON format: + {{"answer": "<your answer>", "supporting_facts": [["<paragraph title>", <sentence index>], ...]}} + + Output only the JSON object, nothing else. diff --git a/nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml b/nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml new file mode 100644 index 0000000000..59b24ed3b7 --- /dev/null +++ b/nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml @@ -0,0 +1,9 @@ +user: |- + Answer the following question using your own knowledge. No external context is provided. 
+ + Question: {question} + + Provide your response in the following JSON format: + {{"answer": "<your answer>"}} + + Output only the JSON object, nothing else.