Add HotpotQA multi-hop QA benchmark #1292

Merged
Kipok merged 11 commits into main from mfathi/hotpotqa
Mar 12, 2026

Conversation


@MahanFathi MahanFathi commented Mar 4, 2026

Adds evaluation support for HotpotQA multi-hop question answering, with two variants:

  • Distractor (hotpotqa): model receives 10 context paragraphs (2 gold + 8 distractors) and must return the answer and identify supporting-fact sentences
  • Closed-book (hotpotqa_closedbook): same questions, no context provided

What's included:

  • Dataset preparation scripts pulling from HuggingFace (hotpotqa/hotpot_qa, distractor split, 7405 validation examples)
  • Metrics following the official evaluation script: Answer EM/F1, Supporting-facts EM/F1, Joint EM/F1
  • Alternative-aware substring matching with on-the-fly filtering of unreliable questions
  • JSON parser for extracting structured answers from model generations (handles reasoning before JSON)
  • Prompt configs for both variants
  • Documentation with example commands and reference results from Nemotron-3-Nano

Tested end-to-end on cluster with Nemotron-3-Nano, gpt-oss-20b, and gpt-oss-120b.
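The structured-answer parsing mentioned above (a JSON object possibly preceded by free-form reasoning) can be sketched along these lines. This is an illustrative re-implementation, not the PR's actual parser; the function name `extract_answer_json` is made up, and it follows the review's suggestion of preferring the last valid fenced JSON block:

```python
import json
import re


def extract_answer_json(text: str):
    """Return the last parseable JSON object containing an "answer" key, or None.

    Models often emit free-form reasoning before the final answer, so fenced
    JSON blocks are scanned from the end of the generation backwards.
    """
    matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
    for match in reversed(matches):
        try:
            parsed = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks, keep scanning earlier ones
        if isinstance(parsed, dict) and "answer" in parsed:
            return parsed
    return None


generation = (
    "Let me reason about the paragraphs first...\n"
    '```json\n{"answer": "yes", "supporting_facts": [["Some Title", 0]]}\n```'
)
print(extract_answer_json(generation)["answer"])  # -> yes
```

Scanning in reverse matters because a model may draft a tentative JSON object mid-reasoning before emitting its final answer.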

Summary by CodeRabbit

  • New Features

    • Added HotpotQA benchmark support (Distractor and Closed‑book): shared data‑prep utilities and per‑variant prepare scripts, evaluation metrics, and metric registrations.
  • Evaluation

    • New answer normalization, flexible and strict matching modes, supporting‑fact, answer, and joint scoring; closed‑book mode omits supporting‑fact/joint metrics.
  • Documentation

    • Comprehensive HotpotQA guide with setup, unified data‑prep and evaluation commands, result verification, and example outputs.
  • Prompts

    • Added evaluation prompt templates for both distractor and closed‑book modes.


coderabbitai bot commented Mar 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds HotpotQA support: documentation, dataset preparation for distractor and closed‑book variants, prompt configs, HotpotQA-specific filtering and full evaluation metrics, and registry entries; includes new preparation utilities and evaluation scripts.

Changes

Cohort / File(s) Summary
Documentation
docs/evaluation/other-benchmarks.md
Added HotpotQA section describing Distractor and Closed‑book variants, data preparation and evaluation commands, verification paths, and example outputs.
HotpotQA Dataset
nemo_skills/dataset/hotpotqa/__init__.py, nemo_skills/dataset/hotpotqa/prepare.py, nemo_skills/dataset/hotpotqa/prepare_utils.py
Added module constants and license header, a small prepare launcher, and utilities to download/format the HotpotQA distractor validation split into NeMo‑Skills JSONL (formatted context, supporting_facts).
HotpotQA Closed‑Book Dataset
nemo_skills/dataset/hotpotqa_closedbook/__init__.py, nemo_skills/dataset/hotpotqa_closedbook/prepare.py
Added closed‑book constants and a prepare script that reuses/copies the HotpotQA validation JSONL for closed‑book evaluation.
Filtering Logic
nemo_skills/evaluation/metrics/hotpotqa_filtering.py
New normalization and surface‑form alternative generation for ground truths, heuristics to mark unreliable GTs, and substring/strict matching helpers (normalize_gt, is_correct, is_correct_strict).
HotpotQA Metrics
nemo_skills/evaluation/metrics/hotpotqa_metrics.py
New HotpotQAMetrics class and utilities: answer/SP normalization and scoring (EM/F1/precision/recall), JSON extraction/parsing from model outputs, parse_generation, pass@k accumulation, closed‑book handling, and filtered vs. unfiltered reporting.
Metrics Registry
nemo_skills/evaluation/metrics/map_metrics.py
Imported HotpotQAMetrics and registered "hotpotqa" and "hotpotqa_closedbook" (closed‑book via partial with closed_book=True).
Prompt Configs
nemo_skills/prompt/config/eval/hotpotqa.yaml, nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
Added strict JSON‑output prompt configs: one requiring {"answer","supporting_facts"} for distractor mode and one requiring {"answer"} for closed‑book mode.

Sequence Diagram

sequenceDiagram
    actor User
    participant DataPrep as Data Preparation
    participant Model as Model Generation
    participant Metrics as Metrics Computation
    participant Results as Results Output

    User->>DataPrep: run prepare (distractor or closed-book)
    activate DataPrep
    DataPrep->>DataPrep: download/format validation.jsonl
    DataPrep-->>User: validation.jsonl
    deactivate DataPrep

    User->>Model: run evaluation (prompt config + validation.jsonl)
    activate Model
    Model->>Model: generate JSON outputs (answer[, supporting_facts])
    Model-->>Metrics: send predictions
    deactivate Model

    activate Metrics
    Metrics->>Metrics: extract answer & supporting_facts
    Metrics->>Metrics: normalize GTs (hotpotqa_filtering)
    Metrics->>Metrics: compute EM/F1, SP, joint, pass@k (respect closed-book)
    Metrics-->>Results: aggregated metrics (filtered & unfiltered)
    deactivate Metrics

    Results->>User: display/save evaluation results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

enhancement

Suggested reviewers

  • Kipok
  • ekmb
🚥 Pre-merge checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Add HotpotQA multi-hop QA benchmark' accurately and specifically describes the main change: introducing support for the HotpotQA multi-hop question answering benchmark with both distractor and closed-book variants.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 84.85%, above the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/other-benchmarks.md`:
- Line 140: Replace the non-descriptive link text "here" that points to
https://github.com/hotpotqa/hotpot with a descriptive label; for example change
the line "Original benchmark source is
[here](https://github.com/hotpotqa/hotpot)." to something like "Original
benchmark source: HotpotQA (https://github.com/hotpotqa/hotpot)" or "Original
benchmark source: HotpotQA repository" so the link text is meaningful and
lint-friendly.
- Around line 193-211: The fenced result blocks for the hotpotqa and
hotpotqa_closedbook examples are missing a fence language and trigger
markdownlint MD040; update the opening fences for both triple-backtick blocks to
include the explicit language specifier `text` (i.e., change ``` to ```text for
the first hotpotqa block and the second hotpotqa_closedbook block) so both
fenced output blocks are language-tagged.

In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py`:
- Around line 64-68: The current write loop opens output_file and streams
entries which risks truncation on failure; instead, first materialize all
formatted entries by iterating over ds and calling format_entry to build a list
(or generator collected into memory/temporary buffer), then open a temporary
file (e.g., output_file.with_suffix(".tmp")) and write the collected entries
using the same json.dump + newline logic inside the tqdm loop, flush and fsync,
and finally atomically replace the original output_file with the temp
(rename/replace) so that writing is atomic and avoids partial overwrite if the
run fails mid-way.
- Line 30: The zip calls in prepare.py currently iterate with
zip(context["title"], context["sentences"]) (and the similar zip over
supporting_facts fields) which can silently truncate mismatched-length
iterables; change both occurrences to use zip(..., strict=True) so a ValueError
is raised on length mismatch and misaligned title/sentences pairs are caught
early (update the zip in the loop over context["title"]/context["sentences"] and
the zip over
context["supporting_facts"]["title"]/context["supporting_facts"]["sentences"]
accordingly).
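The staged, atomic write that the comment above asks for can be sketched as a generic pattern (the name `write_jsonl_atomic` and the sample entries are illustrative, not the repository's code):

```python
import json
import os
import tempfile
from pathlib import Path


def write_jsonl_atomic(entries, output_file: Path) -> None:
    """Materialize all rows first, then write to a temp file in the same
    directory and atomically replace the target, so a mid-run failure
    can never leave a truncated output file behind."""
    rows = [json.dumps(entry) for entry in entries]  # any failure happens here, not mid-write
    fd, tmp_path = tempfile.mkstemp(dir=output_file.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            fout.write("\n".join(rows) + "\n")
            fout.flush()
            os.fsync(fout.fileno())
        os.replace(tmp_path, output_file)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise


write_jsonl_atomic([{"question": "q1"}, {"question": "q2"}], Path("validation.jsonl"))
```

The temp file must live in the same directory as the target, since `os.replace` is only atomic within a single filesystem.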

In `@nemo_skills/dataset/hotpotqa/prepare.py`:
- Around line 64-68: The current loop writes each formatted entry directly to
output_file and can leave a truncated file if formatting fails mid-run; first
iterate ds and collect or stream-serialize all formatted entries by calling
format_entry(entry) and accumulating them (e.g., a list or generator
materialized into a list), then open a temporary file in the same directory (use
tempfile.NamedTemporaryFile or mkstemp) and write the JSONL lines (the json.dump
+ fout.write("\n") logic currently in the loop) to that temp file, close/flush
it, and finally atomically replace output_file with os.replace(temp_path,
output_file) so the write of output_file is atomic and cannot be left partially
written if format_entry raises. Ensure you reference the existing names: ds,
format_entry, tqdm, output_file and preserve the same encoding/"wt" write mode
when writing the temp file.
- Line 34: The two loops iterating with zip over context fields (the for loop
using zip(context["title"], context["sentences"]) and the second zip at the
other loop) must be made strict to prevent silent truncation when arrays
mismatch; update both zip(...) calls to zip(..., strict=True) so any length
mismatch raises immediately, preserving data integrity for the variables title,
sentences (and the other zipped variables).

In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py`:
- Around line 24-28: The exported names in the __all__ list are not sorted per
Ruff RUF022; reorder the entries in the __all__ list for
nemo_skills/evaluation/metrics/hotpotqa_filtering.py so they follow isort-style
alphabetical order (e.g., "is_correct", "is_correct_strict", "normalize_gt") to
avoid lint failures; update the __all__ variable accordingly.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py`:
- Around line 171-175: The loop over fenced JSON blocks currently returns the
first valid parse; change it to collect all matches (e.g., md_matches =
list(re.finditer(...))), iterate them in reverse (for md_match in
reversed(md_matches)) and call _try_parse_answer_json(md_match.group(1)) so the
last valid fenced JSON block is used; ensure you keep the same regex and only
adjust the iteration order/collection and return the first non-None result found
from the reversed list.
- Around line 116-120: The code currently uses dict.get(..., default) for
required fields which masks missing data; update _try_parse_answer_json() to
remove the redundant default for "answer" (since you already check for its
presence) and access parsed["answer"] directly, and replace
parsed.get("supporting_facts", []) with direct parsed["supporting_facts"] in
that function; similarly in _get_score_dict() replace use of generation and
expected_answer via .get(...) with direct indexing (generation =
record["generation"], expected_answer = record["expected_answer"]) and in
update() access record["expected_answer"] directly so missing keys raise
KeyError and surface bad records immediately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1116a35a-d1cd-4fdb-b5d6-32fc9efe9ba4

📥 Commits

Reviewing files that changed from the base of the PR and between 4618b19 and 7f2ce62.

📒 Files selected for processing (10)
  • docs/evaluation/other-benchmarks.md
  • nemo_skills/dataset/hotpotqa/__init__.py
  • nemo_skills/dataset/hotpotqa/prepare.py
  • nemo_skills/dataset/hotpotqa_closedbook/__init__.py
  • nemo_skills/dataset/hotpotqa_closedbook/prepare.py
  • nemo_skills/evaluation/metrics/hotpotqa_filtering.py
  • nemo_skills/evaluation/metrics/hotpotqa_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/prompt/config/eval/hotpotqa.yaml
  • nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml

@MahanFathi MahanFathi force-pushed the mfathi/hotpotqa branch 2 times, most recently from 6b1c487 to 4431549 Compare March 4, 2026 18:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (9)
docs/evaluation/other-benchmarks.md (2)

193-193: ⚠️ Potential issue | 🟡 Minor

Add fence languages to the result blocks.

Both output fences should be language-tagged (MD040).

Suggested fix
-```
+```text
 ----------------------------------------------------------------------------- hotpotqa -----------------------------------------------------------------------------
 ...

@@
-```
+```text
 ----------------------------------------- hotpotqa_closedbook ------------------------------------------
 ...

Also applies to: 204-204

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/other-benchmarks.md` at line 193, Update the two fenced
result blocks so they include a language tag (e.g., "text") after the opening
triple backticks; specifically add the language tag to the fence that contains
the "hotpotqa" result block and to the fence that contains the
"hotpotqa_closedbook" result block so both are ```text ... ``` to satisfy MD040.

140-140: ⚠️ Potential issue | 🟡 Minor

Use descriptive link text on Line 140.

The current “here” anchor is non-descriptive and triggers MD059.

Suggested fix
-- Original benchmark source is [here](https://github.com/hotpotqa/hotpot).
+- Original benchmark source is [HotpotQA benchmark repository](https://github.com/hotpotqa/hotpot).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/other-benchmarks.md` at line 140, Replace the non-descriptive
link anchor "here" with a descriptive label such as "HotpotQA repository" (or
similar meaningful text) in the markdown link to
https://github.com/hotpotqa/hotpot to resolve MD059; locate the current anchor
text "here" and update it to the new descriptive link text while keeping the URL
unchanged.
nemo_skills/evaluation/metrics/hotpotqa_metrics.py (2)

172-175: ⚠️ Potential issue | 🟡 Minor

Use the last valid fenced JSON block, not the first.

Line 172–175 currently returns the first valid fenced JSON, which conflicts with the parser contract in the docstring and can select draft output over the final answer.

Suggested fix
-    for md_match in re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL):
+    md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
+    for md_match in reversed(md_matches):
         result = _try_parse_answer_json(md_match.group(1))
         if result is not None:
             return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 172 - 175,
The current loop over re.finditer returns the first valid fenced JSON block but
the parser needs to prefer the last valid block; change the logic in the block
that uses re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL) to
collect matches into a list (e.g., md_matches = list(re.finditer(...))) and
iterate over them in reverse order, calling
_try_parse_answer_json(md_match.group(1)) for each and returning the first
non-None result found (thereby returning the last valid fenced JSON in the
original text).

117-120: ⚠️ Potential issue | 🟠 Major

Fail fast on required keys; avoid .get() defaults in required schema paths.

Using defaults here can silently score malformed records instead of surfacing schema issues.

Suggested fix
-        answer = str(parsed.get("answer", ""))
+        answer = str(parsed["answer"])
@@
-        generation = prediction.get("generation", "")
-        expected_answer = prediction.get("expected_answer", "")
+        generation = prediction["generation"]
+        expected_answer = prediction["expected_answer"]
@@
-            gold_sp = prediction.get("supporting_facts", [])
+            gold_sp = prediction["supporting_facts"]
@@
-        expected_answer = predictions[0].get("expected_answer", "")
+        expected_answer = predictions[0]["expected_answer"]

As per coding guidelines: “Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data”.

Also applies to: 215-217, 236-237, 273-274

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 117 - 120,
The code is silently defaulting required schema fields using parsed.get(...),
which can hide malformed records; replace uses of parsed.get("answer", ""),
parsed.get("supporting_facts", []), and other .get(...) accesses for required
keys with direct indexing (e.g., parsed["answer"], parsed["supporting_facts"])
so the code fails fast with a clear KeyError when required fields are missing;
apply the same change to the other occurrences noted (the later blocks around
the same variable names) ensuring you only use direct access for keys that are
validated as present earlier in the function.
nemo_skills/evaluation/metrics/hotpotqa_filtering.py (1)

24-28: ⚠️ Potential issue | 🟡 Minor

Sort __all__ to satisfy RUF022.

Please keep exported symbols in isort-style order to avoid lint drift.

Suggested fix
 __all__ = [
-    "normalize_gt",
     "is_correct",
     "is_correct_strict",
+    "normalize_gt",
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py` around lines 24 - 28,
The __all__ export list is not sorted; reorder the list in isort-style
(alphabetical) order so exported symbols are stable — update the __all__
assignment to ["is_correct", "is_correct_strict", "normalize_gt"] referencing
the existing symbol names normalize_gt, is_correct, and is_correct_strict to
satisfy RUF022.
nemo_skills/dataset/hotpotqa/prepare.py (2)

34-34: ⚠️ Potential issue | 🟠 Major

Use zip(..., strict=True) for alignment safety.

Line 34 and Line 44 can silently truncate mismatched arrays, causing undetected record corruption.

Suggested fix
-    for title, sentences in zip(context["title"], context["sentences"]):
+    for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )

Also applies to: 44-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa/prepare.py` at line 34, Replace the two uses of
zip over context lists so mismatched lengths raise errors: change the loops
using zip(context["title"], context["sentences"]) (and the similar occurrence at
the other location) to zip(context["title"], context["sentences"], strict=True)
so alignment mismatches are detected instead of silently truncating records;
update both occurrences where those variables are iterated.

64-68: ⚠️ Potential issue | 🟠 Major

Write validation.jsonl atomically to avoid truncated outputs.

Current flow rewrites the final file while formatting. A mid-run failure can leave a partial dataset.

Suggested fix
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+    tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+    with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_file.replace(output_file)

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa/prepare.py` around lines 64 - 68, The current
loop writes directly to output_file and can leave a partially-written file on
failure; instead collect or stream writes into a temporary file and atomically
replace output_file when complete. Modify the block that iterates over ds and
calls format_entry(entry) and json.dump by writing to a tempfile in the same
directory (e.g., using tempfile.NamedTemporaryFile or TemporaryDirectory) and,
after the loop finishes and the temp file is closed/flushed, call
os.replace(temp_path, output_file) to atomically rename; ensure you keep the
same encoding and newline behavior as the original and reference the existing
variables/functions format_entry, ds, output_file and the tqdm loop so the
change replaces only the write-out logic.
nemo_skills/dataset/hotpotqa_closedbook/prepare.py (2)

30-30: ⚠️ Potential issue | 🟠 Major

Use zip(..., strict=True) to prevent silent field truncation.

Line 30 and Line 40 can silently drop tail elements on length mismatches, which corrupts context/supporting-fact alignment.

Suggested fix
-    for title, sentences in zip(context["title"], context["sentences"]):
+    for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )

Also applies to: 40-40

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` at line 30, The for-loops
iterating with zip over context["title"] and context["sentences"] (and the
similar zip over context["supporting_facts"] / sentences) can silently truncate
mismatched lists; replace those zip calls with zip(..., strict=True) so a
ValueError is raised on length mismatch (locate the loops that use
zip(context["title"], context["sentences"]) and the other zip over
supporting_facts/sentences and add the strict=True argument).

64-68: ⚠️ Potential issue | 🟠 Major

Avoid partial overwrite by staging computation before writing.

If formatting fails mid-loop, validation.jsonl can be left truncated. Compute first, write to a temp file, then atomically replace.

Suggested fix
-        with open(output_file, "wt", encoding="utf-8") as fout:
-            for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-                formatted = format_entry(entry)
-                json.dump(formatted, fout)
-                fout.write("\n")
+        formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+        tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+        with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+            for formatted in formatted_entries:
+                json.dump(formatted, fout)
+                fout.write("\n")
+        tmp_output_file.replace(output_file)

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` around lines 64 - 68, The
loop that writes formatted entries directly to output_file can leave
validation.jsonl truncated if formatting fails; instead, in prepare.py compute
all formatted entries first (e.g., collect results from iterating ds and calling
format_entry(entry) with tqdm), write those results to a temporary file (use
tempfile.NamedTemporaryFile or a .tmp Path), and then atomically replace the
target file (use os.replace or Path.replace) to swap the temp into output_file;
update the block around the ds/tqdm loop and the json dumping logic to perform
staging then atomic rename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/other-benchmarks.md`:
- Around line 130-150: The CI is missing HotpotQA variants ('hotpotqa' and
'hotpotqa_closedbook') from default GPU and CPU benchmark runs; update the CI
workflow definitions (gpu_tests.yml and tests.yml) to add jobs or matrix entries
that run evaluation for these two slugs (ensure data prep step runs ns
prepare_data hotpotqa and ns prepare_data hotpotqa_closedbook) so the pipeline
exercises the HotpotQA metric logic (supporting-facts EM/F1, Joint EM/F1,
alternative-aware substring matching and filtering) during PR checks; ensure the
new CI steps invoke the same evaluation command used by existing benchmarks so
coverage and reporting match other datasets.

---

Duplicate comments:
In `@docs/evaluation/other-benchmarks.md`:
- Line 193: Update the two fenced result blocks so they include a language tag
(e.g., "text") after the opening triple backticks; specifically add the language
tag to the fence that contains the "hotpotqa" result block and to the fence that
contains the "hotpotqa_closedbook" result block so both are ```text ... ``` to
satisfy MD040.
- Line 140: Replace the non-descriptive link anchor "here" with a descriptive
label such as "HotpotQA repository" (or similar meaningful text) in the markdown
link to https://github.com/hotpotqa/hotpot to resolve MD059; locate the current
anchor text "here" and update it to the new descriptive link text while keeping
the URL unchanged.

In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py`:
- Line 30: The for-loops iterating with zip over context["title"] and
context["sentences"] (and the similar zip over context["supporting_facts"] /
sentences) can silently truncate mismatched lists; replace those zip calls with
zip(..., strict=True) so a ValueError is raised on length mismatch (locate the
loops that use zip(context["title"], context["sentences"]) and the other zip
over supporting_facts/sentences and add the strict=True argument).
- Around line 64-68: The loop that writes formatted entries directly to
output_file can leave validation.jsonl truncated if formatting fails; instead,
in prepare.py compute all formatted entries first (e.g., collect results from
iterating ds and calling format_entry(entry) with tqdm), write those results to
a temporary file (use tempfile.NamedTemporaryFile or a .tmp Path), and then
atomically replace the target file (use os.replace or Path.replace) to swap the
temp into output_file; update the block around the ds/tqdm loop and the json
dumping logic to perform staging then atomic rename.

In `@nemo_skills/dataset/hotpotqa/prepare.py`:
- Line 34: Replace the two uses of zip over context lists so mismatched lengths
raise errors: change the loops using zip(context["title"], context["sentences"])
(and the similar occurrence at the other location) to zip(context["title"],
context["sentences"], strict=True) so alignment mismatches are detected instead
of silently truncating records; update both occurrences where those variables
are iterated.
- Around line 64-68: The current loop writes directly to output_file and can
leave a partially-written file on failure; instead collect or stream writes into
a temporary file and atomically replace output_file when complete. Modify the
block that iterates over ds and calls format_entry(entry) and json.dump by
writing to a tempfile in the same directory (e.g., using
tempfile.NamedTemporaryFile or TemporaryDirectory) and, after the loop finishes
and the temp file is closed/flushed, call os.replace(temp_path, output_file) to
atomically rename; ensure you keep the same encoding and newline behavior as the
original and reference the existing variables/functions format_entry, ds,
output_file and the tqdm loop so the change replaces only the write-out logic.

In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py`:
- Around line 24-28: The __all__ export list is not sorted; reorder the list in
isort-style (alphabetical) order so exported symbols are stable — update the
__all__ assignment to ["is_correct", "is_correct_strict", "normalize_gt"]
referencing the existing symbol names normalize_gt, is_correct, and
is_correct_strict to satisfy RUF022.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py`:
- Around line 172-175: The current loop over re.finditer returns the first valid
fenced JSON block but the parser needs to prefer the last valid block; change
the logic in the block that uses re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```",
text, re.DOTALL) to collect matches into a list (e.g., md_matches =
list(re.finditer(...))) and iterate over them in reverse order, calling
_try_parse_answer_json(md_match.group(1)) for each and returning the first
non-None result found (thereby returning the last valid fenced JSON in the
original text).
- Around line 117-120: The code is silently defaulting required schema fields
using parsed.get(...), which can hide malformed records; replace uses of
parsed.get("answer", ""), parsed.get("supporting_facts", []), and other
.get(...) accesses for required keys with direct indexing (e.g.,
parsed["answer"], parsed["supporting_facts"]) so the code fails fast with a
clear KeyError when required fields are missing; apply the same change to the
other occurrences noted (the later blocks around the same variable names)
ensuring you only use direct access for keys that are validated as present
earlier in the function.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 55fb65a2-f629-43b5-8970-f2df118b5667

📥 Commits

Reviewing files that changed from the base of the PR and between 7f2ce62 and 6b1c487.

📒 Files selected for processing (10)
  • docs/evaluation/other-benchmarks.md
  • nemo_skills/dataset/hotpotqa/__init__.py
  • nemo_skills/dataset/hotpotqa/prepare.py
  • nemo_skills/dataset/hotpotqa_closedbook/__init__.py
  • nemo_skills/dataset/hotpotqa_closedbook/prepare.py
  • nemo_skills/evaluation/metrics/hotpotqa_filtering.py
  • nemo_skills/evaluation/metrics/hotpotqa_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/prompt/config/eval/hotpotqa.yaml
  • nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
  • nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
  • nemo_skills/dataset/hotpotqa_closedbook/__init__.py
  • nemo_skills/dataset/hotpotqa/__init__.py
  • nemo_skills/prompt/config/eval/hotpotqa.yaml

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (6)
nemo_skills/dataset/hotpotqa_closedbook/prepare.py (2)

64-68: ⚠️ Potential issue | 🟠 Major

Write atomically after formatting to prevent partial output corruption.

If formatting fails mid-run, validation.jsonl can be left truncated.

Suggested fix
-        with open(output_file, "wt", encoding="utf-8") as fout:
-            for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-                formatted = format_entry(entry)
-                json.dump(formatted, fout)
-                fout.write("\n")
+        formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+        tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+        with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+            for formatted in formatted_entries:
+                json.dump(formatted, fout)
+                fout.write("\n")
+        tmp_output_file.replace(output_file)

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` around lines 64 - 68,
Currently the code opens output_file for writing immediately and streams
formatted entries, which can leave validation.jsonl truncated if format_entry
fails; instead first iterate ds and collect formatted entries into a list
(calling format_entry for each and letting exceptions surface before opening the
file), then reopen output_file and write the collected formatted objects with
json.dump and fout.write("\n") (using the same tqdm/desc if desired) so the file
is only truncated/replaced after all formatting succeeds; reference
functions/vars: format_entry, ds, output_file, json.dump, tqdm.
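The staged-write pattern the guideline calls for can be sketched in isolation. `write_jsonl_atomically` and the sample records below are hypothetical stand-ins for the benchmark's `format_entry`/`output_file` flow, not the repo's actual helpers:

```python
import json
import os

def write_jsonl_atomically(entries, output_path):
    # All computation happens before the target file is touched: a failure
    # while formatting leaves any existing output_path intact.
    lines = [json.dumps(entry) for entry in entries]
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "wt", encoding="utf-8") as fout:
        for line in lines:
            fout.write(line + "\n")
    # Atomic replace: readers see either the old file or the complete new one.
    os.replace(tmp_path, output_path)
```

`os.replace` is atomic on POSIX when source and target live on the same filesystem, which is why the temp file is created next to the target rather than in a system temp directory.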

30-30: ⚠️ Potential issue | 🟠 Major

Add strict=True to both zip() calls.

Both loops can silently truncate on length mismatch, which can misalign context/supporting-fact fields.

Suggested fix
-    for title, sentences in zip(context["title"], context["sentences"]):
+    for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )

Also applies to: 40-40

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` at line 30, The two zip()
loops that iterate over context["title"] and context["sentences"] (and the other
zip that pairs the supporting-fact/title lists) can silently truncate on length
mismatch; update both zip() calls to use zip(..., strict=True) so mismatched
lengths raise an error and prevent misaligned context/supporting-fact fields,
locating the zip calls that unpack into variables like title, sentences and the
one that unpacks supporting-fact pairs and adding strict=True to each.
nemo_skills/dataset/hotpotqa/prepare.py (2)

34-34: ⚠️ Potential issue | 🟠 Major

Use strict=True in both zip() calls to guard schema alignment.

Without strict mode, mismatched lengths silently truncate and can corrupt row alignment.

Suggested fix
-    for title, sentences in zip(context["title"], context["sentences"]):
+    for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )

Also applies to: 44-44

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa/prepare.py` at line 34, The zip iteration over
context["title"] and context["sentences"] (and the other zip on lines 44) should
use zip(..., strict=True) to ensure the two sequences are the same length and
fail fast on schema misalignment; update the for loops that destructure (title,
sentences) and the second zip to pass strict=True so mismatched lengths raise a
ValueError instead of silently truncating.

64-68: ⚠️ Potential issue | 🟠 Major

Avoid partial overwrite of validation.jsonl on failure.

Current flow can leave a truncated target file if format_entry() raises mid-loop.

Suggested fix
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+    tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+    with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_file.replace(output_file)

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa/prepare.py` around lines 64 - 68, Do not
stream-write while calling format_entry on ds; instead fully compute/collect all
formatted entries first (e.g., iterate ds and call format_entry, storing results
in a list) and only after that open output_file for writing and dump each
precomputed formatted entry with json.dump and fout.write; alternatively write
to a temporary file and atomically rename into place—refer to format_entry, ds,
output_file, tqdm and the existing write loop (fout) to locate and change the
logic.
nemo_skills/evaluation/metrics/hotpotqa_metrics.py (2)

172-175: ⚠️ Potential issue | 🟡 Minor

Use the last valid fenced JSON block, not the first.

Current loop returns the first fenced JSON, which contradicts the parser contract in the docstring.

Suggested fix
-    for md_match in re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL):
+    md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
+    for md_match in reversed(md_matches):
         result = _try_parse_answer_json(md_match.group(1))
         if result is not None:
             return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 172 - 175,
The current loop in hotpotqa_metrics.py returns the first fenced JSON block
found, but the parser contract requires using the last valid fenced JSON; change
the logic in the block that calls re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```",
text, re.DOTALL) so you collect matches into a list (md_matches =
list(re.finditer(...))) and then iterate them in reverse (for md_match in
reversed(md_matches)): call _try_parse_answer_json(md_match.group(1)) and return
the first non-None result found, ensuring the last fenced JSON in the text is
preferred.
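A standalone sketch of the reversed-scan contract. `last_fenced_json` is a hypothetical helper, not the module's actual parser, and the `FENCE` indirection only avoids nesting a literal triple backtick inside this snippet:

```python
import json
import re

FENCE = "`" * 3  # avoids writing a literal triple backtick in this snippet
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, re.DOTALL)

def last_fenced_json(text):
    # Collect every fenced JSON block, then walk them in reverse so the
    # final block (the model's last answer) wins over earlier drafts.
    matches = list(FENCE_RE.finditer(text))
    for match in reversed(matches):
        try:
            parsed = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None

text = (
    f'Draft: {FENCE}json\n{{"answer": "draft"}}\n{FENCE}\n'
    f'Final: {FENCE}json\n{{"answer": "final"}}\n{FENCE}'
)
print(last_fenced_json(text)["answer"])  # final
```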

215-216: ⚠️ Potential issue | 🟠 Major

Use direct indexing for required prediction fields (fail fast on bad records).

These defaults can silently turn malformed inputs into incorrect metrics.

Suggested fix
-        generation = prediction.get("generation", "")
-        expected_answer = prediction.get("expected_answer", "")
+        generation = prediction["generation"]
+        expected_answer = prediction["expected_answer"]
@@
-            gold_sp = prediction.get("supporting_facts", [])
+            gold_sp = prediction["supporting_facts"]
@@
-        expected_answer = predictions[0].get("expected_answer", "")
+        expected_answer = predictions[0]["expected_answer"]

As per coding guidelines: “Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data”.

Also applies to: 236-236, 273-273

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 215 - 216,
Replace optional .get() usage with direct dictionary indexing for required
prediction fields so malformed records fail fast: change
prediction.get("generation", "") and prediction.get("expected_answer", "") to
prediction["generation"] and prediction["expected_answer"], and apply the same
replacement to the other prediction.get(...) occurrences noted (the other two
locations flagged). Ensure any field that your metric logic requires uses direct
indexing (prediction["..."]) so missing keys raise immediately instead of
silently using defaults.
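A toy illustration of the fail-fast convention; `answer_exact_match` and its simplified matching rule are hypothetical, and only the direct-indexing pattern mirrors the guideline:

```python
def answer_exact_match(prediction: dict) -> bool:
    # Required fields are indexed directly, per the review: a malformed
    # record raises KeyError instead of silently scoring as wrong.
    generation = prediction["generation"]
    expected_answer = prediction["expected_answer"]
    return generation.strip().lower() == expected_answer.strip().lower()

print(answer_exact_match({"generation": " Paris ", "expected_answer": "paris"}))  # True
try:
    answer_exact_match({"generation": "Paris"})
except KeyError as err:
    print("malformed record, missing", err)
```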
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 139606ad-3f62-4174-a35c-acbc6bf4d99f

📥 Commits

Reviewing files that changed from the base of the PR and between 6b1c487 and ae8e340.

📒 Files selected for processing (10)
  • docs/evaluation/other-benchmarks.md
  • nemo_skills/dataset/hotpotqa/__init__.py
  • nemo_skills/dataset/hotpotqa/prepare.py
  • nemo_skills/dataset/hotpotqa_closedbook/__init__.py
  • nemo_skills/dataset/hotpotqa_closedbook/prepare.py
  • nemo_skills/evaluation/metrics/hotpotqa_filtering.py
  • nemo_skills/evaluation/metrics/hotpotqa_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/prompt/config/eval/hotpotqa.yaml
  • nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/prompt/config/eval/hotpotqa.yaml
  • nemo_skills/dataset/hotpotqa/__init__.py
  • nemo_skills/dataset/hotpotqa_closedbook/__init__.py

@MahanFathi MahanFathi requested review from Kipok, abukharin3 and gnalbandyan and removed request for abukharin3 and arnavkomaragiri March 4, 2026 19:32
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
nemo_skills/evaluation/metrics/hotpotqa_metrics.py (2)

172-175: ⚠️ Potential issue | 🟡 Minor

Scan fenced JSON blocks from the end.

This branch still returns the first valid fenced block, which contradicts the parser contract and can lock onto an earlier draft answer when reasoning contains multiple JSON snippets.

Suggested fix
-    for md_match in re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL):
+    md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
+    for md_match in reversed(md_matches):
         result = _try_parse_answer_json(md_match.group(1))
         if result is not None:
             return result
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 172 - 175,
The loop that scans fenced JSON blocks uses re.finditer and returns the first
valid match, which can pick an earlier draft; change it to collect matches into
a list (e.g., md_matches = list(re.finditer(...))) and iterate that list in
reverse (for md_match in reversed(md_matches)) so you call
_try_parse_answer_json on the last fenced JSON blocks first and return the first
successful result.

117-120: ⚠️ Potential issue | 🟠 Major

Fail fast on required record fields.

The empty defaults here turn schema drift into silent zero-scores. These fields are required in this pipeline, so they should be indexed directly.

Suggested fix
-        answer = str(parsed.get("answer", ""))
+        answer = str(parsed["answer"])
@@
-        generation = prediction.get("generation", "")
-        expected_answer = prediction.get("expected_answer", "")
+        generation = prediction["generation"]
+        expected_answer = prediction["expected_answer"]
@@
-            gold_sp = prediction.get("supporting_facts", [])
+            gold_sp = prediction["supporting_facts"]
@@
-        expected_answer = predictions[0].get("expected_answer", "")
+        expected_answer = predictions[0]["expected_answer"]

As per coding guidelines: “Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data”.

Also applies to: 215-216, 236-237, 273-274

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 117 - 120,
The code currently uses parsed.get("answer", "") and
parsed.get("supporting_facts", []) which silently mask schema drift; change
these to direct indexing (parsed["answer"] and parsed["supporting_facts"]) and
ensure the initial guard verifies parsed is a dict and contains these keys so
the function fails fast; apply the same replacement for the other similar blocks
that access required fields (the other occurrences that use parsed.get for
"answer" and "supporting_facts") so missing keys raise clear errors instead of
producing silent defaults.
nemo_skills/dataset/hotpotqa/prepare_utils.py (1)

71-79: ⚠️ Potential issue | 🟠 Major

Write validation.jsonl atomically.

If format_entry() or json.dump() fails mid-loop, this overwrites the final file with a partial dataset. Stage the formatted rows first, write to a temp file, then replace the target.

Suggested fix
-    with open(output_path, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_path.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_path.name}")]
+    tmp_output_path = output_path.with_suffix(".jsonl.tmp")
+    with open(tmp_output_path, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_path.replace(output_path)

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa/prepare_utils.py` around lines 71 - 79, The
current loop writes directly to output_path and can leave a truncated file if
format_entry() or json.dump() fails; instead collect/serialize all formatted
rows first (e.g., build a list of formatted JSON strings via format_entry(entry)
and json.dumps) and then write them atomically: write to a temporary file in the
same directory (using output_path.parent) and when the write completes,
atomically replace output_path (e.g., os.replace). Update the block around
load_dataset(...), the for loop that calls format_entry and json.dump, and the
final print to use this temp-write-and-replace approach so output_path is only
replaced on successful completion.
🧹 Nitpick comments (1)
nemo_skills/dataset/hotpotqa/prepare.py (1)

15-24: Please wire this benchmark into CI/slurm coverage if that is not already in the PR.

This adds new dataset prep plus custom metrics, so a smoke path in default benchmark coverage will help catch regressions early.

Based on learnings: When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation; When adding new benchmarks, run GPU tests in CI locally, and ensure the dataset is included in default CI tests. Remove datasets from testing only if they require very heavy data preparation or have strong reasons.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/hotpotqa/prepare.py` around lines 15 - 24, Add HotpotQA
dataset prep into the CI/slurm benchmark coverage by registering the new prepare
script (prepare.py) and its prepare_validation call for the HotpotQA benchmark
so a smoke path runs in default benchmark tests and in slurm GPU jobs; update
the benchmark test matrix (the default benchmark coverage and slurm test
configs) to include the HotpotQA entry and its lightweight data prep step
(invoke prepare_validation) so custom metrics are exercised, and if the dataset
requires heavy prep, gate it behind a CI tag or skip condition to avoid slowing
all runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/hotpotqa/prepare_utils.py`:
- Line 40: Replace the two non-strict zip() uses with zip(..., strict=True) to
avoid silent truncation (e.g., the loop over context entries using for title,
sentences in zip(context["title"], context["sentences"]) and the loop pairing
supporting_facts title/sent_id), and change the write logic in the function (the
block that formats entries and writes to the output file) to first build a list
of all formatted entries in-memory and only open/write the file after the entire
list is successfully created so a mid-loop exception cannot leave a partially
written/truncated file.

In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py`:
- Around line 279-281: In the short-alternative branch (where len(alt_norm) <=
4) replace the ASCII-letter-only boundary lookarounds used in the re.search call
for alt_norm against ans (currently using (?<![a-z])... (?![a-z])) with
word-character boundaries like (?<!\w)...(?!\w) so numeric alternatives (e.g.,
"10") won't match inside longer numeric tokens (e.g., "210"); update the pattern
construction around alt_norm in that re.search invocation accordingly.
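A small sketch of the boundary change; `matches_short_alt` is a hypothetical stand-in for the filtering code's alternative check:

```python
import re

def matches_short_alt(alt_norm: str, ans: str) -> bool:
    # Word-character lookarounds (?<!\w) / (?!\w) stop a short numeric
    # alternative like "10" from matching inside a longer token like "210",
    # which the letter-only lookarounds (?<![a-z]) / (?![a-z]) would allow.
    pattern = r"(?<!\w)" + re.escape(alt_norm) + r"(?!\w)"
    return re.search(pattern, ans) is not None

print(matches_short_alt("10", "the answer is 10"))   # True
print(matches_short_alt("10", "the answer is 210"))  # False
```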

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3762d795-9036-4e31-9f19-56bea8038494

📥 Commits

Reviewing files that changed from the base of the PR and between ae8e340 and 46c4809.

📒 Files selected for processing (8)
  • docs/evaluation/other-benchmarks.md
  • nemo_skills/dataset/hotpotqa/__init__.py
  • nemo_skills/dataset/hotpotqa/prepare.py
  • nemo_skills/dataset/hotpotqa/prepare_utils.py
  • nemo_skills/dataset/hotpotqa_closedbook/__init__.py
  • nemo_skills/dataset/hotpotqa_closedbook/prepare.py
  • nemo_skills/evaluation/metrics/hotpotqa_filtering.py
  • nemo_skills/evaluation/metrics/hotpotqa_metrics.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • docs/evaluation/other-benchmarks.md
  • nemo_skills/dataset/hotpotqa_closedbook/__init__.py
  • nemo_skills/dataset/hotpotqa/__init__.py

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/hotpotqa_metrics.py (1)

113-132: Consider using early return for optional supporting_facts handling.

The .get("supporting_facts", []) on line 120 is appropriate here since supporting_facts is optional in the JSON schema. However, the static analysis hint about moving the final return to an else block (TRY300) has merit for readability.

♻️ Optional refactor for cleaner structure
 def _try_parse_answer_json(text: str) -> tuple[str, list] | None:
     """Try to parse a JSON string as a HotpotQA answer object. Returns (answer, sp) or None."""
     try:
         parsed = json.loads(text)
-        if not isinstance(parsed, dict) or "answer" not in parsed:
-            return None
-        answer = str(parsed["answer"])
-        sp = parsed.get("supporting_facts", [])
-        if isinstance(sp, list):
-            valid_sp = []
-            for item in sp:
-                if isinstance(item, (list, tuple)) and len(item) == 2:
-                    try:
-                        valid_sp.append([str(item[0]), int(item[1])])
-                    except (ValueError, TypeError):
-                        continue
-            return answer, valid_sp
-        return answer, []
     except (json.JSONDecodeError, ValueError, TypeError):
         return None
+    if not isinstance(parsed, dict) or "answer" not in parsed:
+        return None
+    answer = str(parsed["answer"])
+    sp = parsed.get("supporting_facts", [])
+    if isinstance(sp, list):
+        valid_sp = []
+        for item in sp:
+            if isinstance(item, (list, tuple)) and len(item) == 2:
+                try:
+                    valid_sp.append([str(item[0]), int(item[1])])
+                except (ValueError, TypeError):
+                    continue
+        return answer, valid_sp
+    return answer, []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 113 - 132,
The parsing function _try_parse_answer_json currently assigns sp =
parsed.get("supporting_facts", []) and then branches; refactor to use an early
return for the non-list case: after extracting answer, check if supporting_facts
is not a list (e.g., sp = parsed.get("supporting_facts"); if not isinstance(sp,
list): return answer, []), then proceed to validate and build valid_sp for the
list case and return answer, valid_sp; keep the JSON parse try/except and the
same validation logic inside the list-handling block so flow is clearer and the
final return is only used for the processed list result.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c0838c19-79d3-4b00-8aa5-4fb0481aa106

📥 Commits

Reviewing files that changed from the base of the PR and between 46c4809 and 3e45a4c.

📒 Files selected for processing (3)
  • nemo_skills/dataset/hotpotqa/prepare_utils.py
  • nemo_skills/evaluation/metrics/hotpotqa_filtering.py
  • nemo_skills/evaluation/metrics/hotpotqa_metrics.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/hotpotqa/prepare_utils.py

ka00ri and others added 8 commits March 11, 2026 14:37
Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Adds distractor and closed-book variants of HotpotQA with official
EM/F1 metrics, supporting-fact scoring, alternative-aware matching,
and on-the-fly filtering. Includes evaluation documentation with
reference results from Nemotron-3-Nano.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
- Add hotpotqa/prepare_utils.py as single source for format and download
- hotpotqa/prepare.py calls shared prepare_validation()
- hotpotqa_closedbook/prepare.py copies from hotpotqa (or runs shared prep then copy)
- Update docs/evaluation/other-benchmarks.md with unified prep description

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
- Metrics: use last fenced JSON block; fail fast on required prediction fields
- Prepare: atomic write (temp then replace); zip(..., strict=True)
- Filtering: word-boundary regex for short alternatives (fix numeric substring match)

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Variance was computed correctly but stored in the [0, 1] range; the table showed ± 0.00 because as_float formatted the raw std. Now scale by 100 so that e.g. 62.92 ± 0.25 is displayed.

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
…scaling

- other-benchmarks: use --benchmarks=hotpotqa:4 and hotpotqa_closedbook:4 in examples
- base.py: add inline comment explaining 100.0 * metric_std (display % consistency)

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
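The display fix described in the commit messages above can be sketched as follows; `format_percent_metric` is a hypothetical stand-in for the repo's as_float path:

```python
import math

def format_percent_metric(values):
    # Both the mean and the std of a metric stored in [0, 1] are scaled by
    # 100 before formatting, so a small std no longer rounds to "± 0.00".
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return f"{100.0 * mean:.2f} ± {100.0 * std:.2f}"

print(format_percent_metric([0.630, 0.625, 0.633, 0.628]))  # 62.90 ± 0.29
```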
@Kipok Kipok merged commit 2b0a84d into main Mar 12, 2026
6 checks passed
@Kipok Kipok deleted the mfathi/hotpotqa branch March 12, 2026 04:07
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
Co-authored-by: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
Co-authored-by: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>