Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough

Adds HotpotQA support: documentation, dataset preparation for distractor and closed-book variants, prompt configs, HotpotQA-specific filtering and full evaluation metrics, and registry entries; includes new preparation utilities and evaluation scripts.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant DataPrep as Data Preparation
    participant Model as Model Generation
    participant Metrics as Metrics Computation
    participant Results as Results Output
    User->>DataPrep: run prepare (distractor or closed-book)
    activate DataPrep
    DataPrep->>DataPrep: download/format validation.jsonl
    DataPrep-->>User: validation.jsonl
    deactivate DataPrep
    User->>Model: run evaluation (prompt config + validation.jsonl)
    activate Model
    Model->>Model: generate JSON outputs (answer[, supporting_facts])
    Model-->>Metrics: send predictions
    deactivate Model
    activate Metrics
    Metrics->>Metrics: extract answer & supporting_facts
    Metrics->>Metrics: normalize GTs (hotpotqa_filtering)
    Metrics->>Metrics: compute EM/F1, SP, joint, pass@k (respect closed-book)
    Metrics-->>Results: aggregated metrics (filtered & unfiltered)
    deactivate Metrics
    Results->>User: display/save evaluation results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/other-benchmarks.md`:
- Line 140: Replace the non-descriptive link text "here" that points to
https://github.com/hotpotqa/hotpot with a descriptive label; for example change
the line "Original benchmark source is
[here](https://github.com/hotpotqa/hotpot)." to something like "Original
benchmark source: HotpotQA (https://github.com/hotpotqa/hotpot)" or "Original
benchmark source: HotpotQA repository" so the link text is meaningful and
lint-friendly.
- Around line 193-211: The fenced result blocks for the hotpotqa and
hotpotqa_closedbook examples are missing a fence language and trigger
markdownlint MD040; update the opening fences for both triple-backtick blocks to
include the explicit language specifier `text` (i.e., change ``` to ```text for
the first hotpotqa block and the second hotpotqa_closedbook block) so both
fenced output blocks are language-tagged.
In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py`:
- Around line 64-68: The current write loop opens output_file and streams
entries which risks truncation on failure; instead, first materialize all
formatted entries by iterating over ds and calling format_entry to build a list
(or generator collected into memory/temporary buffer), then open a temporary
file (e.g., output_file.with_suffix(".tmp")) and write the collected entries
using the same json.dump + newline logic inside the tqdm loop, flush and fsync,
and finally atomically replace the original output_file with the temp
(rename/replace) so that writing is atomic and avoids partial overwrite if the
run fails mid-way.
- Line 30: The zip calls in prepare.py currently iterate with
zip(context["title"], context["sentences"]) (and the similar zip over
supporting_facts fields) which can silently truncate mismatched-length
iterables; change both occurrences to use zip(..., strict=True) so a ValueError
is raised on length mismatch and misaligned title/sentences pairs are caught
early (update the zip in the loop over context["title"]/context["sentences"] and
the zip over
context["supporting_facts"]["title"]/context["supporting_facts"]["sentences"]
accordingly).
In `@nemo_skills/dataset/hotpotqa/prepare.py`:
- Around line 64-68: The current loop writes each formatted entry directly to
output_file and can leave a truncated file if formatting fails mid-run; first
iterate ds and collect or stream-serialize all formatted entries by calling
format_entry(entry) and accumulating them (e.g., a list or generator
materialized into a list), then open a temporary file in the same directory (use
tempfile.NamedTemporaryFile or mkstemp) and write the JSONL lines (the json.dump
+ fout.write("\n") logic currently in the loop) to that temp file, close/flush
it, and finally atomically replace output_file with os.replace(temp_path,
output_file) so the write of output_file is atomic and cannot be left partially
written if format_entry raises. Ensure you reference the existing names: ds,
format_entry, tqdm, output_file and preserve the same encoding/"wt" write mode
when writing the temp file.
- Line 34: The two loops iterating with zip over context fields (the for loop
using zip(context["title"], context["sentences"]) and the second zip at the
other loop) must be made strict to prevent silent truncation when arrays
mismatch; update both zip(...) calls to zip(..., strict=True) so any length
mismatch raises immediately, preserving data integrity for the variables title,
sentences (and the other zipped variables).
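The compute-first, atomically-replace flow requested for both prepare scripts can be sketched as follows (a minimal standalone illustration; `write_jsonl_atomic` is a hypothetical helper name, not code from the PR):

```python
import json
import os
import tempfile


def write_jsonl_atomic(entries, output_path):
    # Materialize (and thereby validate) every line before touching the target,
    # so a formatting failure cannot truncate an existing output file.
    lines = [json.dumps(entry) for entry in entries]

    out_dir = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fout:
            for line in lines:
                fout.write(line + "\n")
            fout.flush()
            os.fsync(fout.fileno())
        # Atomic on POSIX: readers see either the old file or the complete new one.
        os.replace(tmp_path, output_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Writing the temp file in the same directory as the target matters: `os.replace` is only atomic within a single filesystem.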
In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py`:
- Around line 24-28: The exported names in the __all__ list are not sorted per
Ruff RUF022; reorder the entries in the __all__ list for
nemo_skills/evaluation/metrics/hotpotqa_filtering.py so they follow isort-style
alphabetical order (e.g., "is_correct", "is_correct_strict", "normalize_gt") to
avoid lint failures; update the __all__ variable accordingly.
In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py`:
- Around line 171-175: The loop over fenced JSON blocks currently returns the
first valid parse; change it to collect all matches (e.g., md_matches =
list(re.finditer(...))), iterate them in reverse (for md_match in
reversed(md_matches)) and call _try_parse_answer_json(md_match.group(1)) so the
last valid fenced JSON block is used; ensure you keep the same regex and only
adjust the iteration order/collection and return the first non-None result found
from the reversed list.
- Around line 116-120: The code currently uses dict.get(..., default) for
required fields which masks missing data; update _try_parse_answer_json() to
remove the redundant default for "answer" (since you already check for its
presence) and access parsed["answer"] directly, and replace
parsed.get("supporting_facts", []) with direct parsed["supporting_facts"] in
that function; similarly in _get_score_dict() replace use of generation and
expected_answer via .get(...) with direct indexing (generation =
record["generation"], expected_answer = record["expected_answer"]) and in
update() access record["expected_answer"] directly so missing keys raise
KeyError and surface bad records immediately.
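The reversed-iteration fix for the fenced-JSON parser can be sketched as follows, reusing the regex quoted in the comment (the `_try_parse_answer_json` stand-in is a simplified assumption about the PR's helper, accepting any JSON object with an "answer" key):

````python
import json
import re


def _try_parse_answer_json(candidate):
    # Simplified stand-in: accept only JSON objects carrying an "answer" field.
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) and "answer" in parsed else None


def extract_last_fenced_json(text):
    # Collect all fenced JSON blocks, then walk them in reverse so the
    # model's final answer wins over earlier drafts.
    md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
    for md_match in reversed(md_matches):
        result = _try_parse_answer_json(md_match.group(1))
        if result is not None:
            return result
    return None
````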
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1116a35a-d1cd-4fdb-b5d6-32fc9efe9ba4
📒 Files selected for processing (10)
- docs/evaluation/other-benchmarks.md
- nemo_skills/dataset/hotpotqa/__init__.py
- nemo_skills/dataset/hotpotqa/prepare.py
- nemo_skills/dataset/hotpotqa_closedbook/__init__.py
- nemo_skills/dataset/hotpotqa_closedbook/prepare.py
- nemo_skills/evaluation/metrics/hotpotqa_filtering.py
- nemo_skills/evaluation/metrics/hotpotqa_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/prompt/config/eval/hotpotqa.yaml
- nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
Force-pushed from 6b1c487 to 4431549
Actionable comments posted: 1
♻️ Duplicate comments (9)
docs/evaluation/other-benchmarks.md (2)
193-193: ⚠️ Potential issue | 🟡 Minor: Add fence languages to the result blocks.
Both output fences should be language-tagged (MD040).
Suggested fix

````diff
-```
+```text
 -----------------------------------------------------------------------------
 hotpotqa
 -----------------------------------------------------------------------------
 ...
@@
-```
+```text
 -----------------------------------------
 hotpotqa_closedbook
 ------------------------------------------
 ...
````

Also applies to: 204-204
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/evaluation/other-benchmarks.md` at line 193, Update the two fenced result blocks so they include a language tag (e.g., "text") after the opening triple backticks; specifically add the language tag to the fence that contains the "hotpotqa" result block and to the fence that contains the "hotpotqa_closedbook" result block so both are ```text ... ``` to satisfy MD040.
140-140: ⚠️ Potential issue | 🟡 Minor: Use descriptive link text on Line 140.
The current “here” anchor is non-descriptive and triggers MD059.
Suggested fix

```diff
-- Original benchmark source is [here](https://github.com/hotpotqa/hotpot).
+- Original benchmark source is [HotpotQA benchmark repository](https://github.com/hotpotqa/hotpot).
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/evaluation/other-benchmarks.md` at line 140, Replace the non-descriptive link anchor "here" with a descriptive label such as "HotpotQA repository" (or similar meaningful text) in the markdown link to https://github.com/hotpotqa/hotpot to resolve MD059; locate the current anchor text "here" and update it to the new descriptive link text while keeping the URL unchanged.

nemo_skills/evaluation/metrics/hotpotqa_metrics.py (2)
172-175: ⚠️ Potential issue | 🟡 Minor: Use the last valid fenced JSON block, not the first.
Line 172–175 currently returns the first valid fenced JSON, which conflicts with the parser contract in the docstring and can select draft output over the final answer.
Suggested fix

````diff
-    for md_match in re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL):
+    md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
+    for md_match in reversed(md_matches):
         result = _try_parse_answer_json(md_match.group(1))
         if result is not None:
             return result
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 172 - 175, The current loop over re.finditer returns the first valid fenced JSON block but the parser needs to prefer the last valid block; change the logic in the block that uses re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL) to collect matches into a list (e.g., md_matches = list(re.finditer(...))) and iterate over them in reverse order, calling _try_parse_answer_json(md_match.group(1)) for each and returning the first non-None result found (thereby returning the last valid fenced JSON in the original text).
117-120: ⚠️ Potential issue | 🟠 Major: Fail fast on required keys; avoid `.get()` defaults in required schema paths.

Using defaults here can silently score malformed records instead of surfacing schema issues.
Suggested fix

```diff
-    answer = str(parsed.get("answer", ""))
+    answer = str(parsed["answer"])
@@
-    generation = prediction.get("generation", "")
-    expected_answer = prediction.get("expected_answer", "")
+    generation = prediction["generation"]
+    expected_answer = prediction["expected_answer"]
@@
-    gold_sp = prediction.get("supporting_facts", [])
+    gold_sp = prediction["supporting_facts"]
@@
-    expected_answer = predictions[0].get("expected_answer", "")
+    expected_answer = predictions[0]["expected_answer"]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".

Also applies to: 215-217, 236-237, 273-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 117 - 120, The code is silently defaulting required schema fields using parsed.get(...), which can hide malformed records; replace uses of parsed.get("answer", ""), parsed.get("supporting_facts", []), and other .get(...) accesses for required keys with direct indexing (e.g., parsed["answer"], parsed["supporting_facts"]) so the code fails fast with a clear KeyError when required fields are missing; apply the same change to the other occurrences noted (the later blocks around the same variable names) ensuring you only use direct access for keys that are validated as present earlier in the function.nemo_skills/evaluation/metrics/hotpotqa_filtering.py (1)
24-28: ⚠️ Potential issue | 🟡 Minor: Sort `__all__` to satisfy RUF022.

Please keep exported symbols in isort-style order to avoid lint drift.
Suggested fix
```diff
 __all__ = [
-    "normalize_gt",
     "is_correct",
     "is_correct_strict",
+    "normalize_gt",
 ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py` around lines 24 - 28, The __all__ export list is not sorted; reorder the list in isort-style (alphabetical) order so exported symbols are stable — update the __all__ assignment to ["is_correct", "is_correct_strict", "normalize_gt"] referencing the existing symbol names normalize_gt, is_correct, and is_correct_strict to satisfy RUF022.

nemo_skills/dataset/hotpotqa/prepare.py (2)
34-34: ⚠️ Potential issue | 🟠 Major: Use `zip(..., strict=True)` for alignment safety.

Line 34 and Line 44 can silently truncate mismatched arrays, causing undetected record corruption.
Suggested fix

```diff
-        for title, sentences in zip(context["title"], context["sentences"]):
+        for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )
```

Also applies to: 44-44
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa/prepare.py` at line 34, Replace the two uses of zip over context lists so mismatched lengths raise errors: change the loops using zip(context["title"], context["sentences"]) (and the similar occurrence at the other location) to zip(context["title"], context["sentences"], strict=True) so alignment mismatches are detected instead of silently truncating records; update both occurrences where those variables are iterated.
64-68: ⚠️ Potential issue | 🟠 Major: Write `validation.jsonl` atomically to avoid truncated outputs.

Current flow rewrites the final file while formatting. A mid-run failure can leave a partial dataset.
Suggested fix

```diff
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+    tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+    with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_file.replace(output_file)
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa/prepare.py` around lines 64 - 68, The current loop writes directly to output_file and can leave a partially-written file on failure; instead collect or stream writes into a temporary file and atomically replace output_file when complete. Modify the block that iterates over ds and calls format_entry(entry) and json.dump by writing to a tempfile in the same directory (e.g., using tempfile.NamedTemporaryFile or TemporaryDirectory) and, after the loop finishes and the temp file is closed/flushed, call os.replace(temp_path, output_file) to atomically rename; ensure you keep the same encoding and newline behavior as the original and reference the existing variables/functions format_entry, ds, output_file and the tqdm loop so the change replaces only the write-out logic.nemo_skills/dataset/hotpotqa_closedbook/prepare.py (2)
30-30: ⚠️ Potential issue | 🟠 Major: Use `zip(..., strict=True)` to prevent silent field truncation.

Line 30 and Line 40 can silently drop tail elements on length mismatches, which corrupts context/supporting-fact alignment.
Suggested fix

```diff
-        for title, sentences in zip(context["title"], context["sentences"]):
+        for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )
```

Also applies to: 40-40
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` at line 30, The for-loops iterating with zip over context["title"] and context["sentences"] (and the similar zip over context["supporting_facts"] / sentences) can silently truncate mismatched lists; replace those zip calls with zip(..., strict=True) so a ValueError is raised on length mismatch (locate the loops that use zip(context["title"], context["sentences"]) and the other zip over supporting_facts/sentences and add the strict=True argument).
64-68: ⚠️ Potential issue | 🟠 Major: Avoid partial overwrite by staging computation before writing.

If formatting fails mid-loop, `validation.jsonl` can be left truncated. Compute first, write to a temp file, then atomically replace.

Suggested fix
```diff
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+    tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+    with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_file.replace(output_file)
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` around lines 64 - 68, The loop that writes formatted entries directly to output_file can leave validation.jsonl truncated if formatting fails; instead, in prepare.py compute all formatted entries first (e.g., collect results from iterating ds and calling format_entry(entry) with tqdm), write those results to a temporary file (use tempfile.NamedTemporaryFile or a .tmp Path), and then atomically replace the target file (use os.replace or Path.replace) to swap the temp into output_file; update the block around the ds/tqdm loop and the json dumping logic to perform staging then atomic rename.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/other-benchmarks.md`:
- Around line 130-150: The CI is missing HotpotQA variants ('hotpotqa' and
'hotpotqa_closedbook') from default GPU and CPU benchmark runs; update the CI
workflow definitions (gpu_tests.yml and tests.yml) to add jobs or matrix entries
that run evaluation for these two slugs (ensure data prep step runs ns
prepare_data hotpotqa and ns prepare_data hotpotqa_closedbook) so the pipeline
exercises the HotpotQA metric logic (supporting-facts EM/F1, Joint EM/F1,
alternative-aware substring matching and filtering) during PR checks; ensure the
new CI steps invoke the same evaluation command used by existing benchmarks so
coverage and reporting match other datasets.
---
Duplicate comments:
In `@docs/evaluation/other-benchmarks.md`:
- Line 193: Update the two fenced result blocks so they include a language tag
(e.g., "text") after the opening triple backticks; specifically add the language
tag to the fence that contains the "hotpotqa" result block and to the fence that
contains the "hotpotqa_closedbook" result block so both are ```text ... ``` to
satisfy MD040.
- Line 140: Replace the non-descriptive link anchor "here" with a descriptive
label such as "HotpotQA repository" (or similar meaningful text) in the markdown
link to https://github.com/hotpotqa/hotpot to resolve MD059; locate the current
anchor text "here" and update it to the new descriptive link text while keeping
the URL unchanged.
In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py`:
- Line 30: The for-loops iterating with zip over context["title"] and
context["sentences"] (and the similar zip over context["supporting_facts"] /
sentences) can silently truncate mismatched lists; replace those zip calls with
zip(..., strict=True) so a ValueError is raised on length mismatch (locate the
loops that use zip(context["title"], context["sentences"]) and the other zip
over supporting_facts/sentences and add the strict=True argument).
- Around line 64-68: The loop that writes formatted entries directly to
output_file can leave validation.jsonl truncated if formatting fails; instead,
in prepare.py compute all formatted entries first (e.g., collect results from
iterating ds and calling format_entry(entry) with tqdm), write those results to
a temporary file (use tempfile.NamedTemporaryFile or a .tmp Path), and then
atomically replace the target file (use os.replace or Path.replace) to swap the
temp into output_file; update the block around the ds/tqdm loop and the json
dumping logic to perform staging then atomic rename.
In `@nemo_skills/dataset/hotpotqa/prepare.py`:
- Line 34: Replace the two uses of zip over context lists so mismatched lengths
raise errors: change the loops using zip(context["title"], context["sentences"])
(and the similar occurrence at the other location) to zip(context["title"],
context["sentences"], strict=True) so alignment mismatches are detected instead
of silently truncating records; update both occurrences where those variables
are iterated.
- Around line 64-68: The current loop writes directly to output_file and can
leave a partially-written file on failure; instead collect or stream writes into
a temporary file and atomically replace output_file when complete. Modify the
block that iterates over ds and calls format_entry(entry) and json.dump by
writing to a tempfile in the same directory (e.g., using
tempfile.NamedTemporaryFile or TemporaryDirectory) and, after the loop finishes
and the temp file is closed/flushed, call os.replace(temp_path, output_file) to
atomically rename; ensure you keep the same encoding and newline behavior as the
original and reference the existing variables/functions format_entry, ds,
output_file and the tqdm loop so the change replaces only the write-out logic.
In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py`:
- Around line 24-28: The __all__ export list is not sorted; reorder the list in
isort-style (alphabetical) order so exported symbols are stable — update the
__all__ assignment to ["is_correct", "is_correct_strict", "normalize_gt"]
referencing the existing symbol names normalize_gt, is_correct, and
is_correct_strict to satisfy RUF022.
In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py`:
- Around line 172-175: The current loop over re.finditer returns the first valid
fenced JSON block but the parser needs to prefer the last valid block; change
the logic in the block that uses re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```",
text, re.DOTALL) to collect matches into a list (e.g., md_matches =
list(re.finditer(...))) and iterate over them in reverse order, calling
_try_parse_answer_json(md_match.group(1)) for each and returning the first
non-None result found (thereby returning the last valid fenced JSON in the
original text).
- Around line 117-120: The code is silently defaulting required schema fields
using parsed.get(...), which can hide malformed records; replace uses of
parsed.get("answer", ""), parsed.get("supporting_facts", []), and other
.get(...) accesses for required keys with direct indexing (e.g.,
parsed["answer"], parsed["supporting_facts"]) so the code fails fast with a
clear KeyError when required fields are missing; apply the same change to the
other occurrences noted (the later blocks around the same variable names)
ensuring you only use direct access for keys that are validated as present
earlier in the function.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 55fb65a2-f629-43b5-8970-f2df118b5667
📒 Files selected for processing (10)
- docs/evaluation/other-benchmarks.md
- nemo_skills/dataset/hotpotqa/__init__.py
- nemo_skills/dataset/hotpotqa/prepare.py
- nemo_skills/dataset/hotpotqa_closedbook/__init__.py
- nemo_skills/dataset/hotpotqa_closedbook/prepare.py
- nemo_skills/evaluation/metrics/hotpotqa_filtering.py
- nemo_skills/evaluation/metrics/hotpotqa_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/prompt/config/eval/hotpotqa.yaml
- nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
🚧 Files skipped from review as they are similar to previous changes (4)
- nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
- nemo_skills/dataset/hotpotqa_closedbook/__init__.py
- nemo_skills/dataset/hotpotqa/__init__.py
- nemo_skills/prompt/config/eval/hotpotqa.yaml
♻️ Duplicate comments (6)
nemo_skills/dataset/hotpotqa_closedbook/prepare.py (2)
64-68: ⚠️ Potential issue | 🟠 Major: Write atomically after formatting to prevent partial output corruption.

If formatting fails mid-run, `validation.jsonl` can be left truncated.

Suggested fix
```diff
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+    tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+    with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_file.replace(output_file)
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` around lines 64 - 68, Currently the code opens output_file for writing immediately and streams formatted entries, which can leave validation.jsonl truncated if format_entry fails; instead first iterate ds and collect formatted entries into a list (calling format_entry for each and letting exceptions surface before opening the file), then reopen output_file and write the collected formatted objects with json.dump and fout.write("\n") (using the same tqdm/desc if desired) so the file is only truncated/replaced after all formatting succeeds; reference functions/vars: format_entry, ds, output_file, json.dump, tqdm.
30-30: ⚠️ Potential issue | 🟠 Major: Add `strict=True` to both `zip()` calls.

Both loops can silently truncate on length mismatch, which can misalign context/supporting-fact fields.
Suggested fix
```diff
-        for title, sentences in zip(context["title"], context["sentences"]):
+        for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )
```

Also applies to: 40-40
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py` at line 30, The two zip() loops that iterate over context["title"] and context["sentences"] (and the other zip that pairs the supporting-fact/title lists) can silently truncate on length mismatch; update both zip() calls to use zip(..., strict=True) so mismatched lengths raise an error and prevent misaligned context/supporting-fact fields, locating the zip calls that unpack into variables like title, sentences and the one that unpacks supporting-fact pairs and adding strict=True to each.

nemo_skills/dataset/hotpotqa/prepare.py (2)
34-34: ⚠️ Potential issue | 🟠 Major: Use `strict=True` in both `zip()` calls to guard schema alignment.

Without strict mode, mismatched lengths silently truncate and can corrupt row alignment.
Suggested fix
```diff
-        for title, sentences in zip(context["title"], context["sentences"]):
+        for title, sentences in zip(context["title"], context["sentences"], strict=True):
@@
-    supporting_facts = list(zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"]))
+    supporting_facts = list(
+        zip(entry["supporting_facts"]["title"], entry["supporting_facts"]["sent_id"], strict=True)
+    )
```

Also applies to: 44-44
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa/prepare.py` at line 34, The zip iteration over context["title"] and context["sentences"] (and the other zip on lines 44) should use zip(..., strict=True) to ensure the two sequences are the same length and fail fast on schema misalignment; update the for loops that destructure (title, sentences) and the second zip to pass strict=True so mismatched lengths raise a ValueError instead of silently truncating.
64-68: ⚠️ Potential issue | 🟠 Major: Avoid partial overwrite of `validation.jsonl` on failure.

Current flow can leave a truncated target file if `format_entry()` raises mid-loop.

Suggested fix
```diff
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_file.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_file.name}")]
+    tmp_output_file = output_file.with_suffix(".jsonl.tmp")
+    with open(tmp_output_file, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_file.replace(output_file)
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa/prepare.py` around lines 64 - 68, Do not stream-write while calling format_entry on ds; instead fully compute/collect all formatted entries first (e.g., iterate ds and call format_entry, storing results in a list) and only after that open output_file for writing and dump each precomputed formatted entry with json.dump and fout.write; alternatively write to a temporary file and atomically rename into place—refer to format_entry, ds, output_file, tqdm and the existing write loop (fout) to locate and change the logic.nemo_skills/evaluation/metrics/hotpotqa_metrics.py (2)
172-175: ⚠️ Potential issue | 🟡 Minor

Use the last valid fenced JSON block, not the first.

Current loop returns the first fenced JSON, which contradicts the parser contract in the docstring.

Suggested fix

````diff
-    for md_match in re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL):
+    md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
+    for md_match in reversed(md_matches):
         result = _try_parse_answer_json(md_match.group(1))
         if result is not None:
             return result
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 172 - 175, The current loop in hotpotqa_metrics.py returns the first fenced JSON block found, but the parser contract requires using the last valid fenced JSON; change the logic in the block that calls re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL) so you collect matches into a list (md_matches = list(re.finditer(...))) and then iterate them in reverse (for md_match in reversed(md_matches)): call _try_parse_answer_json(md_match.group(1)) and return the first non-None result found, ensuring the last fenced JSON in the text is preferred.
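The reversed-scan contract can be checked in isolation. The parser below is a simplified stand-in for `_try_parse_answer_json` (it only calls `json.loads`), built to avoid literal backtick fences inside the example:

```python
import json
import re

FENCE = "`" * 3  # triple backtick, constructed to keep this example readable

def last_fenced_json(text):
    """Return the last parseable fenced JSON object in `text`, or None."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    for match in reversed(list(re.finditer(pattern, text, re.DOTALL))):
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
    return None

# A reasoning trace with an early draft answer and a later final answer.
reply = (
    f'Draft: {FENCE}json\n{{"answer": "Paris?"}}\n{FENCE} '
    f'Final: {FENCE}json\n{{"answer": "Paris"}}\n{FENCE}'
)
final = last_fenced_json(reply)
```

Scanning from the end ensures the final block wins even when earlier drafts also parse.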
215-216: ⚠️ Potential issue | 🟠 Major

Use direct indexing for required prediction fields (fail fast on bad records).

These defaults can silently turn malformed inputs into incorrect metrics.

Suggested fix

```diff
-    generation = prediction.get("generation", "")
-    expected_answer = prediction.get("expected_answer", "")
+    generation = prediction["generation"]
+    expected_answer = prediction["expected_answer"]
@@
-    gold_sp = prediction.get("supporting_facts", [])
+    gold_sp = prediction["supporting_facts"]
@@
-    expected_answer = predictions[0].get("expected_answer", "")
+    expected_answer = predictions[0]["expected_answer"]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".

Also applies to: 236-236, 273-273
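The fail-fast behavior being asked for here is just direct indexing; a toy illustration with hypothetical record fields:

```python
def score_record(prediction):
    """Index required fields directly so a malformed record raises KeyError."""
    generation = prediction["generation"]
    expected_answer = prediction["expected_answer"]
    return generation == expected_answer

ok = score_record({"generation": "yes", "expected_answer": "yes"})

try:
    score_record({"generation": "yes"})  # missing required field
except KeyError as err:
    missing = err.args[0]  # "expected_answer"
```

With `.get(..., "")`, the second record would silently score as a mismatch; with direct indexing, the missing key surfaces immediately with its name.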
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 215 - 216, Replace optional .get() usage with direct dictionary indexing for required prediction fields so malformed records fail fast: change prediction.get("generation", "") and prediction.get("expected_answer", "") to prediction["generation"] and prediction["expected_answer"], and apply the same replacement to the other prediction.get(...) occurrences noted (the other two locations flagged). Ensure any field that your metric logic requires uses direct indexing (prediction["..."]) so missing keys raise immediately instead of silently using defaults.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/dataset/hotpotqa_closedbook/prepare.py`:
- Around line 64-68: Currently the code opens output_file for writing
immediately and streams formatted entries, which can leave validation.jsonl
truncated if format_entry fails; instead first iterate ds and collect formatted
entries into a list (calling format_entry for each and letting exceptions
surface before opening the file), then reopen output_file and write the
collected formatted objects with json.dump and fout.write("\n") (using the same
tqdm/desc if desired) so the file is only truncated/replaced after all
formatting succeeds; reference functions/vars: format_entry, ds, output_file,
json.dump, tqdm.
- Line 30: The two zip() loops that iterate over context["title"] and
context["sentences"] (and the other zip that pairs the supporting-fact/title
lists) can silently truncate on length mismatch; update both zip() calls to use
zip(..., strict=True) so mismatched lengths raise an error and prevent
misaligned context/supporting-fact fields, locating the zip calls that unpack
into variables like title, sentences and the one that unpacks supporting-fact
pairs and adding strict=True to each.
In `@nemo_skills/dataset/hotpotqa/prepare.py`:
- Line 34: The zip iteration over context["title"] and context["sentences"] (and
the other zip on lines 44) should use zip(..., strict=True) to ensure the two
sequences are the same length and fail fast on schema misalignment; update the
for loops that destructure (title, sentences) and the second zip to pass
strict=True so mismatched lengths raise a ValueError instead of silently
truncating.
- Around line 64-68: Do not stream-write while calling format_entry on ds;
instead fully compute/collect all formatted entries first (e.g., iterate ds and
call format_entry, storing results in a list) and only after that open
output_file for writing and dump each precomputed formatted entry with json.dump
and fout.write; alternatively write to a temporary file and atomically rename
into place—refer to format_entry, ds, output_file, tqdm and the existing write
loop (fout) to locate and change the logic.
In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py`:
- Around line 172-175: The current loop in hotpotqa_metrics.py returns the first
fenced JSON block found, but the parser contract requires using the last valid
fenced JSON; change the logic in the block that calls
re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL) so you collect
matches into a list (md_matches = list(re.finditer(...))) and then iterate them
in reverse (for md_match in reversed(md_matches)): call
_try_parse_answer_json(md_match.group(1)) and return the first non-None result
found, ensuring the last fenced JSON in the text is preferred.
- Around line 215-216: Replace optional .get() usage with direct dictionary
indexing for required prediction fields so malformed records fail fast: change
prediction.get("generation", "") and prediction.get("expected_answer", "") to
prediction["generation"] and prediction["expected_answer"], and apply the same
replacement to the other prediction.get(...) occurrences noted (the other two
locations flagged). Ensure any field that your metric logic requires uses direct
indexing (prediction["..."]) so missing keys raise immediately instead of
silently using defaults.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 139606ad-3f62-4174-a35c-acbc6bf4d99f
📒 Files selected for processing (10)
- docs/evaluation/other-benchmarks.md
- nemo_skills/dataset/hotpotqa/__init__.py
- nemo_skills/dataset/hotpotqa/prepare.py
- nemo_skills/dataset/hotpotqa_closedbook/__init__.py
- nemo_skills/dataset/hotpotqa_closedbook/prepare.py
- nemo_skills/evaluation/metrics/hotpotqa_filtering.py
- nemo_skills/evaluation/metrics/hotpotqa_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/prompt/config/eval/hotpotqa.yaml
- nemo_skills/prompt/config/eval/hotpotqa_closedbook.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- nemo_skills/prompt/config/eval/hotpotqa.yaml
- nemo_skills/dataset/hotpotqa/init.py
- nemo_skills/dataset/hotpotqa_closedbook/init.py
ae8e340 to 46c4809
Actionable comments posted: 2
♻️ Duplicate comments (3)
nemo_skills/evaluation/metrics/hotpotqa_metrics.py (2)
172-175: ⚠️ Potential issue | 🟡 Minor

Scan fenced JSON blocks from the end.

This branch still returns the first valid fenced block, which contradicts the parser contract and can lock onto an earlier draft answer when reasoning contains multiple JSON snippets.

Suggested fix

````diff
-    for md_match in re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL):
+    md_matches = list(re.finditer(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL))
+    for md_match in reversed(md_matches):
         result = _try_parse_answer_json(md_match.group(1))
         if result is not None:
             return result
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 172 - 175, The loop that scans fenced JSON blocks uses re.finditer and returns the first valid match, which can pick an earlier draft; change it to collect matches into a list (e.g., md_matches = list(re.finditer(...))) and iterate that list in reverse (for md_match in reversed(md_matches)) so you call _try_parse_answer_json on the last fenced JSON blocks first and return the first successful result.
117-120: ⚠️ Potential issue | 🟠 Major

Fail fast on required record fields.

The empty defaults here turn schema drift into silent zero-scores. These fields are required in this pipeline, so they should be indexed directly.

Suggested fix

```diff
-    answer = str(parsed.get("answer", ""))
+    answer = str(parsed["answer"])
@@
-    generation = prediction.get("generation", "")
-    expected_answer = prediction.get("expected_answer", "")
+    generation = prediction["generation"]
+    expected_answer = prediction["expected_answer"]
@@
-    gold_sp = prediction.get("supporting_facts", [])
+    gold_sp = prediction["supporting_facts"]
@@
-    expected_answer = predictions[0].get("expected_answer", "")
+    expected_answer = predictions[0]["expected_answer"]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".

Also applies to: 215-216, 236-237, 273-274
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 117 - 120, The code currently uses parsed.get("answer", "") and parsed.get("supporting_facts", []) which silently mask schema drift; change these to direct indexing (parsed["answer"] and parsed["supporting_facts"]) and ensure the initial guard verifies parsed is a dict and contains these keys so the function fails fast; apply the same replacement for the other similar blocks that access required fields (the other occurrences that use parsed.get for "answer" and "supporting_facts") so missing keys raise clear errors instead of producing silent defaults.

nemo_skills/dataset/hotpotqa/prepare_utils.py (1)
71-79: ⚠️ Potential issue | 🟠 Major

Write `validation.jsonl` atomically.

If `format_entry()` or `json.dump()` fails mid-loop, this overwrites the final file with a partial dataset. Stage the formatted rows first, write to a temp file, then replace the target.

Suggested fix

```diff
-    with open(output_path, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(ds, desc=f"Writing {output_path.name}"):
-            formatted = format_entry(entry)
-            json.dump(formatted, fout)
-            fout.write("\n")
+    formatted_entries = [format_entry(entry) for entry in tqdm(ds, desc=f"Formatting {output_path.name}")]
+    tmp_output_path = output_path.with_suffix(".jsonl.tmp")
+    with open(tmp_output_path, "wt", encoding="utf-8") as fout:
+        for formatted in formatted_entries:
+            json.dump(formatted, fout)
+            fout.write("\n")
+    tmp_output_path.replace(output_path)
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa/prepare_utils.py` around lines 71 - 79, The current loop writes directly to output_path and can leave a truncated file if format_entry() or json.dump() fails; instead collect/serialize all formatted rows first (e.g., build a list of formatted JSON strings via format_entry(entry) and json.dumps) and then write them atomically: write to a temporary file in the same directory (using output_path.parent) and when the write completes, atomically replace output_path (e.g., os.replace). Update the block around load_dataset(...), the for loop that calls format_entry and json.dump, and the final print to use this temp-write-and-replace approach so output_path is only replaced on successful completion.
🧹 Nitpick comments (1)
nemo_skills/dataset/hotpotqa/prepare.py (1)
15-24: Please wire this benchmark into CI/slurm coverage if that is not already in the PR.

This adds new dataset prep plus custom metrics, so a smoke path in default benchmark coverage will help catch regressions early.
Based on learnings: When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation; When adding new benchmarks, run GPU tests in CI locally, and ensure the dataset is included in default CI tests. Remove datasets from testing only if they require very heavy data preparation or have strong reasons.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hotpotqa/prepare.py` around lines 15 - 24, Add HotpotQA dataset prep into the CI/slurm benchmark coverage by registering the new prepare script (prepare.py) and its prepare_validation call for the HotpotQA benchmark so a smoke path runs in default benchmark tests and in slurm GPU jobs; update the benchmark test matrix (the default benchmark coverage and slurm test configs) to include the HotpotQA entry and its lightweight data prep step (invoke prepare_validation) so custom metrics are exercised, and if the dataset requires heavy prep, gate it behind a CI tag or skip condition to avoid slowing all runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/hotpotqa/prepare_utils.py`:
- Line 40: Replace the two non-strict zip() uses with zip(..., strict=True) to
avoid silent truncation (e.g., the loop over context entries using for title,
sentences in zip(context["title"], context["sentences"]) and the loop pairing
supporting_facts title/sent_id), and change the write logic in the function (the
block that formats entries and writes to the output file) to first build a list
of all formatted entries in-memory and only open/write the file after the entire
list is successfully created so a mid-loop exception cannot leave a partially
written/truncated file.
In `@nemo_skills/evaluation/metrics/hotpotqa_filtering.py`:
- Around line 279-281: In the short-alternative branch (where len(alt_norm) <=
4) replace the ASCII-letter-only boundary lookarounds used in the re.search call
for alt_norm against ans (currently using (?<![a-z])... (?![a-z])) with
word-character boundaries like (?<!\w)...(?!\w) so numeric alternatives (e.g.,
"10") won't match inside longer numeric tokens (e.g., "210"); update the pattern
construction around alt_norm in that re.search invocation accordingly.
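The boundary change described above can be verified directly; the helper name and sample strings below are illustrative, not taken from the filtering tables:

```python
import re

def matches_short_alt(alt_norm: str, ans: str) -> bool:
    """Match a short alternative only at word-character boundaries,
    so '10' never matches inside '210'."""
    pattern = r"(?<!\w)" + re.escape(alt_norm) + r"(?!\w)"
    return re.search(pattern, ans) is not None

hit = matches_short_alt("10", "the answer is 10 km")
miss = matches_short_alt("10", "the answer is 210 km")
```

The ASCII-letter lookarounds `(?<![a-z])...(?![a-z])` would accept the second case, because the neighboring character `2` is not a letter; `\w` covers digits as well.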
---
Duplicate comments:
In `@nemo_skills/dataset/hotpotqa/prepare_utils.py`:
- Around line 71-79: The current loop writes directly to output_path and can
leave a truncated file if format_entry() or json.dump() fails; instead
collect/serialize all formatted rows first (e.g., build a list of formatted JSON
strings via format_entry(entry) and json.dumps) and then write them atomically:
write to a temporary file in the same directory (using output_path.parent) and
when the write completes, atomically replace output_path (e.g., os.replace).
Update the block around load_dataset(...), the for loop that calls format_entry
and json.dump, and the final print to use this temp-write-and-replace approach
so output_path is only replaced on successful completion.
In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py`:
- Around line 172-175: The loop that scans fenced JSON blocks uses re.finditer
and returns the first valid match, which can pick an earlier draft; change it to
collect matches into a list (e.g., md_matches = list(re.finditer(...))) and
iterate that list in reverse (for md_match in reversed(md_matches)) so you call
_try_parse_answer_json on the last fenced JSON blocks first and return the first
successful result.
- Around line 117-120: The code currently uses parsed.get("answer", "") and
parsed.get("supporting_facts", []) which silently mask schema drift; change
these to direct indexing (parsed["answer"] and parsed["supporting_facts"]) and
ensure the initial guard verifies parsed is a dict and contains these keys so
the function fails fast; apply the same replacement for the other similar blocks
that access required fields (the other occurrences that use parsed.get for
"answer" and "supporting_facts") so missing keys raise clear errors instead of
producing silent defaults.
---
Nitpick comments:
In `@nemo_skills/dataset/hotpotqa/prepare.py`:
- Around line 15-24: Add HotpotQA dataset prep into the CI/slurm benchmark
coverage by registering the new prepare script (prepare.py) and its
prepare_validation call for the HotpotQA benchmark so a smoke path runs in
default benchmark tests and in slurm GPU jobs; update the benchmark test matrix
(the default benchmark coverage and slurm test configs) to include the HotpotQA
entry and its lightweight data prep step (invoke prepare_validation) so custom
metrics are exercised, and if the dataset requires heavy prep, gate it behind a
CI tag or skip condition to avoid slowing all runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3762d795-9036-4e31-9f19-56bea8038494
📒 Files selected for processing (8)
- docs/evaluation/other-benchmarks.md
- nemo_skills/dataset/hotpotqa/__init__.py
- nemo_skills/dataset/hotpotqa/prepare.py
- nemo_skills/dataset/hotpotqa/prepare_utils.py
- nemo_skills/dataset/hotpotqa_closedbook/__init__.py
- nemo_skills/dataset/hotpotqa_closedbook/prepare.py
- nemo_skills/evaluation/metrics/hotpotqa_filtering.py
- nemo_skills/evaluation/metrics/hotpotqa_metrics.py
🚧 Files skipped from review as they are similar to previous changes (3)
- docs/evaluation/other-benchmarks.md
- nemo_skills/dataset/hotpotqa_closedbook/init.py
- nemo_skills/dataset/hotpotqa/init.py
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/hotpotqa_metrics.py (1)
113-132: Consider using early return for optional `supporting_facts` handling.

The `.get("supporting_facts", [])` on line 120 is appropriate here since `supporting_facts` is optional in the JSON schema. However, the static analysis hint about moving the final return to an else block (TRY300) has merit for readability.

♻️ Optional refactor for cleaner structure

```diff
 def _try_parse_answer_json(text: str) -> tuple[str, list] | None:
     """Try to parse a JSON string as a HotpotQA answer object. Returns (answer, sp) or None."""
     try:
         parsed = json.loads(text)
-        if not isinstance(parsed, dict) or "answer" not in parsed:
-            return None
-        answer = str(parsed["answer"])
-        sp = parsed.get("supporting_facts", [])
-        if isinstance(sp, list):
-            valid_sp = []
-            for item in sp:
-                if isinstance(item, (list, tuple)) and len(item) == 2:
-                    try:
-                        valid_sp.append([str(item[0]), int(item[1])])
-                    except (ValueError, TypeError):
-                        continue
-            return answer, valid_sp
-        return answer, []
     except (json.JSONDecodeError, ValueError, TypeError):
         return None
+    if not isinstance(parsed, dict) or "answer" not in parsed:
+        return None
+    answer = str(parsed["answer"])
+    sp = parsed.get("supporting_facts", [])
+    if isinstance(sp, list):
+        valid_sp = []
+        for item in sp:
+            if isinstance(item, (list, tuple)) and len(item) == 2:
+                try:
+                    valid_sp.append([str(item[0]), int(item[1])])
+                except (ValueError, TypeError):
+                    continue
+        return answer, valid_sp
+    return answer, []
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py` around lines 113 - 132, The parsing function _try_parse_answer_json currently assigns sp = parsed.get("supporting_facts", []) and then branches; refactor to use an early return for the non-list case: after extracting answer, check if supporting_facts is not a list (e.g., sp = parsed.get("supporting_facts"); if not isinstance(sp, list): return answer, []), then proceed to validate and build valid_sp for the list case and return answer, valid_sp; keep the JSON parse try/except and the same validation logic inside the list-handling block so flow is clearer and the final return is only used for the processed list result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@nemo_skills/evaluation/metrics/hotpotqa_metrics.py`:
- Around line 113-132: The parsing function _try_parse_answer_json currently
assigns sp = parsed.get("supporting_facts", []) and then branches; refactor to
use an early return for the non-list case: after extracting answer, check if
supporting_facts is not a list (e.g., sp = parsed.get("supporting_facts"); if
not isinstance(sp, list): return answer, []), then proceed to validate and build
valid_sp for the list case and return answer, valid_sp; keep the JSON parse
try/except and the same validation logic inside the list-handling block so flow
is clearer and the final return is only used for the processed list result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c0838c19-79d3-4b00-8aa5-4fb0481aa106
📒 Files selected for processing (3)
- nemo_skills/dataset/hotpotqa/prepare_utils.py
- nemo_skills/evaluation/metrics/hotpotqa_filtering.py
- nemo_skills/evaluation/metrics/hotpotqa_metrics.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/hotpotqa/prepare_utils.py
3e45a4c to 40a829d
Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Adds distractor and closed-book variants of HotpotQA with official EM/F1 metrics, supporting-fact scoring, alternative-aware matching, and on-the-fly filtering. Includes evaluation documentation with reference results from Nemotron-3-Nano. Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
- Add hotpotqa/prepare_utils.py as single source for format and download
- hotpotqa/prepare.py calls shared prepare_validation()
- hotpotqa_closedbook/prepare.py copies from hotpotqa (or runs shared prep then copy)
- Update docs/evaluation/other-benchmarks.md with unified prep description

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
- Metrics: use last fenced JSON block; fail fast on required prediction fields
- Prepare: atomic write (temp then replace); zip(..., strict=True)
- Filtering: word-boundary regex for short alternatives (fix numeric substring match)

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Variance was computed correctly but stored in [0,1]; table showed ± 0.00 because as_float formatted raw std. Now scale by 100 so e.g. 62.92 ± 0.25. Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
…scaling

- other-benchmarks: use --benchmarks=hotpotqa:4 and hotpotqa_closedbook:4 in examples
- base.py: add inline comment explaining 100.0 * metric_std (display % consistency)

Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
d89ab0b to 25de92c
Signed-off-by: Mahan Fathi <mfathi@nvidia.com>
Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com> Signed-off-by: Mahan Fathi <mfathi@nvidia.com> Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com> Co-authored-by: Meriem B. <113170426+ka00ri@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com> Signed-off-by: Mahan Fathi <mfathi@nvidia.com> Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com> Co-authored-by: Meriem B. <113170426+ka00ri@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Adds evaluation support for HotpotQA multi-hop question answering, with two variants:
- Distractor setting (`hotpotqa`): model receives 10 context paragraphs (2 gold + 8 distractors) and must return the answer and identify supporting-fact sentences
- Closed-book setting (`hotpotqa_closedbook`): same questions, no context provided

What's included:

- Dataset preparation (hotpotqa/hotpot_qa, distractor split, 7405 validation examples)

Tested end-to-end on cluster with Nemotron-3-Nano, gpt-oss-20b, and gpt-oss-120b.
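For reviewers unfamiliar with the official HotpotQA answer scoring mentioned above: it normalizes text (lowercase, strip punctuation and articles, collapse whitespace) before computing exact match and token-level F1. A condensed sketch, which may differ from the official evaluation script in edge cases:

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (HotpotQA-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize_answer(prediction) == normalize_answer(gold)

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # token overlap with multiplicity
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = f1_score("eiffel tower in paris", "the eiffel tower")
```

The supporting-fact and joint metrics in the PR build on the same precision/recall machinery, applied to (title, sentence-id) pairs instead of tokens.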
Summary by CodeRabbit
New Features
Evaluation
Documentation
Prompts