Conversation
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Adds two datasets (HLE-Verified, UGPhysics): docs entries; new dataset packages with prepare scripts and default evaluation configs; UGPhysics metric and registration; UGPhysics generation and judge prompts; small MCQ boxed prompt wording tweak.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Gen as Generation Pipeline
    participant Judge as Judge Service
    participant Metrics as Metrics/Evaluator
    participant Store as Results Store
    User->>Gen: request solution for UGPhysics question
    Gen->>User: generated solution (LaTeX + boxed answer)
    Gen->>Judge: submit reference + generated solution for equivalence
    Judge->>Gen: judgement report (Equivalence: TRUE/FALSE + justification)
    Gen->>Metrics: send prediction + judgement for scoring
    Metrics->>Store: record metric results / produce incorrect sample if needed
    Store-->>User: aggregated evaluation results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 8
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/ugphysics_metrics.py (1)
24-51: Consider adding UGPhysics to CI/slurm tests.

Since this introduces new evaluation metrics logic for the UGPhysics benchmark, ensure the dataset is included in default CI tests and consider adding it to slurm tests for comprehensive evaluation coverage. Based on learnings: "When adding new benchmarks, run GPU tests in CI locally, and ensure the dataset is included in default CI tests" and "When enabling new evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 24 - 51, New UGPhysics evaluation logic (class UGPhysicsMetrics with methods is_correct_judgement and get_incorrect_sample) needs to be exercised in CI and slurm tests; update the CI test matrix and slurm test definitions to include the UGPhysics dataset so these metrics run by default. Specifically, add the UGPhysics dataset/test target to the default CI unit/integration test suite and to the slurm GPU test list used for benchmark evaluations, ensure any required GPU job configuration and dataset download/setup steps are included, and add a small CI test that calls UGPhysicsMetrics.is_correct_judgement and get_incorrect_sample on representative examples to catch regressions locally before pushing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 10-17: The docs update added HLE-Verified and UGPhysics to the
table but omitted benchmark-specific eval examples and expected outcomes; add a
short subsection for each dataset (HLE-Verified and UGPhysics) in the same
evaluation document that shows (1) an example run command for the evaluation
script (matching the repo's eval CLI usage), (2) a minimal example input and the
expected output/answer format, and (3) the expected tested-model metrics or
baseline results (e.g., accuracy/EM or pass@k) and any evaluation split
(test/val) to be used; reference the dataset names HLE-Verified and UGPhysics so
readers can find these entries easily.
In `@nemo_skills/dataset/hle_verified/prepare.py`:
- Around line 56-58: The code silently defaults missing keys by using dict.get
during parsing; change the lambda in the loop that assigns df[field] (the one
using parsed.apply(lambda x, f=field: x.get(f))) to use direct key access (e.g.,
x[f]) so a KeyError is raised on schema drift, and add a small guard that
validates parsed is a dict and re-raises a clearer exception (with field name
and offending row/context) if not; apply the same direct-access fix to the
similar occurrence referenced at the other location (line ~74) so missing keys
fail fast rather than producing None values.
- Around line 78-94: The write_data_to_file function currently opens the output
file before applying filters/formatting which risks partially
written/overwritten files if processing fails; change it to first iterate over
data and collect the filtered/formatted JSON strings (using the same filter
logic referencing HLE_REVERSE_MAP, HLE_VERIFIED_CLASSES_REVERSE_MAP,
entry["image"], and the "text"/"uncertain" check and format_entry) into a list,
and only after successful completion open output_file for writing and dump each
precomputed string (one per line) to the file so no partial files are created on
error.
In `@nemo_skills/dataset/ugphysics/prepare.py`:
- Line 54: The list comprehension that builds descriptions uses
OB_ANS_TYPE_ID2EN.get(t, t) which silently accepts unknown type IDs; change this
to use direct lookup OB_ANS_TYPE_ID2EN[t] (or explicitly validate membership
first) so that missing/invalid answer type IDs raise an error rather than
producing incorrect fallbacks; update the code that constructs descriptions (the
variable descriptions, using types and OB_ANS_TYPE_ID2EN) to either perform an
explicit membership check like "if t not in OB_ANS_TYPE_ID2EN: raise
KeyError(...)" or replace .get with direct indexing OB_ANS_TYPE_ID2EN[t] to
enforce required keys.
- Around line 23-30: Fix typos in the answer-type description mapping in
prepare.py: change "inteval" to "interval" in the "IN" value, change "seperated"
to "separated" and "comma" to "commas" in the "TUP" value (so it reads "multiple
numbers, separated by commas, such as (x, y, z)"). Ensure you update the literal
mapping/dictionary that defines these codes so the corrected strings are used
when injecting prompts.
- Around line 95-103: The save_data function currently formats each entry during
file write, which can leave a partial JSONL if format_entry fails; change
save_data to first build a list of formatted records by calling format_entry for
every entry (e.g., formatted = [format_entry(e) for e in data]) and only after
that open the output file and write the precomputed formatted records (using
tqdm over the formatted list) so all computation completes before any file is
opened for writing; reference save_data and format_entry when making this
change.
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py`:
- Around line 37-40: The fallback regex checks for TRUE/FALSE use case-sensitive
re.search calls on the variable `judgement`, causing lowercase "true"/"false" to
be missed; update those checks (the two re.search calls that precede the returns
True/False) to perform case-insensitive matching (e.g., pass re.IGNORECASE to
re.search or normalize `judgement` to lower/upper before checking) so they
behave consistently with the earlier main pattern.
In `@nemo_skills/prompt/config/judge/ugphysics.yaml`:
- Line 13: Fix the typos and grammar in the judge prompt lines containing "B.
Consider Physiccal Equivalence" and the phrase "answer share" — change
"Physiccal" to "Physical" and change "answer share" to "answer shares" (and
review adjacent sentences for parallel grammar), updating the rubric text in the
ugphysics.yaml prompt entries that contain "B. Consider Physiccal Equivalence"
and the related line at the other occurrence so both instances match corrected
wording.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cc91c017-9f15-4ede-93ed-3c435b1ff72a
📒 Files selected for processing (9)
- `docs/evaluation/scientific-knowledge.md`
- `nemo_skills/dataset/hle_verified/__init__.py`
- `nemo_skills/dataset/hle_verified/prepare.py`
- `nemo_skills/dataset/ugphysics/__init__.py`
- `nemo_skills/dataset/ugphysics/prepare.py`
- `nemo_skills/evaluation/metrics/map_metrics.py`
- `nemo_skills/evaluation/metrics/ugphysics_metrics.py`
- `nemo_skills/prompt/config/generic/ugphysics.yaml`
- `nemo_skills/prompt/config/judge/ugphysics.yaml`
| **[HLE-Verified](https://huggingface.co/datasets/skylenage/HLE-Verified)** | 2,500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | gold+revision text only |
| **[GPQA ](https://huggingface.co/datasets/Idavidrein/gpqa)** | 448 (main)<br>198 (diamond)<br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| **[SuperGPQA](https://huggingface.co/datasets/m-a-p/SuperGPQA)** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test |
| **[MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| **[SciCode](https://huggingface.co/datasets/SciCode1/SciCode)** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val |
| **[FrontierScience](https://huggingface.co/datasets/openai/frontierscience)** | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| **[Physics](https://huggingface.co/datasets/desimfj/PHYSICS)** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
| **[UGPhysics](https://huggingface.co/datasets/UGPhysics/ugphysics)** | 5,520 (EN), 5,520 (ZH) | Open-ended MCQ | Physics | No | EN |
Add benchmark-specific eval examples and expected results for the new datasets.
HLE-Verified and UGPhysics were added to the overview, but this update should also include example run commands and expected tested-model outcomes for these two benchmarks.
As per coding guidelines: "When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models".
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 11-11: Spaces inside link text
(MD039, no-space-in-links)
```python
parsed = df["json"].apply(json.loads)
for field in ("author_name", "rationale", "answer_type", "canary", "image"):
    df[field] = parsed.apply(lambda x, f=field: x.get(f))
```
Fail fast on schema drift instead of silently defaulting.
Using .get() here can quietly produce invalid records (None or unmapped labels) and hide upstream data changes.
Suggested fix
```diff
- parsed = df["json"].apply(json.loads)
- for field in ("author_name", "rationale", "answer_type", "canary", "image"):
-     df[field] = parsed.apply(lambda x, f=field: x.get(f))
+ parsed = df["json"].apply(json.loads)
+ for field in ("author_name", "rationale", "answer_type", "canary", "image"):
+     df[field] = parsed.apply(lambda x, f=field: x[f])
@@
- "verified_class": HLE_VERIFIED_CLASSES_MAP.get(entry["Verified_Classes"], entry["Verified_Classes"]),
+ "verified_class": HLE_VERIFIED_CLASSES_MAP[entry["Verified_Classes"]],
```

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
Also applies to: 74-74
```python
def write_data_to_file(output_file, data, split):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for _, entry in tqdm(data.iterrows(), total=len(data), desc=f"Writing {output_file.name}"):
            # Filter by category for category-specific splits
            if split in HLE_REVERSE_MAP and entry["category"] != HLE_REVERSE_MAP[split]:
                continue
            # Filter by verified class for class-specific splits
            if split in HLE_VERIFIED_CLASSES_REVERSE_MAP:
                if entry["Verified_Classes"] != HLE_VERIFIED_CLASSES_REVERSE_MAP[split]:
                    continue
            if entry["image"]:
                continue
            # text split = text-only entries from Gold + Revision subsets only
            if split == "text" and entry["Verified_Classes"] == HLE_VERIFIED_CLASSES_REVERSE_MAP["uncertain"]:
                continue
            json.dump(format_entry(entry), fout)
            fout.write("\n")
```
Do transformation/filtering before opening the output file.
If formatting/filtering fails mid-loop, current flow can leave partially written files and overwrite valid prior outputs.
Suggested fix
```diff
 def write_data_to_file(output_file, data, split):
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for _, entry in tqdm(data.iterrows(), total=len(data), desc=f"Writing {output_file.name}"):
-            # Filter by category for category-specific splits
-            if split in HLE_REVERSE_MAP and entry["category"] != HLE_REVERSE_MAP[split]:
-                continue
-            # Filter by verified class for class-specific splits
-            if split in HLE_VERIFIED_CLASSES_REVERSE_MAP:
-                if entry["Verified_Classes"] != HLE_VERIFIED_CLASSES_REVERSE_MAP[split]:
-                    continue
-            if entry["image"]:
-                continue
-            # text split = text-only entries from Gold + Revision subsets only
-            if split == "text" and entry["Verified_Classes"] == HLE_VERIFIED_CLASSES_REVERSE_MAP["uncertain"]:
-                continue
-            json.dump(format_entry(entry), fout)
-            fout.write("\n")
+    records = []
+    for _, entry in tqdm(data.iterrows(), total=len(data), desc=f"Preparing {output_file.name}"):
+        if split in HLE_REVERSE_MAP and entry["category"] != HLE_REVERSE_MAP[split]:
+            continue
+        if split in HLE_VERIFIED_CLASSES_REVERSE_MAP and entry["Verified_Classes"] != HLE_VERIFIED_CLASSES_REVERSE_MAP[split]:
+            continue
+        if entry["image"]:
+            continue
+        if split == "text" and entry["Verified_Classes"] == HLE_VERIFIED_CLASSES_REVERSE_MAP["uncertain"]:
+            continue
+        records.append(format_entry(entry))
+
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for record in records:
+            json.dump(record, fout)
+            fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
```python
    Adapted from https://github.com/YangLabHKUST/UGPhysics/blob/main/codes/utils.py#L146
    """
    types = [t.strip() for t in answer_type.split(",")]
    descriptions = [OB_ANS_TYPE_ID2EN.get(t, t) for t in types]
```
Avoid fallback mapping for required answer type IDs.
Using .get(t, t) can silently accept unknown/invalid type IDs and produce inconsistent prompts.
Suggested fix
```diff
- descriptions = [OB_ANS_TYPE_ID2EN.get(t, t) for t in types]
+ descriptions = [OB_ANS_TYPE_ID2EN[t] for t in types]
```

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
```python
def save_data(data, split_name):
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    output_file = data_dir / f"{split_name}.jsonl"

    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout)
            fout.write("\n")
```
Precompute formatted records before writing JSONL.
Current flow can leave partial files if format_entry fails mid-write.
Suggested fix
```diff
 def save_data(data, split_name):
 @@
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout)
-            fout.write("\n")
+    records = [format_entry(entry) for entry in tqdm(data, desc=f"Preparing {output_file.name}")]
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for record in records:
+            json.dump(record, fout)
+            fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (3)
nemo_skills/dataset/ugphysics/prepare.py (2)
96-102: ⚠️ Potential issue | 🟠 Major

Precompute formatted records before writing to prevent partial file output.

If `format_entry` fails mid-iteration, the output file will contain partial data. Compute all records first, then write.

Suggested fix

```diff
 def save_data(data, output_path):
     output_path = Path(output_path)
     output_path.parent.mkdir(parents=True, exist_ok=True)
+    records = [format_entry(entry) for entry in tqdm(data, desc=f"Preparing {output_path.name}")]
     with open(output_path, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_path.name}"):
-            json.dump(format_entry(entry), fout)
+        for record in records:
+            json.dump(record, fout)
             fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/ugphysics/prepare.py` around lines 96 - 102, The save_data function writes entries as it formats them, risking a partially-written output if format_entry raises; change save_data to first compute and store all formatted records (e.g., build a list by calling format_entry for each entry) before opening output_path for writing, then open the file and iterate over the precomputed records to json.dump them; keep the existing output_path parent.mkdir call and tqdm usage but reference format_entry and save_data so you format everything up front to avoid partial file output.
55: ⚠️ Potential issue | 🟠 Major

Use direct dictionary access instead of `.get()` for required keys.

Using `.get(t, t)` silently falls back to the raw type ID if it's not in the mapping, which can produce inconsistent prompts. Direct access will fail with a clear error if an unknown type is encountered.

Suggested fix

```diff
- descriptions = [OB_ANS_TYPE_ID2EN.get(t, t) for t in types]
+ descriptions = [OB_ANS_TYPE_ID2EN[t] for t in types]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/ugphysics/prepare.py` at line 55, The list comprehension that builds descriptions uses OB_ANS_TYPE_ID2EN.get(t, t) which silently falls back to the raw type id; change it to use direct dictionary access OB_ANS_TYPE_ID2EN[t] so that missing keys raise an error (locate the expression that assigns descriptions from types and replace the .get(...) usage with direct indexing to enforce presence of expected keys).

nemo_skills/prompt/config/judge/ugphysics.yaml (1)
27: ⚠️ Potential issue | 🟡 Minor

Minor grammar improvement: "with" → "as".

The phrase "shares the same meaning with" should be "shares the same meaning as" for proper English grammar.

Suggested fix

```diff
- [Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]
+ [Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/prompt/config/judge/ugphysics.yaml` at line 27, Update the prompt text that reads "[Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]" to use correct grammar by replacing "with" with "as" so it reads "[Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]"; locate the exact string in the judge prompt (ugphysics.yaml) and apply this one-word substitution.
🧹 Nitpick comments (1)
nemo_skills/dataset/hle_verified/__init__.py (1)
22-27: Consider adding HLE-Verified to slurm tests.

Since this PR introduces a new dataset with evaluation/metrics logic, consider adding it to the slurm tests for comprehensive evaluation coverage. Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hle_verified/__init__.py` around lines 22 - 27, Add the new HLE-Verified dataset to the slurm test suite so its evaluation/metrics (defined by JUDGE_PIPELINE_ARGS and JUDGE_ARGS in nemo_skills.dataset.hle_verified.__init__) are exercised; update the slurm test configuration to include the dataset key/name "hle_verified" in the list of datasets to run, ensure the test runner picks up JUDGE_PIPELINE_ARGS and JUDGE_ARGS for that dataset, and add any required resources/permissions to the slurm test entry so the evaluation executes successfully.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 543c1511-0dda-46a9-bd08-c7aabfcd2781
📒 Files selected for processing (3)
- `nemo_skills/dataset/hle_verified/__init__.py`
- `nemo_skills/dataset/ugphysics/prepare.py`
- `nemo_skills/prompt/config/judge/ugphysics.yaml`
```python
# Some answers are not possible to compare symbolically, so have to use a judge model
# Setting openai judge by default, but can be overriden from command line for a locally hosted model
```
Minor typo: "overriden" → "overridden".
✏️ Proposed fix
```diff
 # Some answers are not possible to compare symbolically, so have to use a judge model
-# Setting openai judge by default, but can be overriden from command line for a locally hosted model
+# Setting openai judge by default, but can be overridden from command line for a locally hosted model
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/hle_verified/__init__.py` around lines 20 - 21, Fix the
minor typo in the comment: change "overriden" to "overridden" in the module
doc/comment where you mention overriding the openai judge (the sentence that
reads "Setting openai judge by default, but can be overriden from command line
for a locally hosted model").
```python
# limitations under the License.


# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
```
```python
DATASET_GROUP = "math"
```
```text
**Student Solution**:
{generation}


</physics solution>
```
extra space? Is this consistent with original prompt?
```python
DATASET_GROUP = "math"
METRICS_TYPE = "ugphysics"
GENERATION_ARGS = "++prompt_config=generic/ugphysics ++eval_type=math"
EVAL_SPLIT = "test"
```
this doesn't correspond to the data file created? Should this be one of the real options?
fixed (was changed after adding parser I guess)
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/prompt/config/judge/ugphysics.yaml (1)
27: ⚠️ Potential issue | 🟡 Minor

Use "same meaning as" in the rubric sentence.
Line 27 still reads “shares the same meaning with,” which is unidiomatic in a core instruction line.
✏️ Suggested fix

```diff
- [Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]
+ [Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/prompt/config/judge/ugphysics.yaml` at line 27, Replace the unidiomatic phrase "shares the same meaning with" in the rubric sentence with "same meaning as"; locate the line containing the bracketed instruction "[Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]" and update it to use "same meaning as" (e.g., "[Whether the student's answer has the same meaning as the reference answer. (TRUE or FALSE)]") so the core instruction reads naturally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/prompt/config/judge/ugphysics.yaml`:
- Around line 228-237: The justification wrongly claims the factor N is
negligible; instead update the justification in ugphysics.yaml to state that the
reference solution is using a unit-normalized probability density (i.e., it
omits the total-particle factor) or is ambiguous about whether A is per-particle
or for N particles, and therefore the student's inclusion of N is acceptable for
a normalization constant defined for a system with N particles; replace the
sentence that says the expressions are “equivalent apart from the factor of N”
with language clarifying the reference is incomplete/ambiguous about
normalization and that both forms are valid under different normalization
conventions for A.
---
Duplicate comments:
In `@nemo_skills/prompt/config/judge/ugphysics.yaml`:
- Line 27: Replace the unidiomatic phrase "shares the same meaning with" in the
rubric sentence with "same meaning as"; locate the line containing the bracketed
instruction "[Whether the student's answer shares the same meaning with the
reference answer. (TRUE or FALSE)]" and update it to use "same meaning as"
(e.g., "[Whether the student's answer has the same meaning as the reference
answer. (TRUE or FALSE)]") so the core instruction reads naturally.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f672198e-c21b-4187-9533-6985211c9638
📒 Files selected for processing (2)
- nemo_skills/dataset/ugphysics/__init__.py
- nemo_skills/prompt/config/judge/ugphysics.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/ugphysics/__init__.py
```
## Justification
The student’s answer,

\[
A = \frac{{N}}{{(2\pi m k T)^{{3/2}}}},
\]

is **physically correct**. The inclusion of \(N\) accounts for the total number of particles in the gas. In the context of the problem, \(A\) represents the normalization constant for the number density in momentum space, and the student correctly derived this value. While the reference solution omits \(N\) for simplicity (assuming a unit-normalized probability density), the student’s inclusion of \(N\) aligns with the interpretation of \(A\) as a normalization constant for a system with \(N\) particles.

Mathematically, both expressions are equivalent apart from the factor of \(N\), which is not essential to the physical interpretation in this context. Therefore, the student’s answer can be considered correct.
```
Don’t justify a positive label by treating an extra factor of N as negligible.
This example currently says the student's answer is acceptable because the two forms are “equivalent apart from the factor of N.” That rationale is unsafe: multiplicative factors change the quantity and can teach the judge to over-accept scaled answers in other problems. If you want to keep this as a TRUE example, the justification should instead explain that the extracted reference answer is incomplete/ambiguous for this question—not that N is generally ignorable.
✏️ Suggested rewrite
```diff
- is **physically correct**. The inclusion of \(N\) accounts for the total number of particles in the gas. In the context of the problem, \(A\) represents the normalization constant for the number density in momentum space, and the student correctly derived this value. While the reference solution omits \(N\) for simplicity (assuming a unit-normalized probability density), the student’s inclusion of \(N\) aligns with the interpretation of \(A\) as a normalization constant for a system with \(N\) particles.
-
- Mathematically, both expressions are equivalent apart from the factor of \(N\), which is not essential to the physical interpretation in this context. Therefore, the student’s answer can be considered correct.
+ is acceptable here because the problem statement asks for the number of atoms in the momentum-space element, which introduces an overall factor of \(N\). The extracted reference answer omits that factor, so the judge should rely on the full problem/reference-solution context here rather than treating multiplicative factors as generally unimportant.
+
+ The important point is that this is a reference-answer incompleteness case, not a generic "extra scale factors are still equivalent" case. Therefore, the student’s answer can be considered correct for this example.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/prompt/config/judge/ugphysics.yaml` around lines 228 - 237, The
justification wrongly claims the factor N is negligible; instead update the
justification in ugphysics.yaml to state that the reference solution is using a
unit-normalized probability density (i.e., it omits the total-particle factor)
or is ambiguous about whether A is per-particle or for N particles, and
therefore the student's inclusion of N is acceptable for a normalization
constant defined for a system with N particles; replace the sentence that says
the expressions are “equivalent apart from the factor of N” with language
clarifying the reference is incomplete/ambiguous about normalization and that
both forms are valid under different normalization conventions for A.
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
nemo_skills/evaluation/metrics/ugphysics_metrics.py (2)
**25-27**: **Pass `answer_key` through `MathMetrics.__init__`.**

This re-implements part of the parent initializer for no gain. Forwarding `answer_key` to `super()` keeps the subclass aligned with future `MathMetrics` changes.

Proposed refactor

```diff
 class UGPhysicsMetrics(MathMetrics):
     def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
-        super().__init__(compute_no_answer=compute_no_answer)
-        self.answer_key = answer_key
+        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)
```

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 25 - 27, The subclass UGPhysicsMetrics __init__ is reassigning answer_key instead of forwarding it to the parent; update UGPhysicsMetrics.__init__ to call super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key) (or the equivalent parameter name used by MathMetrics) and remove the direct self.answer_key assignment so the parent handles initialization and future changes to MathMetrics.__init__ are respected.
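To make the forwarding pattern from this nitpick concrete, here is a minimal standalone sketch. The `BaseMetrics`/`SubMetrics` names are hypothetical stand-ins, not the real `MathMetrics`/`UGPhysicsMetrics` classes:

```python
class BaseMetrics:
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
        self.compute_no_answer = compute_no_answer
        self.answer_key = answer_key


class SubMetrics(BaseMetrics):
    # Forward every argument instead of re-assigning attributes locally,
    # so any future logic added to BaseMetrics.__init__ is picked up automatically.
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)


m = SubMetrics(answer_key="judgement")
print(m.answer_key)  # -> judgement
```

If the base class later starts validating or transforming `answer_key`, the subclass inherits that behavior without any change.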
**24-51**: **Please add CI/slurm coverage for the new benchmark path.**

New dataset + metric additions like this tend to regress quietly unless they run in the default benchmark jobs. I'd wire UGPhysics, and ideally HLE-Verified too, into the benchmark CI/slurm coverage before merge.
Based on learnings, "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation" and "When adding new benchmarks, run GPU tests in CI locally, and ensure the dataset is included in default CI tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 24 - 51, The new UGPhysics metrics (class UGPhysicsMetrics, methods is_correct_judgement and get_incorrect_sample) aren’t covered by CI/slurm benchmark runs; add UGPhysics (and optionally HLE-Verified) to the benchmark CI and slurm test matrices so the metric logic runs in default jobs: update the CI/slurm benchmark job configuration to include the new dataset id in the list of datasets/executors used by the benchmark test suite, add a small smoke test entry (minimal shard or sample count) that runs the evaluation pipeline and invokes UGPhysicsMetrics, and ensure the GPU benchmark job(s) include the dataset so the metric code (is_correct_judgement/get_incorrect_sample) executes under CI; also add a brief integration test that runs the metric on one sample to catch regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py`:
- Around line 36-39: The current fallback uses re.finditer(r"\b(TRUE|FALSE)\b",
judgement, re.IGNORECASE) which can grab rubric text like "(TRUE or FALSE)";
instead scan judgement line-by-line and only accept a match if the entire line
is just the verdict (e.g., re.match(r"^\s*(TRUE|FALSE)\s*$", line,
re.IGNORECASE)), then return the last such full-line match; update the block
referencing true_false_matches, judgement, and re.finditer to perform this
line-based full-line check (or first strip rubric-like lines) so rubric
templates are not treated as real verdicts.
---
Nitpick comments:
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py`:
- Around line 25-27: The subclass UGPhysicsMetrics __init__ is reassigning
answer_key instead of forwarding it to the parent; update
UGPhysicsMetrics.__init__ to call
super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key) (or
the equivalent parameter name used by MathMetrics) and remove the direct
self.answer_key assignment so the parent handles initialization and future
changes to MathMetrics.__init__ are respected.
- Around line 24-51: The new UGPhysics metrics (class UGPhysicsMetrics, methods
is_correct_judgement and get_incorrect_sample) aren’t covered by CI/slurm
benchmark runs; add UGPhysics (and optionally HLE-Verified) to the benchmark CI
and slurm test matrices so the metric logic runs in default jobs: update the
CI/slurm benchmark job configuration to include the new dataset id in the list
of datasets/executors used by the benchmark test suite, add a small smoke test
entry (minimal shard or sample count) that runs the evaluation pipeline and
invokes UGPhysicsMetrics, and ensure the GPU benchmark job(s) include the
dataset so the metric code (is_correct_judgement/get_incorrect_sample) executes
under CI; also add a brief integration test that runs the metric on one sample
to catch regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: af9030c5-a6fb-4067-b36d-a732554390f2
📒 Files selected for processing (2)
- nemo_skills/evaluation/metrics/ugphysics_metrics.py
- nemo_skills/prompt/config/robustness/mcq_prompts/boxed_1.yaml
```python
# Fallback: look for standalone TRUE/FALSE (case-insensitive), use last match
true_false_matches = list(re.finditer(r"\b(TRUE|FALSE)\b", judgement, re.IGNORECASE))
if true_false_matches:
    return true_false_matches[-1].group(1).upper() == "TRUE"
```
Fallback parsing will treat the rubric text as a real verdict.
nemo_skills/prompt/config/judge/ugphysics.yaml:26-27 includes the literal string (TRUE or FALSE) under the ## Equivalence Judgement header. If a model echoes that template but never emits an actual judgement, this fallback grabs the last token and returns False, silently mis-scoring malformed outputs. Please restrict the fallback to verdict-only lines (or similarly strip the rubric text first).
Safer fallback example
```diff
-        true_false_matches = list(re.finditer(r"\b(TRUE|FALSE)\b", judgement, re.IGNORECASE))
+        true_false_matches = list(
+            re.finditer(r"^\s*(TRUE|FALSE)\s*$", judgement, re.IGNORECASE | re.MULTILINE)
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 36 - 39,
The current fallback uses re.finditer(r"\b(TRUE|FALSE)\b", judgement,
re.IGNORECASE) which can grab rubric text like "(TRUE or FALSE)"; instead scan
judgement line-by-line and only accept a match if the entire line is just the
verdict (e.g., re.match(r"^\s*(TRUE|FALSE)\s*$", line, re.IGNORECASE)), then
return the last such full-line match; update the block referencing
true_false_matches, judgement, and re.finditer to perform this line-based
full-line check (or first strip rubric-like lines) so rubric templates are not
treated as real verdicts.
…DIA/NeMo-Skills into gnalbandyan/ugph_hleVerified
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Add UGPhysics and HLE-Verified dataset support
Summary by CodeRabbit
New Features
Documentation
Tests