Conversation
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Adds two datasets (HLE-Verified, UGPhysics): docs entries; new dataset packages with prepare scripts and default evaluation configs; UGPhysics metric and registration; UGPhysics generation and judge prompts; small MCQ boxed prompt wording tweak.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Gen as Generation Pipeline
    participant Judge as Judge Service
    participant Metrics as Metrics/Evaluator
    participant Store as Results Store
    User->>Gen: request solution for UGPhysics question
    Gen->>User: generated solution (LaTeX + boxed answer)
    Gen->>Judge: submit reference + generated solution for equivalence
    Judge->>Gen: judgement report (Equivalence: TRUE/FALSE + justification)
    Gen->>Metrics: send prediction + judgement for scoring
    Metrics->>Store: record metric results / produce incorrect sample if needed
    Store-->>User: aggregated evaluation results
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: 1 passed, 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 8
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/ugphysics_metrics.py (1)
24-51: Consider adding UGPhysics to CI/slurm tests.

Since this introduces new evaluation metrics logic for the UGPhysics benchmark, ensure the dataset is included in default CI tests and consider adding it to slurm tests for comprehensive evaluation coverage. Based on learnings: "When adding new benchmarks, run GPU tests in CI locally, and ensure the dataset is included in default CI tests" and "When enabling new evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 24 - 51, New UGPhysics evaluation logic (class UGPhysicsMetrics with methods is_correct_judgement and get_incorrect_sample) needs to be exercised in CI and slurm tests; update the CI test matrix and slurm test definitions to include the UGPhysics dataset so these metrics run by default. Specifically, add the UGPhysics dataset/test target to the default CI unit/integration test suite and to the slurm GPU test list used for benchmark evaluations, ensure any required GPU job configuration and dataset download/setup steps are included, and add a small CI test that calls UGPhysicsMetrics.is_correct_judgement and get_incorrect_sample on representative examples to catch regressions locally before pushing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 10-17: The docs update added HLE-Verified and UGPhysics to the
table but omitted benchmark-specific eval examples and expected outcomes; add a
short subsection for each dataset (HLE-Verified and UGPhysics) in the same
evaluation document that shows (1) an example run command for the evaluation
script (matching the repo's eval CLI usage), (2) a minimal example input and the
expected output/answer format, and (3) the expected tested-model metrics or
baseline results (e.g., accuracy/EM or pass@k) and any evaluation split
(test/val) to be used; reference the dataset names HLE-Verified and UGPhysics so
readers can find these entries easily.
In `@nemo_skills/dataset/hle_verified/prepare.py`:
- Around line 56-58: The code silently defaults missing keys by using dict.get
during parsing; change the lambda in the loop that assigns df[field] (the one
using parsed.apply(lambda x, f=field: x.get(f))) to use direct key access (e.g.,
x[f]) so a KeyError is raised on schema drift, and add a small guard that
validates parsed is a dict and re-raises a clearer exception (with field name
and offending row/context) if not; apply the same direct-access fix to the
similar occurrence referenced at the other location (line ~74) so missing keys
fail fast rather than producing None values.
- Around line 78-94: The write_data_to_file function currently opens the output
file before applying filters/formatting which risks partially
written/overwritten files if processing fails; change it to first iterate over
data and collect the filtered/formatted JSON strings (using the same filter
logic referencing HLE_REVERSE_MAP, HLE_VERIFIED_CLASSES_REVERSE_MAP,
entry["image"], and the "text"/"uncertain" check and format_entry) into a list,
and only after successful completion open output_file for writing and dump each
precomputed string (one per line) to the file so no partial files are created on
error.
In `@nemo_skills/dataset/ugphysics/prepare.py`:
- Line 54: The list comprehension that builds descriptions uses
OB_ANS_TYPE_ID2EN.get(t, t) which silently accepts unknown type IDs; change this
to use direct lookup OB_ANS_TYPE_ID2EN[t] (or explicitly validate membership
first) so that missing/invalid answer type IDs raise an error rather than
producing incorrect fallbacks; update the code that constructs descriptions (the
variable descriptions, using types and OB_ANS_TYPE_ID2EN) to either perform an
explicit membership check like "if t not in OB_ANS_TYPE_ID2EN: raise
KeyError(...)" or replace .get with direct indexing OB_ANS_TYPE_ID2EN[t] to
enforce required keys.
- Around line 23-30: Fix typos in the answer-type description mapping in
prepare.py: change "inteval" to "interval" in the "IN" value, change "seperated"
to "separated" and "comma" to "commas" in the "TUP" value (so it reads "multiple
numbers, separated by commas, such as (x, y, z)"). Ensure you update the literal
mapping/dictionary that defines these codes so the corrected strings are used
when injecting prompts.
- Around line 95-103: The save_data function currently formats each entry during
file write, which can leave a partial JSONL if format_entry fails; change
save_data to first build a list of formatted records by calling format_entry for
every entry (e.g., formatted = [format_entry(e) for e in data]) and only after
that open the output file and write the precomputed formatted records (using
tqdm over the formatted list) so all computation completes before any file is
opened for writing; reference save_data and format_entry when making this
change.
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py`:
- Around line 37-40: The fallback regex checks for TRUE/FALSE use case-sensitive
re.search calls on the variable `judgement`, causing lowercase "true"/"false" to
be missed; update those checks (the two re.search calls that precede the returns
True/False) to perform case-insensitive matching (e.g., pass re.IGNORECASE to
re.search or normalize `judgement` to lower/upper before checking) so they
behave consistently with the earlier main pattern.
In `@nemo_skills/prompt/config/judge/ugphysics.yaml`:
- Line 13: Fix the typos and grammar in the judge prompt lines containing "B.
Consider Physiccal Equivalence" and the phrase "answer share" — change
"Physiccal" to "Physical" and change "answer share" to "answer shares" (and
review adjacent sentences for parallel grammar), updating the rubric text in the
ugphysics.yaml prompt entries that contain "B. Consider Physiccal Equivalence"
and the related line at the other occurrence so both instances match corrected
wording.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cc91c017-9f15-4ede-93ed-3c435b1ff72a
📒 Files selected for processing (9)
- `docs/evaluation/scientific-knowledge.md`
- `nemo_skills/dataset/hle_verified/__init__.py`
- `nemo_skills/dataset/hle_verified/prepare.py`
- `nemo_skills/dataset/ugphysics/__init__.py`
- `nemo_skills/dataset/ugphysics/prepare.py`
- `nemo_skills/evaluation/metrics/map_metrics.py`
- `nemo_skills/evaluation/metrics/ugphysics_metrics.py`
- `nemo_skills/prompt/config/generic/ugphysics.yaml`
- `nemo_skills/prompt/config/judge/ugphysics.yaml`
| **[HLE-Verified](https://huggingface.co/datasets/skylenage/HLE-Verified)** | 2,500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | gold+revision text only |
| **[GPQA ](https://huggingface.co/datasets/Idavidrein/gpqa)** | 448 (main)<br>198 (diamond)<br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| **[SuperGPQA](https://huggingface.co/datasets/m-a-p/SuperGPQA)** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test |
| **[MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| **[SciCode](https://huggingface.co/datasets/SciCode1/SciCode)** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val |
| **[FrontierScience](https://huggingface.co/datasets/openai/frontierscience)** | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| **[Physics](https://huggingface.co/datasets/desimfj/PHYSICS)** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
| **[UGPhysics](https://huggingface.co/datasets/UGPhysics/ugphysics)** | 5,520 (EN), 5,520 (ZH) | Open-ended MCQ | Physics | No | EN |
Add benchmark-specific eval examples and expected results for the new datasets.
HLE-Verified and UGPhysics were added to the overview, but this update should also include example run commands and expected tested-model outcomes for these two benchmarks.
As per coding guidelines: "When adding new benchmarks, add it to the corresponding place in the documentation with example commands for running evaluation and expected results for tested models".
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 11-11: Spaces inside link text
(MD039, no-space-in-links)
```python
parsed = df["json"].apply(json.loads)
for field in ("author_name", "rationale", "answer_type", "canary", "image"):
    df[field] = parsed.apply(lambda x, f=field: x.get(f))
```
Fail fast on schema drift instead of silently defaulting.
Using .get() here can quietly produce invalid records (None or unmapped labels) and hide upstream data changes.
Suggested fix
```diff
- parsed = df["json"].apply(json.loads)
- for field in ("author_name", "rationale", "answer_type", "canary", "image"):
-     df[field] = parsed.apply(lambda x, f=field: x.get(f))
+ parsed = df["json"].apply(json.loads)
+ for field in ("author_name", "rationale", "answer_type", "canary", "image"):
+     df[field] = parsed.apply(lambda x, f=field: x[f])
@@
- "verified_class": HLE_VERIFIED_CLASSES_MAP.get(entry["Verified_Classes"], entry["Verified_Classes"]),
+ "verified_class": HLE_VERIFIED_CLASSES_MAP[entry["Verified_Classes"]],
```

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
Also applies to: 74-74
```python
def write_data_to_file(output_file, data, split):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for _, entry in tqdm(data.iterrows(), total=len(data), desc=f"Writing {output_file.name}"):
            # Filter by category for category-specific splits
            if split in HLE_REVERSE_MAP and entry["category"] != HLE_REVERSE_MAP[split]:
                continue
            # Filter by verified class for class-specific splits
            if split in HLE_VERIFIED_CLASSES_REVERSE_MAP:
                if entry["Verified_Classes"] != HLE_VERIFIED_CLASSES_REVERSE_MAP[split]:
                    continue
            if entry["image"]:
                continue
            # text split = text-only entries from Gold + Revision subsets only
            if split == "text" and entry["Verified_Classes"] == HLE_VERIFIED_CLASSES_REVERSE_MAP["uncertain"]:
                continue
            json.dump(format_entry(entry), fout)
            fout.write("\n")
```
Do transformation/filtering before opening the output file.
If formatting/filtering fails mid-loop, current flow can leave partially written files and overwrite valid prior outputs.
Suggested fix
```diff
 def write_data_to_file(output_file, data, split):
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for _, entry in tqdm(data.iterrows(), total=len(data), desc=f"Writing {output_file.name}"):
-            # Filter by category for category-specific splits
-            if split in HLE_REVERSE_MAP and entry["category"] != HLE_REVERSE_MAP[split]:
-                continue
-            # Filter by verified class for class-specific splits
-            if split in HLE_VERIFIED_CLASSES_REVERSE_MAP:
-                if entry["Verified_Classes"] != HLE_VERIFIED_CLASSES_REVERSE_MAP[split]:
-                    continue
-            if entry["image"]:
-                continue
-            # text split = text-only entries from Gold + Revision subsets only
-            if split == "text" and entry["Verified_Classes"] == HLE_VERIFIED_CLASSES_REVERSE_MAP["uncertain"]:
-                continue
-            json.dump(format_entry(entry), fout)
-            fout.write("\n")
+    records = []
+    for _, entry in tqdm(data.iterrows(), total=len(data), desc=f"Preparing {output_file.name}"):
+        if split in HLE_REVERSE_MAP and entry["category"] != HLE_REVERSE_MAP[split]:
+            continue
+        if split in HLE_VERIFIED_CLASSES_REVERSE_MAP and entry["Verified_Classes"] != HLE_VERIFIED_CLASSES_REVERSE_MAP[split]:
+            continue
+        if entry["image"]:
+            continue
+        if split == "text" and entry["Verified_Classes"] == HLE_VERIFIED_CLASSES_REVERSE_MAP["uncertain"]:
+            continue
+        records.append(format_entry(entry))
+
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for record in records:
+            json.dump(record, fout)
+            fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
```python
    Adapted from https://github.com/YangLabHKUST/UGPhysics/blob/main/codes/utils.py#L146
    """
    types = [t.strip() for t in answer_type.split(",")]
    descriptions = [OB_ANS_TYPE_ID2EN.get(t, t) for t in types]
```
Avoid fallback mapping for required answer type IDs.
Using .get(t, t) can silently accept unknown/invalid type IDs and produce inconsistent prompts.
Suggested fix
```diff
- descriptions = [OB_ANS_TYPE_ID2EN.get(t, t) for t in types]
+ descriptions = [OB_ANS_TYPE_ID2EN[t] for t in types]
```

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".
```python
def save_data(data, split_name):
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    output_file = data_dir / f"{split_name}.jsonl"

    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout)
            fout.write("\n")
```
Precompute formatted records before writing JSONL.
Current flow can leave partial files if format_entry fails mid-write.
Suggested fix
```diff
 def save_data(data, split_name):
 @@
-    with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout)
-            fout.write("\n")
+    records = [format_entry(entry) for entry in tqdm(data, desc=f"Preparing {output_file.name}")]
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for record in records:
+            json.dump(record, fout)
+            fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (3)
nemo_skills/dataset/ugphysics/prepare.py (2)
96-102: ⚠️ Potential issue | 🟠 Major

Precompute formatted records before writing to prevent partial file output.

If `format_entry` fails mid-iteration, the output file will contain partial data. Compute all records first, then write.

Suggested fix

```diff
 def save_data(data, output_path):
     output_path = Path(output_path)
     output_path.parent.mkdir(parents=True, exist_ok=True)
+    records = [format_entry(entry) for entry in tqdm(data, desc=f"Preparing {output_path.name}")]
     with open(output_path, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_path.name}"):
-            json.dump(format_entry(entry), fout)
+        for record in records:
+            json.dump(record, fout)
             fout.write("\n")
```

As per coding guidelines: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails".

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/ugphysics/prepare.py` around lines 96 - 102, The save_data function writes entries as it formats them, risking a partially-written output if format_entry raises; change save_data to first compute and store all formatted records (e.g., build a list by calling format_entry for each entry) before opening output_path for writing, then open the file and iterate over the precomputed records to json.dump them; keep the existing output_path parent.mkdir call and tqdm usage but reference format_entry and save_data so you format everything up front to avoid partial file output.
55: ⚠️ Potential issue | 🟠 Major

Use direct dictionary access instead of `.get()` for required keys.

Using `.get(t, t)` silently falls back to the raw type ID if it's not in the mapping, which can produce inconsistent prompts. Direct access will fail with a clear error if an unknown type is encountered.

Suggested fix

```diff
- descriptions = [OB_ANS_TYPE_ID2EN.get(t, t) for t in types]
+ descriptions = [OB_ANS_TYPE_ID2EN[t] for t in types]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/ugphysics/prepare.py` at line 55, The list comprehension that builds descriptions uses OB_ANS_TYPE_ID2EN.get(t, t) which silently falls back to the raw type id; change it to use direct dictionary access OB_ANS_TYPE_ID2EN[t] so that missing keys raise an error (locate the expression that assigns descriptions from types and replace the .get(...) usage with direct indexing to enforce presence of expected keys).

nemo_skills/prompt/config/judge/ugphysics.yaml (1)
27: ⚠️ Potential issue | 🟡 Minor

Minor grammar improvement: "with" → "as".

The phrase "shares the same meaning with" should be "shares the same meaning as" for proper English grammar.

Suggested fix

```diff
- [Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]
+ [Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/prompt/config/judge/ugphysics.yaml` at line 27, Update the prompt text that reads "[Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]" to use correct grammar by replacing "with" with "as" so it reads "[Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]"; locate the exact string in the judge prompt (ugphysics.yaml) and apply this one-word substitution.
🧹 Nitpick comments (1)
nemo_skills/dataset/hle_verified/__init__.py (1)
22-27: Consider adding HLE-Verified to slurm tests.

Since this PR introduces a new dataset with evaluation/metrics logic, consider adding it to the slurm tests for comprehensive evaluation coverage. Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/hle_verified/__init__.py` around lines 22 - 27, Add the new HLE-Verified dataset to the slurm test suite so its evaluation/metrics (defined by JUDGE_PIPELINE_ARGS and JUDGE_ARGS in nemo_skills.dataset.hle_verified.__init__) are exercised; update the slurm test configuration to include the dataset key/name "hle_verified" in the list of datasets to run, ensure the test runner picks up JUDGE_PIPELINE_ARGS and JUDGE_ARGS for that dataset, and add any required resources/permissions to the slurm test entry so the evaluation executes successfully.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 543c1511-0dda-46a9-bd08-c7aabfcd2781
📒 Files selected for processing (3)
- `nemo_skills/dataset/hle_verified/__init__.py`
- `nemo_skills/dataset/ugphysics/prepare.py`
- `nemo_skills/prompt/config/judge/ugphysics.yaml`
```python
# Some answers are not possible to compare symbolically, so have to use a judge model
# Setting openai judge by default, but can be overriden from command line for a locally hosted model
```
Minor typo: "overriden" → "overridden".
✏️ Proposed fix
```diff
 # Some answers are not possible to compare symbolically, so have to use a judge model
-# Setting openai judge by default, but can be overriden from command line for a locally hosted model
+# Setting openai judge by default, but can be overridden from command line for a locally hosted model
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/hle_verified/__init__.py` around lines 20 - 21, Fix the
minor typo in the comment: change "overriden" to "overridden" in the module
doc/comment where you mention overriding the openai judge (the sentence that
reads "Setting openai judge by default, but can be overriden from command line
for a locally hosted model").
```python
# limitations under the License.


# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
```
```python
DATASET_GROUP = "math"
```
```text
**Student Solution**:
{generation}


</physics solution>
```
extra space? Is this consistent with original prompt?
```python
DATASET_GROUP = "math"
METRICS_TYPE = "ugphysics"
GENERATION_ARGS = "++prompt_config=generic/ugphysics ++eval_type=math"
EVAL_SPLIT = "test"
```
this doesn't correspond to the data file created? Should this be one of the real options?
fixed (was changed after adding parser I guess)
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Actionable comments posted: 1
♻️ Duplicate comments (1)
nemo_skills/prompt/config/judge/ugphysics.yaml (1)
27: ⚠️ Potential issue | 🟡 Minor

Use "same meaning as" in the rubric sentence.
Line 27 still reads “shares the same meaning with,” which is unidiomatic in a core instruction line.
✏️ Suggested fix

```diff
- [Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]
+ [Whether the student's answer shares the same meaning as the reference answer. (TRUE or FALSE)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/prompt/config/judge/ugphysics.yaml` at line 27, Replace the unidiomatic phrase "shares the same meaning with" in the rubric sentence with "same meaning as"; locate the line containing the bracketed instruction "[Whether the student's answer shares the same meaning with the reference answer. (TRUE or FALSE)]" and update it to use "same meaning as" (e.g., "[Whether the student's answer has the same meaning as the reference answer. (TRUE or FALSE)]") so the core instruction reads naturally.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/prompt/config/judge/ugphysics.yaml`:
- Around line 228-237: The justification wrongly claims the factor N is
negligible; instead update the justification in ugphysics.yaml to state that the
reference solution is using a unit-normalized probability density (i.e., it
omits the total-particle factor) or is ambiguous about whether A is per-particle
or for N particles, and therefore the student's inclusion of N is acceptable for
a normalization constant defined for a system with N particles; replace the
sentence that says the expressions are “equivalent apart from the factor of N”
with language clarifying the reference is incomplete/ambiguous about
normalization and that both forms are valid under different normalization
conventions for A.
---
Duplicate comments:
In `@nemo_skills/prompt/config/judge/ugphysics.yaml`:
- Line 27: Replace the unidiomatic phrase "shares the same meaning with" in the
rubric sentence with "same meaning as"; locate the line containing the bracketed
instruction "[Whether the student's answer shares the same meaning with the
reference answer. (TRUE or FALSE)]" and update it to use "same meaning as"
(e.g., "[Whether the student's answer has the same meaning as the reference
answer. (TRUE or FALSE)]") so the core instruction reads naturally.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f672198e-c21b-4187-9533-6985211c9638
📒 Files selected for processing (2)
- nemo_skills/dataset/ugphysics/__init__.py
- nemo_skills/prompt/config/judge/ugphysics.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/dataset/ugphysics/__init__.py
```
## Justification
The student’s answer,

\[
A = \frac{{N}}{{(2\pi m k T)^{{3/2}}}},
\]

is **physically correct**. The inclusion of \(N\) accounts for the total number of particles in the gas. In the context of the problem, \(A\) represents the normalization constant for the number density in momentum space, and the student correctly derived this value. While the reference solution omits \(N\) for simplicity (assuming a unit-normalized probability density), the student’s inclusion of \(N\) aligns with the interpretation of \(A\) as a normalization constant for a system with \(N\) particles.

Mathematically, both expressions are equivalent apart from the factor of \(N\), which is not essential to the physical interpretation in this context. Therefore, the student’s answer can be considered correct.
```
Don’t justify a positive label by treating an extra factor of N as negligible.
This example currently says the student's answer is acceptable because the two forms are “equivalent apart from the factor of N.” That rationale is unsafe: multiplicative factors change the quantity and can teach the judge to over-accept scaled answers in other problems. If you want to keep this as a TRUE example, the justification should instead explain that the extracted reference answer is incomplete/ambiguous for this question—not that N is generally ignorable.
✏️ Suggested rewrite
```diff
- is **physically correct**. The inclusion of \(N\) accounts for the total number of particles in the gas. In the context of the problem, \(A\) represents the normalization constant for the number density in momentum space, and the student correctly derived this value. While the reference solution omits \(N\) for simplicity (assuming a unit-normalized probability density), the student’s inclusion of \(N\) aligns with the interpretation of \(A\) as a normalization constant for a system with \(N\) particles.
-
- Mathematically, both expressions are equivalent apart from the factor of \(N\), which is not essential to the physical interpretation in this context. Therefore, the student’s answer can be considered correct.
+ is acceptable here because the problem statement asks for the number of atoms in the momentum-space element, which introduces an overall factor of \(N\). The extracted reference answer omits that factor, so the judge should rely on the full problem/reference-solution context here rather than treating multiplicative factors as generally unimportant.
+
+ The important point is that this is a reference-answer incompleteness case, not a generic "extra scale factors are still equivalent" case. Therefore, the student’s answer can be considered correct for this example.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/prompt/config/judge/ugphysics.yaml` around lines 228 - 237, The
justification wrongly claims the factor N is negligible; instead update the
justification in ugphysics.yaml to state that the reference solution is using a
unit-normalized probability density (i.e., it omits the total-particle factor)
or is ambiguous about whether A is per-particle or for N particles, and
therefore the student's inclusion of N is acceptable for a normalization
constant defined for a system with N particles; replace the sentence that says
the expressions are “equivalent apart from the factor of N” with language
clarifying the reference is incomplete/ambiguous about normalization and that
both forms are valid under different normalization conventions for A.
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Actionable comments posted: 1
🧹 Nitpick comments (2)
nemo_skills/evaluation/metrics/ugphysics_metrics.py (2)
**25-27**: **Pass `answer_key` through `MathMetrics.__init__`.**

This re-implements part of the parent initializer for no gain. Forwarding `answer_key` to `super()` keeps the subclass aligned with future `MathMetrics` changes.

Proposed refactor

```diff
 class UGPhysicsMetrics(MathMetrics):
     def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
-        super().__init__(compute_no_answer=compute_no_answer)
-        self.answer_key = answer_key
+        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)
```

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks, use self-explanatory code over comments, avoid complicated type interfaces with unions, and keep naming consistent with existing conventions".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 25 - 27, The subclass UGPhysicsMetrics __init__ is reassigning answer_key instead of forwarding it to the parent; update UGPhysicsMetrics.__init__ to call super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key) (or the equivalent parameter name used by MathMetrics) and remove the direct self.answer_key assignment so the parent handles initialization and future changes to MathMetrics.__init__ are respected.
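To make the forwarding pattern from this nitpick concrete, here is a minimal standalone sketch. The `BaseMetrics`/`SubMetrics` names are hypothetical stand-ins, not the real `MathMetrics`/`UGPhysicsMetrics` classes:

```python
class BaseMetrics:
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
        self.compute_no_answer = compute_no_answer
        self.answer_key = answer_key


class SubMetrics(BaseMetrics):
    # Forward every argument instead of re-assigning attributes locally,
    # so any future logic added to BaseMetrics.__init__ is picked up automatically.
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)


m = SubMetrics(answer_key="judgement")
print(m.answer_key)  # -> judgement
```

If the base class later starts validating or transforming `answer_key`, the subclass inherits that behavior without any change.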
**24-51**: **Please add CI/slurm coverage for the new benchmark path.**

New dataset + metric additions like this tend to regress quietly unless they run in the default benchmark jobs. I'd wire UGPhysics, and ideally HLE-Verified too, into the benchmark CI/slurm coverage before merge.
Based on learnings, "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation" and "When adding new benchmarks, run GPU tests in CI locally, and ensure the dataset is included in default CI tests".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 24 - 51, The new UGPhysics metrics (class UGPhysicsMetrics, methods is_correct_judgement and get_incorrect_sample) aren’t covered by CI/slurm benchmark runs; add UGPhysics (and optionally HLE-Verified) to the benchmark CI and slurm test matrices so the metric logic runs in default jobs: update the CI/slurm benchmark job configuration to include the new dataset id in the list of datasets/executors used by the benchmark test suite, add a small smoke test entry (minimal shard or sample count) that runs the evaluation pipeline and invokes UGPhysicsMetrics, and ensure the GPU benchmark job(s) include the dataset so the metric code (is_correct_judgement/get_incorrect_sample) executes under CI; also add a brief integration test that runs the metric on one sample to catch regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py`:
- Around line 36-39: The current fallback uses re.finditer(r"\b(TRUE|FALSE)\b",
judgement, re.IGNORECASE) which can grab rubric text like "(TRUE or FALSE)";
instead scan judgement line-by-line and only accept a match if the entire line
is just the verdict (e.g., re.match(r"^\s*(TRUE|FALSE)\s*$", line,
re.IGNORECASE)), then return the last such full-line match; update the block
referencing true_false_matches, judgement, and re.finditer to perform this
line-based full-line check (or first strip rubric-like lines) so rubric
templates are not treated as real verdicts.
---
Nitpick comments:
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py`:
- Around line 25-27: The subclass UGPhysicsMetrics __init__ is reassigning
answer_key instead of forwarding it to the parent; update
UGPhysicsMetrics.__init__ to call
super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key) (or
the equivalent parameter name used by MathMetrics) and remove the direct
self.answer_key assignment so the parent handles initialization and future
changes to MathMetrics.__init__ are respected.
- Around line 24-51: The new UGPhysics metrics (class UGPhysicsMetrics, methods
is_correct_judgement and get_incorrect_sample) aren’t covered by CI/slurm
benchmark runs; add UGPhysics (and optionally HLE-Verified) to the benchmark CI
and slurm test matrices so the metric logic runs in default jobs: update the
CI/slurm benchmark job configuration to include the new dataset id in the list
of datasets/executors used by the benchmark test suite, add a small smoke test
entry (minimal shard or sample count) that runs the evaluation pipeline and
invokes UGPhysicsMetrics, and ensure the GPU benchmark job(s) include the
dataset so the metric code (is_correct_judgement/get_incorrect_sample) executes
under CI; also add a brief integration test that runs the metric on one sample
to catch regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: af9030c5-a6fb-4067-b36d-a732554390f2
📒 Files selected for processing (2)
- nemo_skills/evaluation/metrics/ugphysics_metrics.py
- nemo_skills/prompt/config/robustness/mcq_prompts/boxed_1.yaml
```python
# Fallback: look for standalone TRUE/FALSE (case-insensitive), use last match
true_false_matches = list(re.finditer(r"\b(TRUE|FALSE)\b", judgement, re.IGNORECASE))
if true_false_matches:
    return true_false_matches[-1].group(1).upper() == "TRUE"
```
Fallback parsing will treat the rubric text as a real verdict.
nemo_skills/prompt/config/judge/ugphysics.yaml:26-27 includes the literal string (TRUE or FALSE) under the ## Equivalence Judgement header. If a model echoes that template but never emits an actual judgement, this fallback grabs the last token and returns False, silently mis-scoring malformed outputs. Please restrict the fallback to verdict-only lines (or similarly strip the rubric text first).
Safer fallback example
```diff
-        true_false_matches = list(re.finditer(r"\b(TRUE|FALSE)\b", judgement, re.IGNORECASE))
+        true_false_matches = list(
+            re.finditer(r"^\s*(TRUE|FALSE)\s*$", judgement, re.IGNORECASE | re.MULTILINE)
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/metrics/ugphysics_metrics.py` around lines 36 - 39,
The current fallback uses re.finditer(r"\b(TRUE|FALSE)\b", judgement,
re.IGNORECASE) which can grab rubric text like "(TRUE or FALSE)"; instead scan
judgement line-by-line and only accept a match if the entire line is just the
verdict (e.g., re.match(r"^\s*(TRUE|FALSE)\s*$", line, re.IGNORECASE)), then
return the last such full-line match; update the block referencing
true_false_matches, judgement, and re.finditer to perform this line-based
full-line check (or first strip rubric-like lines) so rubric templates are not
treated as real verdicts.
…DIA/NeMo-Skills into gnalbandyan/ugph_hleVerified
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Add UGPhysics and HLE-Verified dataset support
Summary by CodeRabbit
New Features
Documentation
Tests