Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```python
dataset = load_dataset("desimfj/PHYSICS")["test"]
eng_data = [entry for entry in dataset if entry["language"] == "en"]
ch_data = [entry for entry in dataset if entry["language"] == "zh"]
full_data = eng_data + ch_data

for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
    save_data(split_data, split_name)
```
**EN/ZH split filenames swapped**

In the final loop, `zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"])` writes English examples to `test.jsonl`, Chinese examples to `zh.jsonl`, and the combined set to `en_zh.jsonl`. That makes `test` effectively EN-only, which contradicts the naming in the docs/config (the EN default is called `test`). If `test` is intended to be the full test split, this is wrong; if it is intended to be EN-only, rename `test` to `en` (or update the dataset defaults/docs) to avoid consumers accidentally evaluating the wrong language split.
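One way to make the naming unambiguous — a minimal sketch, assuming `save_data(data, name)` writes `<name>.jsonl` as in `prepare.py` (entries and the `save_data` stub here are placeholders for illustration):

```python
# Hypothetical rename: key each split by language so "test" cannot be
# mistaken for the full test set. save_data is stubbed for illustration.
def save_data(split_data, split_name):
    print(f"{split_name}.jsonl: {len(split_data)} entries")

eng_data = [{"language": "en"}]   # placeholder entries
ch_data = [{"language": "zh"}]
full_data = eng_data + ch_data

splits = {"en": eng_data, "zh": ch_data, "en_zh": full_data}
for split_name, split_data in splits.items():
    save_data(split_data, split_name)
```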
📝 Walkthrough

Adds a new Physics dataset package (data prep and config), physics-specific metrics and judge/prompt configs, and updates scientific-knowledge docs to a compact dataset table and revised evaluation examples.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Prep as DataPrep
    participant Storage as Dataset (JSONL)
    participant Evaluator as PhysicsMetrics
    participant Judge as JudgePrompt
    participant Model as JudgeModel
    User->>Prep: load_dataset(DESIMFJ Physics)
    Prep->>Prep: strip_boxed / process_answer / format_entry
    Prep->>Storage: write JSONL splits (en, zh, en_zh)
    User->>Evaluator: submit prediction(s)
    Evaluator->>Judge: craft judge prompt (problem, generation, expected_answer)
    Judge->>Model: send prompt to judge model/server
    Model-->>Judge: return judgement ([Correct]/[Incorrect])
    Judge-->>Evaluator: judgement text
    Evaluator-->>User: score dict (judge_correct, metrics)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 10-13: Remove the invalid HTML break tags in the markdown table
and following code block by replacing all occurrences of "<br>" and "</br>" with
valid markdown-friendly breaks (e.g., use "<br/>" if HTML breaks are required,
or convert to plain newlines/Markdown line breaks) in the rows containing
"**GPQA**", "**SciCode**" and the surrounding table lines (also fix the similar
instances around lines 40-40 referenced). Ensure table cells remain properly
formatted after the change.
- Around line 5-17: Add a new "Physics benchmark" subsection below the "Dataset
Overview" table that targets the "Physics" row: include the example evaluation
command (CLI or script) to run the benchmark, the expected baseline
results/metrics to compare against, model-testing details (prompt format,
scoring/judging rules and any automated judge used), and dataset-specific notes
describing the EN/ZH splits and how to select the EN split for evaluation;
reference the "Physics" dataset name from the table and ensure the subsection
succinctly documents command, expected results, model testing, and
dataset-specific setup (EN/ZH selection and judge configuration).
In `@nemo_skills/dataset/physics/__init__.py`:
- Around line 15-18: The inline comment next to METRICS_TYPE is incorrect:
update the comment to reflect that METRICS_TYPE = "physics" uses the
PhysicsMetrics class (not MathMetrics) and still sets compute_no_answer=False;
modify the comment on the METRICS_TYPE line accordingly and ensure surrounding
constants DATASET_GROUP, METRICS_TYPE, and GENERATION_ARGS remain unchanged.
In `@nemo_skills/dataset/physics/prepare.py`:
- Line 68: Change the zip call to enforce that the two iterables have identical
lengths by adding strict=True to the zip invocation used in the loop over
eng_data, ch_data, full_data and ["test", "zh", "en_zh"], i.e., update the for
loop that binds split_data and split_name so zip(..., strict=True) is used
instead of a plain zip to ensure mismatched lengths raise an error.
- Around line 29-32: Add strict=True to the zip(...) invocation used to pair the
parallel lists in this module — locate the zip call in this file (the one that
pairs items when building examples, adjacent to process_answer) and change
zip(a, b) to zip(a, b, strict=True) so mismatched lengths raise immediately;
ensure the call site where the pairing logic is implemented is updated (the zip
used inside the example-building function in this file).
In `@nemo_skills/prompt/config/generic/physics.yaml`:
- Around line 2-5: The YAML prompt contains a typo in the third rule: change the
word "seperated" to the correct spelling "separate" in the instruction that
reads "If there are multiple final answers, please seperated them by commas in
\\boxed{{}}"; update that sentence so it reads "If there are multiple final
answers, please separate them by commas in \\boxed{{}}", keeping the surrounding
LaTeX guidance and formatting intact (file:
nemo_skills/prompt/config/generic/physics.yaml; locate the rule text containing
"seperated").
In `@nemo_skills/prompt/config/judge/physics.yaml`:
- Around line 16-17: The prompt string fragment "Question: {problem}, Output
sentence: {generation}, Correct answer: {expected_answer}, Judge- ment:" is
broken by a hyphenated line break; fix it by merging the split token into a
single word and line so it reads "Judgement:" (i.e., replace "Judge- ment:" with
"Judgement:") ensuring the entire prompt line is contiguous: "Question:
{problem}, Output sentence: {generation}, Correct answer: {expected_answer},
Judgement:".
```markdown
## Dataset Overview

### hle

- Benchmark is defined in [`nemo_skills/dataset/hle/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hle/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/cais/hle).
- The `text` split includes all non-image examples. It is further divided into `eng`, `chem`, `bio`, `cs`, `phy`, `math`, `human`, `other`. Currently, **all** of these splits contain only text data.

### SimpleQA

- Benchmark is defined in [`nemo_skills/dataset/simpleqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/simpleqa/__init__.py)
- Original benchmark source code for SimpleQA (OpenAI) is [here](https://github.com/openai/simple-evals/) and the leaderboard is [here](https://www.kaggle.com/benchmarks/openai/simpleqa). An improved version with 1,000 examples from Google, SimpleQA-verified, is [here](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified).
- To use SimpleQA-verified, set `split=verified`. To use the original version of SimpleQA, please set `split=test`.

In the below configurations, we also use `gpt-oss-120b` as the judge model.

#### Configuration: `gpt-oss-120b` with builtin tool (python)

| Dataset | Questions | Types | Domain | Images? | NS default | Link |
|:---|:---:|:---:|:---|:---:|:---:|:---:|
| **HLE** | 2500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only | [HF](https://huggingface.co/datasets/cais/hle) |
| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
| **SuperGPQA** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test | [HF](https://huggingface.co/datasets/m-a-p/SuperGPQA) |
| **MMLU-Pro** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test | [HF](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) |
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
| **FrontierScience** | 100 | Short-answer | Physics, Chemistry, Biology | No | all | [HF](https://huggingface.co/datasets/openai/frontierscience) |
| **Physics** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN | [HF](https://huggingface.co/datasets/desimfj/PHYSICS) |
| **MMLU** | 14,042 | MCQ (4) | Multiple Subjects | No | test | [HF](https://huggingface.co/datasets/cais/mmlu) |
| **SimpleQa** | 4,326 (test), 1,000 (verified) | Open ended | Factuality, Parametric knowledge | No | verified | [HF](https://github.com/openai/simple-evals/) |
```
Add Physics benchmark details (command, expected results, model testing, dataset-specific notes).
The table introduces Physics but there’s no physics-specific example command, expected results, or dataset notes (e.g., EN/ZH splits, judge setup). Please add a short subsection covering these items.
As per coding guidelines: When adding new benchmarks, add documentation with example commands, expected results, model testing details, and dataset-specific information.
```markdown
| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
| **SuperGPQA** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test | [HF](https://huggingface.co/datasets/m-a-p/SuperGPQA) |
| **MMLU-Pro** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test | [HF](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) |
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
```
**Fix invalid `<br>` tags in the table and after the code block.**

🛠️ Suggested fix

```diff
-| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
+| **GPQA** | 448 (main)<br>198 (diamond)<br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
-| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
+| **SciCode** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
-</br>
+<br>
```
Also applies to: 40-40
```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
```
Comment mismatch: METRICS_TYPE uses PhysicsMetrics, not MathMetrics.
🛠️ Suggested fix

```diff
-METRICS_TYPE = "physics" # This uses the MathMetrics class, but with compute_no_answer=False
+METRICS_TYPE = "physics" # Uses PhysicsMetrics (compute_no_answer defaults to False)
```
```python
def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}"""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"
```
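For reference, a self-contained sketch of the flattening this excerpt performs; `strip_boxed` below is a simplified stand-in for the real helper in `prepare.py`:

```python
def strip_boxed(s: str) -> str:
    # simplified stand-in: unwrap a single \boxed{...} wrapper if present
    prefix, suffix = "\\boxed{", "}"
    if s.startswith(prefix) and s.endswith(suffix):
        return s[len(prefix):-len(suffix)]
    return s

def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}"""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"

# Schema is List[List[str]]: one inner list per sub-question.
print(process_answer([["\\boxed{-60}"], ["\\boxed{4}"]]))  # \boxed{-60, 4}
```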
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, let's view the full prepare.py file to understand the context
find . -type f -name "prepare.py" -path "*/physics/*" | head -5

# List files in the physics dataset directory
fd "prepare.py" --path "*physics*"

# Search for the exact file
find . -type f -name "prepare.py" | grep physics
```
🌐 Web query: HuggingFace dataset desimfj/PHYSICS schema answer field format

💡 Result: In desimfj/PHYSICS, the `answer` field is a nested list: `List[List[str]]` (the HF viewer shows it as a sequence of sequences). Each outer list element corresponds to one sub-question, and the inner list contains one or more accepted answer strings (typically LaTeX, often wrapped with `\\boxed{...}`).[1]

Example patterns from the dataset:[1]

- Single sub-question: `answer = [["\\boxed{3}"]]`
- Two sub-questions: `answer = [["\\boxed{-60}"], ["\\boxed{4}"]]`
- MCQ: `answer = [["\\boxed{B}"]]`
🏁 Scripts executed:

```shell
# Search for references to this dataset or answer processing
rg "desimfj/PHYSICS" --type py

# Look for load_dataset calls related to physics
rg "load_dataset.*physics" --type py -i

# Read the full prepare.py file to see the context around line 68
cat -n ./nemo_skills/dataset/physics/prepare.py

# Check for Python version requirements
find . -name "setup.py" -o -name "pyproject.toml" -o -name ".python-version" | xargs cat 2>/dev/null | head -50

# Also check for requirements files
find . -name "requirements*.txt" | xargs head -5 2>/dev/null | head -30

# Search for python_requires in setup files
rg "python_requires|python =" --type py --type toml | head -10
```
🏁 Script executed:

```shell
# Verify zip behavior and confirm both lists have the same length
python3 << 'EOF'
# Simulating the actual data
eng_data = [1, 2, 3]  # placeholder for actual data
ch_data = [1, 2, 3]
full_data = [1, 2, 3]
split_names = ["test", "zh", "en_zh"]
split_datas = [eng_data, ch_data, full_data]

# Check lengths match
print(f"Length of split_datas: {len(split_datas)}")
print(f"Length of split_names: {len(split_names)}")
print(f"Balanced: {len(split_datas) == len(split_names)}")

# Test that zip(strict=True) would work (Python 3.10+)
result = list(zip(split_datas, split_names, strict=True))
print(f"zip(strict=True) works: {len(result)} pairs")
EOF
```
**Add `strict=True` to the `zip()` call at line 68 for safety.**

The nested comprehension in `process_answer()` is correct—the dataset schema confirms `answer` is `List[List[str]]`, and the code handles it properly with no risk of data corruption. However, line 68 should use `zip(..., strict=True)` to catch potential future bugs if the list lengths ever become unequal.

Fix for line 68:

```diff
-    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
```
```python
ch_data = [entry for entry in dataset if entry["language"] == "zh"]
full_data = eng_data + ch_data

for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Check Python version requirements
rg -n "python_requires|requires-python|python_version" -S -i
head -50 pyproject.toml

# Check for Ruff configuration and the B905 rule
find . -maxdepth 2 -type f \( -name "ruff.toml" -o -name ".ruff.toml" -o -name "pyproject.toml" \) -exec echo "File: {}" \; -exec cat {} \;

# Verify the actual zip call and surrounding context
sed -n '60,75p' nemo_skills/dataset/physics/prepare.py
```
**Reconsider the `strict=` parameter value for alignment with project philosophy.**

Ruff B905 requires an explicit `strict=` parameter in `zip()`. However, the suggested fix of `strict=False` contradicts the project's coding guidelines about explicit error handling and letting errors surface rather than silently failing. Since both iterables are guaranteed to have exactly 3 elements each, use `strict=True` instead to enforce that expectation:

```diff
-    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
```
🧰 Tools: 🪛 Ruff (0.14.14)

> [warning] 68-68: `zip()` without an explicit `strict=` parameter. Add an explicit value for parameter `strict=` (B905)
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```
Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judge-
ment:
```
Incomplete prompt: the text is cut off at "Judge-ment:" instead of a proper instruction.

```diff
-Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judge-
-ment:
+Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}. Judgement:
```
```python
DATASET_GROUP = "math"
METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
EVAL_SPLIT = "test"
```
EVAL_SPLIT = "test" creates naming confusion - per prepare.py:68, test contains only EN examples (1000), but doc table (line 15) says default is "EN" with 1,000 examples, suggesting alignment. However, the file labeling is confusing: test.jsonl = EN-only, zh.jsonl = ZH-only, en_zh.jsonl = combined. Consider renaming test to en for clarity or update docs to explicitly state that "test" = "EN-only split"
```diff
@@ -1,214 +1,92 @@
-# Scientific knowledge
+# Scientific Knowledge
```
please fix these issues reported by mkdocs

```
INFO - Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#hle'.
INFO - Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#scicode'.
INFO - Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#gpqa'.
INFO - Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#hle'.
INFO - Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#scicode'.
INFO - Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#gpqa'.
```
also the table is a bit too wide - have to scroll through. Maybe we can reorganize to reduce number of columns? E.g. link can just be fused into the first column. And if we also remove images (can just add a footnote maybe for hle), then it's going to fit
Fixed these. For the images column: we plan to add multimodal data soon, that's why it's there.
```python
"++parse_reasoning=True "
'\'++end_reasoning_string="<|start|>assistant<|channel|>final<|message|>"\''
"++inference.temperature=1.0 ++inference.top_p=1.0 "
"++inference.tokens_to_generate=131072 ++inference.extra_body.skip_special_tokens=false "
```
do we need ++inference.extra_body.skip_special_tokens=false ?
```markdown
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
| **FrontierScience** | 100 | Short-answer | Physics, Chemistry, Biology | No | all | [HF](https://huggingface.co/datasets/openai/frontierscience) |
| **Physics** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN | [HF](https://huggingface.co/datasets/desimfj/PHYSICS) |
| **MMLU** | 14,042 | MCQ (4) | Multiple Subjects | No | test | [HF](https://huggingface.co/datasets/cais/mmlu) |
```
the table seems to have fewer datasets than in the original docs. E.g. mmlu-redux is missing? Also scicode section had some useful extra details which are good to keep?
Added mmlu-redux; the SciCode note was about gpt-oss, which is now removed.
```python
server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
benchmarks="gpqa:4",
output_dir="/workspace/Nano_V3_evals"
```
Can we add expected results to all of these commands? You can use mkdocs dropdowns / tabs to make it use fewer space, e.g. can have a toggle per benchmark / evaluation mode or something. But having reference numbers is useful
```yaml
user: |-
  Below is an open-ended problem in Physics. Please answer this problem adhering to the following rules:
  1. Please use LaTeX format to represent the variables and formulas used in the solution process and results.
  2. Please put the final answer(s) in \\boxed{{}}, note that the unit of the answer should not be included in \\boxed{{}}.
```
you most likely want this to be \boxed, not \\boxed. With |- syntax yaml doesn't need \ escaping, so this will render as \\
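A small check of that escaping point, simulated with plain Python strings (no YAML parser needed), under the assumption that the template is later passed through `str.format` — which is why the config doubles the braces:

```python
# In a YAML |- literal block, backslashes pass through untouched, so
# "\\boxed{{}}" in the file reaches Python as two backslash characters.
# After str.format collapses {{}} to {}, the model would see \\boxed{}.
yaml_literal = r"\\boxed{{}}"   # what the config currently contains
intended = r"\boxed{}"          # what the prompt should contain

rendered = yaml_literal.format()
print(rendered)  # \\boxed{} — the doubled backslash leaks into the prompt
```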
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```markdown
| **[MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| **[SciCode](https://huggingface.co/datasets/SciCode1/SciCode)** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val |
| **[FrontierScience](https://huggingface.co/datasets/openai/frontierscience)** | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| **[Physics](https://huggingface.co/datasets/desimfj/PHYSICS)** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
```
Documentation table states default is "EN", but __init__.py:19 uses EVAL_SPLIT = "test" which maps to EN-only split per prepare.py:68. While technically aligned (both refer to 1,000 EN examples), consider clarifying by either updating table from "EN" to "test" for consistency with code, or renaming test.jsonl to en.jsonl in prepare.py:68 and updating EVAL_SPLIT = "en" for better semantic clarity. Current naming creates confusion since test typically implies the full test set, not a language-specific subset.
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```yaml
  2. Mathematical Problems: If the formats differ but the answers are mathematically equivalent, respond with [Correct].
  3. Explicit Options: If the question provides explicit candidate answers, the output will be considered correct if it clearly indicates the correct option's code or the correct option's content.
  4. No Explicit Options: If the question does not provide explicit options, the output must align with the correct answer in content and meaning to be considered [Correct].
  Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
```
Missing space after "Judgement:" - judge will append its response directly without separation
```diff
-Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
+Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement: 
```

(The only change is the trailing space after "Judgement:".)
```python
class PhysicsMetrics(MathMetrics):
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
```
you can skip the init if it's identical to parent class
@Kipok the defaults are different. I could remove it and use a partial in map_metrics.py, but it looks more explicit and straightforward to me having the defaults here.
Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
```python
def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
    super().__init__(compute_no_answer=compute_no_answer)
    self.answer_key = answer_key
```
Incorrect super().__init__ args
PhysicsMetrics.__init__ accepts answer_key but doesn’t pass it to MathMetrics.__init__, so MathMetrics.question_key/answer_key stay at their defaults (problem/predicted_answer). This will break evaluation when predictions use the expected generation key (e.g., pass@k/majority@k will look up predicted_answer and raise KeyError). Pass answer_key (and any non-default question_key if needed) through to super().__init__ instead of only setting self.answer_key.
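A minimal sketch of the suggested fix, with `MathMetrics` stubbed in here as a stand-in for the real nemo_skills class (only the relevant defaults are modeled):

```python
# Stand-in for the real MathMetrics, keeping only the relevant defaults.
class MathMetrics:
    def __init__(self, compute_no_answer=True, answer_key="predicted_answer"):
        self.compute_no_answer = compute_no_answer
        self.answer_key = answer_key

class PhysicsMetrics(MathMetrics):
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
        # forward answer_key so parent lookups use the right field
        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)

m = PhysicsMetrics()
print(m.answer_key)  # generation
```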
```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
EVAL_SPLIT = "test"
```
Wrong dataset group/type
DATASET_GROUP = "math" and GENERATION_ARGS sets ++eval_type=math, but this PR introduces a physics-specific prompt/metrics. Using the math group/type here can route PHYSICS runs through the wrong dataset category/config defaults and can select the wrong evaluation pipeline settings.
If this benchmark is meant to show up under scientific knowledge (per docs) and be evaluated with the physics metrics, the dataset metadata should be consistent with that (group + eval_type).
```yaml
user: |-
  You are a diligent and precise assistant tasked with evaluating the correctness of responses. You will receive a question, an output sentence, and the correct answer. Your task is to determine if the output sentence accurately answers the question based on the provided correct answer. Respond with either [Correct] or [Incorrect].
  Special considerations:
  1. Multiple Answers: If the output contains multiple answers, evaluate whether later answers modify or correct earlier ones. In such cases, compare the final answer with the correct answer. If the final answer is unclear or incorrect, respond with [Incorrect].
  2. Mathematical Problems: If the formats differ but the answers are mathematically equivalent, respond with [Correct].
  3. Explicit Options: If the question provides explicit candidate answers, the output will be considered correct if it clearly indicates the correct option's code or the correct option's content.
  4. No Explicit Options: If the question does not provide explicit options, the output must align with the correct answer in content and meaning to be considered [Correct].
  Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
```
Judge prompt unparseable
The prompt ends with a stray `"]` after `Judgement:`. This will be included in the model input and makes the expected output format ambiguous; it also looks like an accidental truncation/quoting artifact. Remove the extra characters so the judge sees a clean instruction ending at `Judgement:` (with an appropriate trailing space/newline).
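As a sketch, the cleaned-up tail of the judge prompt (assuming the surrounding YAML stays as quoted above) would simply end at `Judgement:` with nothing after it:

```yaml
user: |-
  ...
  Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
```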
| user: |- | ||
| Below is an open-ended problem in Physics. Please answer this problem adhering to the following rules: | ||
| 1. Please use LaTeX format to represent the variables and formulas used in the solution process and results. | ||
| 2. Please put the final answer(s) in \boxed{{}}, note that the unit of the answer should not be included in \boxed{{}}. | ||
| 3. If there are multiple final answers, please seperated them by commas in \boxed{{}}, e.g., \boxed{{answer 1, answer 2}}. |
Prompt typo affects instruction
Rule 3 says “please seperated them by commas” (typo). This gets copied into the model’s instructions and can hurt prompt clarity / consistency across benchmarks. Fix to “separate”.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 77-82: The multiline string passed into ctx=wrap_arguments(...)
contains an unintended blank line (a stray newline) between the
"++parse_reasoning=True " and "++tool_modules=..." lines which triggers the
markdownlint indented-code-block warning; edit the argument to wrap_arguments
(the ctx=wrap_arguments(...) call) and remove the blank line so the
configuration lines are contiguous within the string (no extra empty line),
preserving existing spacing and quotes.
In `@nemo_skills/evaluation/metrics/physics_metrics.py`:
- Around line 29-39: The return type hint for is_correct_judgement is incorrect
because it can return None when return_none=True or the judgement format is
unrecognized; update the signature of is_correct_judgement to reflect
Optional[bool] (or Union[bool, None]) and add the necessary typing import (e.g.,
Optional) to the module so the annotation matches behavior and aligns with
utils.is_correct_judgement.
🧹 Nitpick comments (3)
nemo_skills/evaluation/metrics/math_metrics.py (1)
84-86: Signature mismatch with subclass override.
`PhysicsMetrics.is_correct_judgement` adds a `return_none` parameter that this base method doesn't accept. While not breaking today (callers don't pass `return_none`), this inconsistency could cause a `TypeError` if someone calls the method polymorphically with `return_none=True` on a `MathMetrics` instance. Consider adding `return_none: bool = False` here too for a consistent interface.

Suggested fix:

```diff
-    def is_correct_judgement(self, judgement: str) -> bool:
+    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool:
         """Check if the judgement is correct."""
-        return is_correct_judgement(judgement)
+        return is_correct_judgement(judgement, return_none=return_none)
```

nemo_skills/evaluation/metrics/physics_metrics.py (1)
25-27: Pass `answer_key` through to `super().__init__` instead of overriding after the fact.

`MathMetrics.__init__` already accepts `answer_key`. Passing it through avoids the redundant set-then-override pattern:

Suggested fix:

```diff
     def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
-        super().__init__(compute_no_answer=compute_no_answer)
-        self.answer_key = answer_key
+        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)
```

docs/index.md (1)
20-20: Consider mentioning `physics` in the example list and adding anchor links for consistency.

This PR adds the Physics benchmark, but the example list here says "hle, scicode, gpqa" without mentioning physics. Also, every other category line links individual benchmarks to their doc anchors, while this one uses plain text.
| "++inference.temperature=0.6 ++inference.top_p=0.95 " | ||
| "++inference.tokens_to_generate=131072 " | ||
| "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True " | ||
| "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] " | ||
| |
| ), |
Stray blank line inside function call.
Line 81 is blank inside the `ctx=wrap_arguments(...)` string, which causes the markdownlint indented-code-block warning and looks unintentional in the example.
Suggested fix:

```diff
             "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
             "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
-
         ),
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "++inference.temperature=0.6 ++inference.top_p=0.95 " | |
| "++inference.tokens_to_generate=131072 " | |
| "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True " | |
| "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] " | |
| ), | |
| "++inference.temperature=0.6 ++inference.top_p=0.95 " | |
| "++inference.tokens_to_generate=131072 " | |
| "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True " | |
| "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] " | |
| ), |
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 79-79: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
🤖 Prompt for AI Agents
In `@docs/evaluation/scientific-knowledge.md` around lines 77 - 82, The multiline
string passed into ctx=wrap_arguments(...) contains an unintended blank line (a
stray newline) between the "++parse_reasoning=True " and "++tool_modules=..."
lines which triggers the markdownlint indented-code-block warning; edit the
argument to wrap_arguments (the ctx=wrap_arguments(...) call) and remove the
blank line so the configuration lines are contiguous within the string (no extra
empty line), preserving existing spacing and quotes.
| def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool: | ||
| """Parse physics judgement that returns [Correct] or [Incorrect].""" | ||
| if judgement: | ||
| # Look for [Correct] or [Incorrect] patterns (case insensitive) | ||
| if re.search(r"\[correct\]", judgement, re.IGNORECASE): | ||
| return True | ||
| elif re.search(r"\[incorrect\]", judgement, re.IGNORECASE): | ||
| return False | ||
| |
| # improper judgement format, so have to judge as false | ||
| return None if return_none else False |
Return type hint `-> bool` is inaccurate: the method can return `None`.

When `return_none=True` and the judgement format is unrecognized, this returns `None`. The hint should reflect that:
```diff
-    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool:
+    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool | None:
```
Note: the same inaccuracy exists in `utils.is_correct_judgement` (which uses `Union[bool, None]` correctly in its signature), so this would bring them into alignment.
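For reference, a self-contained sketch of the parser with the corrected annotation, mirroring the snippet quoted above (`Optional[bool]` is equivalent to `bool | None`):

```python
import re
from typing import Optional


def is_correct_judgement(judgement: str, return_none: bool = False) -> Optional[bool]:
    """Parse a physics judgement that should contain [Correct] or [Incorrect]."""
    if judgement:
        # Look for [Correct] or [Incorrect] patterns (case insensitive).
        if re.search(r"\[correct\]", judgement, re.IGNORECASE):
            return True
        if re.search(r"\[incorrect\]", judgement, re.IGNORECASE):
            return False
    # Improper judgement format: optionally surface it as None instead of False.
    return None if return_none else False
```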
🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/metrics/physics_metrics.py` around lines 29 - 39, The
return type hint for is_correct_judgement is incorrect because it can return
None when return_none=True or the judgement format is unrecognized; update the
signature of is_correct_judgement to reflect Optional[bool] (or Union[bool,
None]) and add the necessary typing import (e.g., Optional) to the module so the
annotation matches behavior and aligns with utils.is_correct_judgement.
- a5da597 Revert "Eval kit support (#1239)" (#1294) (Igor Gitman, Mar 6, 2026)
- b237e33 Eval kit support (#1239) (George, Mar 6, 2026)
- dc28bbf Python direct tool calling without MCP (#1286) (George Armstrong, Mar 5, 2026)
- 12454dd Allow het servers for nemo-rl jobs (#1223) (Sadegh Mahdavi, Mar 4, 2026)
- 8884a68 Support source_lang param for translation recipe (#1290) (Prasoon Varshney, Mar 4, 2026)
- 4618b19 Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285) (Meriem B., Mar 4, 2026)
- 5ac8609 Add SPEED-Bench (within repo) (#1279) (Talor Abramovich, Mar 4, 2026)
- c31eec5 Fix os.getlogin() crash in ns setup (#1289) (George Armstrong, Mar 3, 2026)
- c228e66 Fix streaming TypeError when delta.content is None (#1267) (#1288) (George Armstrong, Mar 3, 2026)
- aa47923 Add LibTrace recipe for generating domain-specific reasoning data (#1224) (Matvei Novikov, Mar 2, 2026)
- 313cad7 fix: clean parse-failure retries in prover (#1284) (Stephen Ge, Mar 2, 2026)
- 813cfa3 tst: rollback inference-api to integrate (#1287) (George Armstrong, Mar 2, 2026)
- 31735f9 Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250) (Valentin Mendelev, Mar 2, 2026)
- d4ef8c0 Update promt_config to working with openai format + inline setup (#1210) (George, Feb 27, 2026)
- e879cbc Update noc tutorial (#1282) (George Armstrong, Feb 27, 2026)
- f6e3505 Add noc reasoning tutorial (#1278) (George Armstrong, Feb 27, 2026)
- fc2072a CritPt generation add prompt_format=None (#1280) (Jiacheng Xu, Feb 27, 2026)
- c8abe5d New slurm customization parameters (account, containers) (#1209) (Igor Gitman, Feb 27, 2026)
- 2b38cce Add nemo-skills-core subpackage for lightweight installs (#1229) (George Armstrong, Feb 25, 2026)
- 9fa8e83 feat: add custom judge type support for external repo integration (#1274) (Dheeraj Peri, Feb 25, 2026)
- 8a32b13 Exclude numb3rs form test_eval.py (#1275) (Igor Gitman, Feb 24, 2026)
- 6da2219 Numb3rs ds addition (#1174) (George, Feb 23, 2026)
- ad034b5 Add DSBench-DA evaluation (#1254) (Suriya Gunasekar, Feb 22, 2026)
- 7593ab3 Add CritPt benchmark (#1200) (Jiacheng Xu, Feb 20, 2026)
- 58c31b2 Fix no_answer metric overcounting in _compute_pass_at_k (#1245) (Suriya Gunasekar, Feb 20, 2026)
- 1f1a2e7 Fix incorrect prompt tokens count due to HF api update (#1264) (Igor Gitman, Feb 20, 2026)
- 8ebc6f5 Remove deprecated dataset group (#1263) (Igor Gitman, Feb 20, 2026)
- ea4177f fix deps (#1258) (Yongqiang Wang, Feb 19, 2026)
- 60905a7 Add aime26 (#1256) (Minho Ryu, Feb 20, 2026)
- b28afc5 Rename custom -> external benchmarks (#1262) (Igor Gitman, Feb 19, 2026)
- 6cc9c45 Add reference to internal benchmarks repo (#1261) (Igor Gitman, Feb 19, 2026)
- 5202af6 Remove incorrect presence-penalty setting (#1259) (Igor Gitman, Feb 19, 2026)
- 144c70b Adding an option to store benchmarks in external repo (#1240) (Igor Gitman, Feb 19, 2026)
- 10e6e39 update vllm miltimodal for api calls convenience (#1213) (George, Feb 19, 2026)
- 1ba4219 Fix --server_container not being applied to dependent jobs (#1244) (Nick Ludwig, Feb 18, 2026)
- 9517614 Support mini-swe-agent as agent harness (#1212) (Wasi Ahmad, Feb 16, 2026)
- a3d44dc Add --installation_command support to prepare_data (#1243) (Suriya Gunasekar, Feb 13, 2026)
- e80d524 Fix CI disk space for Docker image builds (#1241) (George Armstrong, Feb 12, 2026)
- d22236c Fix answerbench prompt parsing (#1235) (Sadegh Mahdavi, Feb 11, 2026)
- 2401628 feat: add lockfiles for reproducible sandbox builds (#1233) (George Armstrong, Feb 11, 2026)
- 5a0a84d removing datasets version restriction for LCB eval (#1230) (Wasi Ahmad, Feb 11, 2026)
- ef0a890 Gnalbandyan/add physics (#1214) (gnalbandyan, Feb 11, 2026)
- bd9d30c LCB generic prompting (#1215) (Wasi Ahmad, Feb 10, 2026)
- 7d6c49a Add support for different variations of nemo-rl (#1220) (Sadegh Mahdavi, Feb 7, 2026)
- b19ba96 Add multi-node sandbox support for SLURM clusters (#1218) (George Armstrong, Feb 6, 2026)
- 8950bb0 support structured outputs in hle judge for optional AA compatibility (#1186) (anowaczynski-nvidia, Feb 7, 2026)
- b84f7a2 A small update on running tests docs (#1219) (Igor Gitman, Feb 6, 2026)
- 8e838e1 feat: add flag to disable sandbox replay (#1217) (George Armstrong, Feb 5, 2026)
- 5fd9085 Add an option to limit number of tool calls (#1216) (Igor Gitman, Feb 5, 2026)
- d820200 Add arena-hard v2 (#1205) (Igor Gitman, Feb 3, 2026)
- a30920e Fix mkdocs warnings (#1204) (Igor Gitman, Feb 2, 2026)
- 19d7788 Fix infinite wait in sandbox.wait_for_sandbox (#1206) (Ivan, Feb 2, 2026)
- 3e65fbf Improve tts (#1203) (Sadegh Mahdavi, Jan 30, 2026)
- 250c862 SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202) (Nick Ludwig, Jan 30, 2026)
- 7ded756 Add proper token counting to code execution model (#1184) (Ivan, Jan 30, 2026)
- b986304 Upgrade containers (#1198) (Igor Gitman, Jan 29, 2026)
- 3b44f02 Fix incorrect string format (#1199) (Dan Lord, Jan 29, 2026)
- c4854b8 Update nemo-rl to latest (#1087) (Sadegh Mahdavi, Jan 29, 2026)
Added PHYSICS benchmark, updated Scientific Knowledge documentation page
Summary by CodeRabbit
Documentation
New Features