Merged
Changes from all commits
2 changes: 1 addition & 1 deletion docs/evaluation/index.md
@@ -5,7 +5,7 @@ We support many popular benchmarks and it's easy to add new in the future. The f
- [**Math (natural language**)](./natural-math.md): e.g. [aime24](./natural-math.md#aime24), [aime25](./natural-math.md#aime25), [hmmt_feb25](./natural-math.md#hmmt_feb25)
- [**Math (formal language)**](./formal-math.md): e.g. [minif2f](./formal-math.md#minif2f), [proofnet](./formal-math.md#proofnet), [putnam-bench](./formal-math.md#putnam-bench)
- [**Code**](./code.md): e.g. [swe-bench](./code.md#swe-bench), [livecodebench](./code.md#livecodebench), [bird](./code.md#bird)
- [**Scientific knowledge**](./scientific-knowledge.md): e.g., [hle](./scientific-knowledge.md#hle), [scicode](./scientific-knowledge.md#scicode), [gpqa](./scientific-knowledge.md#gpqa)
- [**Scientific knowledge**](./scientific-knowledge.md): e.g., hle, scicode, gpqa.
- [**Instruction following**](./instruction-following.md): e.g. [ifbench](./instruction-following.md#ifbench), [ifeval](./instruction-following.md#ifeval)
- [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr), [aalcr](./long-context.md#aalcr)
- [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
227 changes: 53 additions & 174 deletions docs/evaluation/scientific-knowledge.md
@@ -1,214 +1,93 @@
# Scientific knowledge
# Scientific Knowledge
**Collaborator:** please fix these issues reported by mkdocs:

```
INFO    -  Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#hle'.
INFO    -  Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#scicode'.
INFO    -  Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#gpqa'.
INFO    -  Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#hle'.
INFO    -  Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#scicode'.
INFO    -  Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#gpqa'.
```

**Collaborator:** also the table is a bit too wide (have to scroll through). Maybe we can reorganize to reduce the number of columns? E.g. the link can just be fused into the first column, and if we also remove images (maybe just add a footnote for hle), then it will fit.

**Collaborator (author):** fixed these; the images column is there because we plan to add multimodal data soon.


More details are coming soon!
NeMo-Skills can be used to evaluate an LLM on various STEM datasets.

## Supported benchmarks
## Dataset Overview

### hle

- Benchmark is defined in [`nemo_skills/dataset/hle/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hle/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/cais/hle).
- The `text` split includes all non-image examples. It is further divided into `eng`, `chem`, `bio`, `cs`, `phy`, `math`, `human`, `other`. Currently, **all** of these splits contain only text data.

### SimpleQA

- Benchmark is defined in [`nemo_skills/dataset/simpleqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/simpleqa/__init__.py)
- Original benchmark source code for SimpleQA (OpenAI) is [here](https://github.com/openai/simple-evals/) and the leaderboard is [here](https://www.kaggle.com/benchmarks/openai/simpleqa). An improved version with 1,000 examples from Google, SimpleQA-verified, is [here](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified).
- To use SimpleQA-verified, set `split=verified`. To use the original version of SimpleQA, set `split=test`.

In the configurations below, we also use `gpt-oss-120b` as the judge model.

#### Configuration: `gpt-oss-120b` with builtin tool (python)
| Dataset | Questions | Types | Domain | Images? | NS default |
|:---|:---:|:---:|:---|:---:|:---:|
| **[HLE](https://huggingface.co/datasets/cais/hle)** | 2,500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only |
| **[GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)** | 448 (main)<br>198 (diamond)<br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| **[SuperGPQA](https://huggingface.co/datasets/m-a-p/SuperGPQA)** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test |
| **[MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| **[SciCode](https://huggingface.co/datasets/SciCode1/SciCode)** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val |
| **[FrontierScience](https://huggingface.co/datasets/openai/frontierscience)** | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| **[Physics](https://huggingface.co/datasets/desimfj/PHYSICS)** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
**Contributor:** The documentation table states the default is "EN", but `__init__.py:19` uses `EVAL_SPLIT = "test"`, which maps to the EN-only split per `prepare.py:68`. While technically aligned (both refer to 1,000 EN examples), consider either updating the table from "EN" to "test" for consistency with the code, or renaming `test.jsonl` to `en.jsonl` in `prepare.py:68` and setting `EVAL_SPLIT = "en"` for better semantic clarity. The current naming is confusing, since `test` typically implies the full test set, not a language-specific subset.

| **[MMLU](https://huggingface.co/datasets/cais/mmlu)** | 14,042 | MCQ (4) | Multiple subjects | No | test |
| **[MMLU-Redux](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux)** | 5,385 | MCQ (4) | Multiple subjects | No | test |
| **[SimpleQA](https://github.com/openai/simple-evals/)** | 4,326 (test), 1,000 (verified) | Open ended | Factuality, parametric knowledge | No | verified |


## Evaluate `NVIDIA-Nemotron-3-Nano` on an MCQ dataset

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
)
```

**Collaborator:** Can we add expected results to all of these commands? You can use mkdocs dropdowns / tabs to make them use less space, e.g. a toggle per benchmark / evaluation mode. But having reference numbers is useful.
## Evaluate `NVIDIA-Nemotron-3-Nano` using LLM-as-a-judge

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="hle:4",
    output_dir="/workspace/Nano_V3_evals",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
    extra_judge_args="++chat_template_kwargs.reasoning_effort=high ++inference.temperature=1.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=120000 ",
)
```

!!! note

The module name for `reasoning-parser` differs across `vllm` versions. Depending on your version, it might appear as `openai_gptoss` or `GptOss`. In the latest main branch, it is named `openai_gptoss`. You can verify this in [gptoss_reasoning_parser.py](https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/gptoss_reasoning_parser.py) and confirm which version your environment uses.

#### Result

We also tested a variant where the full generation output was provided to the judge, i.e. with `parse_reasoning` disabled. This configuration, labeled `simpleqa-gpt-oss-120b-tool-full-generation`, produced results nearly identical to the standard setup, where the reasoning portion is excluded from the judge's input.



| Run Name | pass@1 | majority@2 | pass@2 |
|:----------------------------------------------|-----------:|-------------:|----------:|
| simpleqa-gpt-oss-120b-notool | 12.93 | 12.93 | 17.22 |
| simpleqa-gpt-oss-120b-tool-full-generation | 80.30 | 80.30 | 84.78 |
| simpleqa-gpt-oss-120b-tool-output-only | 79.51 | 79.51 | 83.74 |

The reported number for `simpleqa-gpt-oss-120b-notool` is 13.1% according to this [kaggle page](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified).

### FrontierScience-Olympiad

- Benchmark is defined in [`nemo_skills/dataset/frontierscience-olympiad/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/frontierscience-olympiad/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openai/frontierscience).
- Contains 100 short-answer questions crafted by international science olympiad medalists across physics, chemistry, and biology.
- Available splits: `physics`, `chemistry`, `biology`, and `all` (all subjects combined, default).

## Evaluate `NVIDIA-Nemotron-3-Nano` on an MCQ dataset using tools

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
ctx=wrap_arguments(
        "++inference.temperature=0.6 ++inference.top_p=0.95 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),

    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32 --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
    with_sandbox=True,
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
)
```


#### Configuration: `gpt-oss-120b` without tool

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
ctx=wrap_arguments(
"++inference.temperature=1.0 ++inference.tokens_to_generate=65536 "
"++inference.extra_body.reasoning_effort=high"
),
cluster="slurm",
expname="ghn-model_gpt_oss_120b",
model="openai/gpt-oss-120b",
server_type="vllm",
server_gpus=8,
server_args="--async-scheduling",
benchmarks="frontierscience-olympiad:20",
split="all",
num_chunks=1,
output_dir="/workspace/frontierscience-ghn-model_gpt_oss_120b",
wandb_project="frontier",
wandb_name="frontierscience-ghn-model_gpt_oss_120b",
judge_model="openai/gpt-oss-120b",
judge_server_type="vllm",
judge_server_gpus=8,
judge_server_args="--async-scheduling",
)
```

#### Result

| Run Name | pass@1 | majority@8 | pass@8 |
|:------------------------------------------|---------:|-------------:|---------:|
| gpt-oss-20b (no tool) | 49.74 | 47.00 | 71.98 |
| gpt-oss-20b (with python tool) | 36.94 | 37.38 | 73.61 |
| gpt-oss-120b (no tool) | 60.53 | 61.13 | 79.25 |
| gpt-oss-120b (with python tool) | 54.05 | 53.00 | 80.07 |

### SuperGPQA

- Benchmark is defined in [`nemo_skills/dataset/supergpqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/supergpqa/__init__.py)
- Original benchmark source is available in the [SuperGPQA repository](https://github.com/SuperGPQA/SuperGPQA). The official leaderboard is listed on the [SuperGPQA dataset page](https://supergpqa.github.io/#Dataset).
- The `science` split contains all the data where the discipline is "Science". The default full split is `test`.

### scicode

!!! note

For scicode by default we evaluate on the combined dev + test split (containing 80 problems and 338 subtasks) for consistency with [AAI evaluation methodology](https://artificialanalysis.ai/methodology/intelligence-benchmarking). If you want to only evaluate on the test set, use `--split=test`.

- Benchmark is defined in [`nemo_skills/dataset/scicode/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/scicode/__init__.py)
- Original benchmark source is [here](https://github.com/scicode-bench/SciCode).

### gpqa

- Benchmark is defined in [`nemo_skills/dataset/gpqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/gpqa/__init__.py)
- Original benchmark source is [here](https://github.com/idavidrein/gpqa).

### mmlu-pro

- Benchmark is defined in [`nemo_skills/dataset/mmlu-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmlu-pro/__init__.py)
- Original benchmark source is [here](https://github.com/TIGER-AI-Lab/MMLU-Pro).

### mmlu

- Benchmark is defined in [`nemo_skills/dataset/mmlu/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmlu/__init__.py)
- Original benchmark source is [here](https://github.com/hendrycks/test).

### mmlu-redux

- Benchmark is defined in [`nemo_skills/dataset/mmlu-redux/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmlu-redux/__init__.py)
- Original benchmark source is [here](https://github.com/aryopg/mmlu-redux).
2 changes: 1 addition & 1 deletion docs/index.md
@@ -17,7 +17,7 @@ Here are some of the features we support:
- [**Math (natural language**)](./evaluation/natural-math.md): e.g. [aime24](./evaluation/natural-math.md#aime24), [aime25](./evaluation/natural-math.md#aime25), [hmmt_feb25](./evaluation/natural-math.md#hmmt_feb25)
- [**Math (formal language)**](./evaluation/formal-math.md): e.g. [minif2f](./evaluation/formal-math.md#minif2f), [proofnet](./evaluation/formal-math.md#proofnet), [putnam-bench](./evaluation/formal-math.md#putnam-bench)
- [**Code**](./evaluation/code.md): e.g. [swe-bench](./evaluation/code.md#swe-bench), [livecodebench](./evaluation/code.md#livecodebench), [bird](./evaluation/code.md#bird)
- [**Scientific knowledge**](./evaluation/scientific-knowledge.md): e.g., [hle](./evaluation/scientific-knowledge.md#hle), [scicode](./evaluation/scientific-knowledge.md#scicode), [gpqa](./evaluation/scientific-knowledge.md#gpqa)
- [**Scientific knowledge**](./evaluation/scientific-knowledge.md): e.g., hle, scicode, gpqa.
- [**Instruction following**](./evaluation/instruction-following.md): e.g. [ifbench](./evaluation/instruction-following.md#ifbench), [ifeval](./evaluation/instruction-following.md#ifeval)
- [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr), [aalcr](./evaluation/long-context.md#aalcr)
- [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
28 changes: 28 additions & 0 deletions nemo_skills/dataset/physics/__init__.py
@@ -0,0 +1,28 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics" # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
**Contributor:** Minor: the inline comment mismatches the code. `METRICS_TYPE = "physics"` uses the `PhysicsMetrics` class, not `MathMetrics`. Suggested comment: `# Uses PhysicsMetrics (compute_no_answer defaults to False)`.

EVAL_SPLIT = "test"
**Contributor:** `EVAL_SPLIT = "test"` creates naming confusion: per `prepare.py:68`, `test` contains only the 1,000 EN examples, while `zh.jsonl` is ZH-only and `en_zh.jsonl` is the combined set. Consider renaming `test` to `en` for clarity, or update the docs to state explicitly that "test" means the EN-only split.

**Contributor:** Wrong dataset group/type: `DATASET_GROUP = "math"` and `GENERATION_ARGS` sets `++eval_type=math`, but this PR introduces a physics-specific prompt and metrics. Using the math group/type here can route PHYSICS runs through the wrong dataset category/config defaults and select the wrong evaluation pipeline settings. If this benchmark is meant to show up under scientific knowledge (per the docs) and be evaluated with the physics metrics, the dataset metadata (group + eval_type) should be consistent with that.


# Setting openai judge by default, but it can be overridden from the command line for a locally hosted model
# Currently using o4-mini-2025-04-16
JUDGE_PIPELINE_ARGS = {
"model": "o4-mini-2025-04-16",
"server_type": "openai",
"server_address": "https://api.openai.com/v1",
}
JUDGE_ARGS = "++prompt_config=judge/physics ++generation_key=judgement ++add_generation_stats=False"
69 changes: 69 additions & 0 deletions nemo_skills/dataset/physics/prepare.py
@@ -0,0 +1,69 @@
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm


def strip_boxed(s):
    """Remove \\boxed{} if present."""
    if s.startswith("\\boxed{") and s.endswith("}"):
        return s[7:-1]
    return s


def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}."""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"
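As an illustrative check (not part of the PR), here is how these helpers behave on the dataset's nested `answer` schema, where each outer list element is one sub-question:

```python
def strip_boxed(s):
    """Remove \\boxed{} if present."""
    if s.startswith("\\boxed{") and s.endswith("}"):
        return s[7:-1]
    return s


def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}."""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"


# Two sub-questions, each with one accepted boxed answer:
print(process_answer([["\\boxed{-60}"], ["\\boxed{4}"]]))  # \boxed{-60, 4}
```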
**Contributor:** Minor: the nested comprehension in `process_answer()` is correct. The `desimfj/PHYSICS` schema confirms `answer` is `List[List[str]]`: each outer element corresponds to one sub-question, and each inner list holds one or more accepted answer strings (typically LaTeX, often wrapped in `\boxed{...}`). However, the `zip()` call at line 68 should use `zip(..., strict=True)` so that a future length mismatch raises immediately instead of silently truncating.



def format_entry(entry):
    return {
        "problem": entry["question"],
        "expected_answer": process_answer(entry["answer"]),
        "solution": entry["solution"],
        "answer_type": entry["answer_type"],
        "subset_for_metrics": entry["domain"],
        "difficulty": entry["difficulty"],
        "language": entry["language"],
    }


def write_data_to_file(output_file, data):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout)
            fout.write("\n")


def save_data(split_data, split_name):
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    output_file = data_dir / f"{split_name}.jsonl"

    write_data_to_file(output_file, split_data)


if __name__ == "__main__":
    dataset = load_dataset("desimfj/PHYSICS")["test"]
    eng_data = [entry for entry in dataset if entry["language"] == "en"]
    ch_data = [entry for entry in dataset if entry["language"] == "zh"]
    full_data = eng_data + ch_data

    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check Python version requirements
echo "=== Python version requirements ==="
rg -n "python_requires|requires-python|python_version" -S -i
echo ""
echo "=== pyproject.toml content (relevant sections) ==="
if [ -f pyproject.toml ]; then
  head -50 pyproject.toml
fi

Repository: NVIDIA-NeMo/Skills

Length of output: 1818


🏁 Script executed:

#!/bin/bash
# Check for Ruff configuration and B905 rule
echo "=== Looking for Ruff configuration ==="
find . -maxdepth 2 -type f \( -name "ruff.toml" -o -name ".ruff.toml" -o -name "pyproject.toml" \) -exec echo "File: {}" \; -exec cat {} \;

Repository: NVIDIA-NeMo/Skills

Length of output: 3330


🏁 Script executed:

#!/bin/bash
# Verify the actual zip call and surrounding context
echo "=== Context around line 68 in prepare.py ==="
sed -n '60,75p' nemo_skills/dataset/physics/prepare.py

Repository: NVIDIA-NeMo/Skills

Length of output: 508


Reconsider the strict= parameter value for alignment with project philosophy.

Ruff B905 requires explicit strict= parameter in zip(). However, the suggested fix of strict=False contradicts the project's coding guidelines about explicit error handling and letting errors surface rather than silently failing. Since both iterables are guaranteed to have exactly 3 elements each, use strict=True instead to enforce that expectation:

-    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 68-68: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)
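For completeness, B905 comes from Ruff's flake8-bugbear rule set. Assuming a pyproject-based configuration (a sketch, not necessarily this repo's actual config), it is enabled by selecting the `B` rules:

```toml
[tool.ruff.lint]
# "B" selects the flake8-bugbear rules, which include B905
# (zip() without an explicit strict= parameter).
select = ["B"]
```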

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/physics/prepare.py` at line 68, Change the zip call to
enforce that the two iterables have identical lengths by adding strict=True to
the zip invocation used in the loop over eng_data, ch_data, full_data and
["test", "zh", "en_zh"], i.e., update the for loop that binds split_data and
split_name so zip(..., strict=True) is used instead of a plain zip to ensure
mismatched lengths raise an error.

save_data(split_data, split_name)
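The write path above (format an entry, then dump one JSON object per line) can be sketched end-to-end with a stub entry. The field values below are invented for illustration; real entries come from the HF dataset and pass through `process_answer`/`format_entry` first:

```python
import json
import tempfile
from pathlib import Path

# Stub entry with the same fields format_entry emits; values are
# invented for illustration, not taken from the PHYSICS dataset.
stub = {
    "problem": "What is 2 + 2?",
    "expected_answer": "4",
    "solution": "Add the two numbers.",
    "answer_type": "numeric",
    "subset_for_metrics": "arithmetic",
    "difficulty": "easy",
    "language": "en",
}

with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "test.jsonl"
    # Same pattern as write_data_to_file: one JSON object per line.
    with open(output_file, "wt", encoding="utf-8") as fout:
        json.dump(stub, fout)
        fout.write("\n")

    # Each line of the output is an independent JSON object.
    lines = output_file.read_text(encoding="utf-8").splitlines()
    assert json.loads(lines[0])["language"] == "en"
    print(f"wrote {len(lines)} entry to {output_file.name}")
```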
Comment on lines +63 to +69

EN/ZH split filenames swapped
In the final loop, zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]) writes the English examples to test.jsonl, the Chinese examples to zh.jsonl, and the combined data to en_zh.jsonl. That makes test effectively EN-only and zh ZH-only, which may be acceptable on its own, but it contradicts the naming in the docs/config, where the EN default split is called test. If test is intended to be the full test split, this mapping is wrong; if test is intended to be EN-only, rename test to en (or update the dataset defaults/docs) so consumers don't accidentally evaluate the wrong language split.
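One way to make the mapping self-documenting, sketched under the assumption that explicit names are preferred (a hypothetical rewrite for illustration, not the author's fix, with toy data standing in for the real filtered lists):

```python
# Toy stand-ins for the real filtered lists (illustrative only).
eng_data = [{"language": "en"}]
ch_data = [{"language": "zh"}]
full_data = eng_data + ch_data

# An explicit name -> data mapping removes the parallel-list pairing
# entirely; rename "en_zh" to "test" if test should be the full split.
splits = {
    "en": eng_data,
    "zh": ch_data,
    "en_zh": full_data,
}
for split_name, split_data in splits.items():
    print(split_name, len(split_data))
```

A dict like this cannot go out of sync the way two parallel lists can, so the strict= question disappears along with the zip call.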
