Gnalbandyan/add physics #1214
Changes from all commits: 2377699, 06df59c, 3704acd, ea5a10d, caa20b6, b307b24
@@ -1,214 +1,93 @@
# Scientific Knowledge
> **Collaborator:** please fix these issues reported by mkdocs

> **Collaborator:** also the table is a bit too wide - have to scroll through. Maybe we can reorganize to reduce the number of columns? E.g. the link can just be fused into the first column. And if we also remove images (can just add a footnote maybe for hle), then it's going to fit.

> **Author:** fixed these; for the images column, we plan to add multimodal data soon, that's why it's there.
Nemo-Skills can be used to evaluate an LLM on various STEM datasets.

## Dataset Overview
### hle

- Benchmark is defined in [`nemo_skills/dataset/hle/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hle/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/cais/hle).
- The `text` split includes all non-image examples. It is further divided into `eng`, `chem`, `bio`, `cs`, `phy`, `math`, `human`, and `other`. Currently, **all** of these splits contain only text data.

### SimpleQA

- Benchmark is defined in [`nemo_skills/dataset/simpleqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/simpleqa/__init__.py)
- Original benchmark source code for SimpleQA (OpenAI) is [here](https://github.com/openai/simple-evals/) and the leaderboard is [here](https://www.kaggle.com/benchmarks/openai/simpleqa). An improved version with 1,000 examples from Google, SimpleQA-verified, is [here](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified).
- To use SimpleQA-verified, set `split=verified`. To use the original version of SimpleQA, set `split=test`.

In the configurations below, we also use `gpt-oss-120b` as the judge model.
| <div style="width:55px; display:inline-block; text-align:center">Dataset</div> | <div style="width:105px; display:inline-block; text-align:center">Questions</div> | <div style="width:85px; display:inline-block; text-align:center">Types</div> | <div style="width:145px; display:inline-block; text-align:center">Domain</div> | <div style="width:60px; display:inline-block; text-align:center">Images?</div> | <div style="width:50px; display:inline-block; text-align:center">NS default</div> |
|:---|:---:|:---:|:---|:---:|:---:|
| **[HLE](https://huggingface.co/datasets/cais/hle)** | 2500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only |
| **[GPQA](https://huggingface.co/datasets/Idavidrein/gpqa)** | 448 (main)<br>198 (diamond)<br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond |
| **[SuperGPQA](https://huggingface.co/datasets/m-a-p/SuperGPQA)** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test |
| **[MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| **[SciCode](https://huggingface.co/datasets/SciCode1/SciCode)** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val |
| **[FrontierScience](https://huggingface.co/datasets/openai/frontierscience)** | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| **[Physics](https://huggingface.co/datasets/desimfj/PHYSICS)** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
| **[MMLU](https://huggingface.co/datasets/cais/mmlu)** | 14,042 | MCQ (4) | Multiple subjects | No | test |
| **[MMLU-Redux](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux)** | 5,385 | MCQ (4) | Multiple subjects | No | test |
| **[SimpleQA](https://github.com/openai/simple-evals/)** | 4,326 (test), 1,000 (verified) | Open ended | Factuality, parametric knowledge | No | verified |

> **Contributor** (on the Physics row): Documentation table states default is "EN", but …
## Evaluate `NVIDIA-Nemotron-3-Nano` on an MCQ dataset

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
)
```

> **Collaborator:** Can we add expected results to all of these commands? You can use mkdocs dropdowns / tabs to make it use less space, e.g. can have a toggle per benchmark / evaluation mode or something. But having reference numbers is useful.
## Evaluate `NVIDIA-Nemotron-3-Nano` using LLM-as-a-judge
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.top_p=1.0 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="hle:4",
    output_dir="/workspace/Nano_V3_evals",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
    extra_judge_args="++chat_template_kwargs.reasoning_effort=high ++inference.temperature=1.0 ++inference.top_p=1.0 ++inference.tokens_to_generate=120000",
)
```
!!! note

    The module name for `reasoning-parser` differs across `vllm` versions. Depending on your version, it might appear as `openai_gptoss` or `GptOss`. In the latest main branch, it is named `openai_gptoss`. You can verify this in [gptoss_reasoning_parser.py](https://github.com/vllm-project/vllm/blob/main/vllm/reasoning/gptoss_reasoning_parser.py) to confirm which name your environment uses.
#### Result

We also tested a variant where the full generation output was provided to the judge, disabling `parse_reasoning`. This configuration, labeled `simpleqa-gpt-oss-120b-tool-full-generation`, produced results nearly identical to the standard setup where the reasoning portion is excluded from the judge's input.

| Run Name | pass@1 | majority@2 | pass@2 |
|:---|---:|---:|---:|
| simpleqa-gpt-oss-120b-notool | 12.93 | 12.93 | 17.22 |
| simpleqa-gpt-oss-120b-tool-full-generation | 80.30 | 80.30 | 84.78 |
| simpleqa-gpt-oss-120b-tool-output-only | 79.51 | 79.51 | 83.74 |

The reported number for `simpleqa-gpt-oss-120b-notool` is 13.1% according to this [kaggle page](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified).
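As a reminder, the aggregate metrics in these tables can be pictured with a small standalone sketch. This is a simplified illustration only; NeMo-Skills' actual metric computation (including how the judge verdicts are produced) is more involved:

```python
from collections import Counter


def pass_at_k(judgements):
    """pass@k: at least one of the k sampled answers was judged correct."""
    return any(judgements)


def majority_at_k(answers, judgements):
    """majority@k: the most frequent answer across k samples is a correct one."""
    majority_answer, _ = Counter(answers).most_common(1)[0]
    return any(ans == majority_answer and ok for ans, ok in zip(answers, judgements))


# Three hypothetical samples for one question:
answers = ["B", "A", "B"]
judgements = [True, False, True]  # per-sample judge verdicts
print(pass_at_k(judgements))       # -> True
print(majority_at_k(answers, judgements))  # -> True
```

Note that `Counter.most_common` breaks ties by first occurrence, so a real implementation would need an explicit tie-breaking rule for even `k`.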
### FrontierScience-Olympiad

- Benchmark is defined in [`nemo_skills/dataset/frontierscience-olympiad/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/frontierscience-olympiad/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openai/frontierscience).
- Contains 100 short-answer questions crafted by international science olympiad medalists across physics, chemistry, and biology.
- Available splits: `physics`, `chemistry`, `biology`, and `all` (all subjects combined, default).
## Evaluate `NVIDIA-Nemotron-3-Nano` on an MCQ dataset using tools

```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

cluster = "slurm"
eval(
    ctx=wrap_arguments(
        "++inference.temperature=0.6 ++inference.top_p=0.95 "
        "++inference.tokens_to_generate=131072 "
        "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
        "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
    ),
    cluster=cluster,
    server_type="vllm",
    server_gpus=1,
    server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32 --enable-auto-tool-choice --tool-call-parser qwen3_coder",
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    benchmarks="gpqa:4",
    output_dir="/workspace/Nano_V3_evals",
    with_sandbox=True,
)
```
#### Configuration: `gpt-oss-120b` without tool
```python
from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(
        "++inference.temperature=1.0 ++inference.tokens_to_generate=65536 "
        "++inference.extra_body.reasoning_effort=high"
    ),
    cluster="slurm",
    expname="ghn-model_gpt_oss_120b",
    model="openai/gpt-oss-120b",
    server_type="vllm",
    server_gpus=8,
    server_args="--async-scheduling",
    benchmarks="frontierscience-olympiad:20",
    split="all",
    num_chunks=1,
    output_dir="/workspace/frontierscience-ghn-model_gpt_oss_120b",
    wandb_project="frontier",
    wandb_name="frontierscience-ghn-model_gpt_oss_120b",
    judge_model="openai/gpt-oss-120b",
    judge_server_type="vllm",
    judge_server_gpus=8,
    judge_server_args="--async-scheduling",
)
```
#### Result

| Run Name | pass@1 | majority@8 | pass@8 |
|:---|---:|---:|---:|
| gpt-oss-20b (no tool) | 49.74 | 47.00 | 71.98 |
| gpt-oss-20b (with python tool) | 36.94 | 37.38 | 73.61 |
| gpt-oss-120b (no tool) | 60.53 | 61.13 | 79.25 |
| gpt-oss-120b (with python tool) | 54.05 | 53.00 | 80.07 |
### SuperGPQA

- Benchmark is defined in [`nemo_skills/dataset/supergpqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/supergpqa/__init__.py)
- Original benchmark source is available in the [SuperGPQA repository](https://github.com/SuperGPQA/SuperGPQA). The official leaderboard is listed on the [SuperGPQA dataset page](https://supergpqa.github.io/#Dataset).
- The `science` split contains all the data where the discipline is "Science". The default full split is `test`.
### scicode

!!! note

    For scicode, by default we evaluate on the combined dev + test split (containing 80 problems and 338 subtasks) for consistency with the [AAI evaluation methodology](https://artificialanalysis.ai/methodology/intelligence-benchmarking). If you want to evaluate only on the test set, use `--split=test`.

- Benchmark is defined in [`nemo_skills/dataset/scicode/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/scicode/__init__.py)
- Original benchmark source is [here](https://github.com/scicode-bench/SciCode).
### gpqa

- Benchmark is defined in [`nemo_skills/dataset/gpqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/gpqa/__init__.py)
- Original benchmark source is [here](https://github.com/idavidrein/gpqa).

### mmlu-pro

- Benchmark is defined in [`nemo_skills/dataset/mmlu-pro/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmlu-pro/__init__.py)
- Original benchmark source is [here](https://github.com/TIGER-AI-Lab/MMLU-Pro).

### mmlu

- Benchmark is defined in [`nemo_skills/dataset/mmlu/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmlu/__init__.py)
- Original benchmark source is [here](https://github.com/hendrycks/test).

### mmlu-redux

- Benchmark is defined in [`nemo_skills/dataset/mmlu-redux/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/mmlu-redux/__init__.py)
- Original benchmark source is [here](https://github.com/aryopg/mmlu-redux).
@@ -0,0 +1,28 @@
```python
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
```
> **Contributor** (on lines 15-18): Comment mismatch: `METRICS_TYPE` uses `PhysicsMetrics`, not `MathMetrics`. Suggested: `METRICS_TYPE = "physics"  # Uses PhysicsMetrics (compute_no_answer defaults to False)`.
||||||||||||||||||
| EVAL_SPLIT = "test" | ||||||||||||||||||
|
> **Contributor** (on lines 15-19): Wrong dataset group/type. If this benchmark is meant to show up under scientific knowledge (per the docs) and be evaluated with the physics metrics, the dataset metadata should be consistent with that (group + eval_type).
```python
# Setting openai judge by default, but can be overridden from command line for a locally hosted model
# Currently using o4-mini-2025-04-16
JUDGE_PIPELINE_ARGS = {
    "model": "o4-mini-2025-04-16",
    "server_type": "openai",
    "server_address": "https://api.openai.com/v1",
}
JUDGE_ARGS = "++prompt_config=judge/physics ++generation_key=judgement ++add_generation_stats=False"
```
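`JUDGE_PIPELINE_ARGS` is a plain dict of defaults that command-line values can override. The override semantics can be pictured with a small standalone sketch (`resolve_judge_args` is a hypothetical helper for illustration, not a NeMo-Skills function):

```python
# Dataset-level judge defaults, mirroring JUDGE_PIPELINE_ARGS above.
DEFAULT_JUDGE_PIPELINE_ARGS = {
    "model": "o4-mini-2025-04-16",
    "server_type": "openai",
    "server_address": "https://api.openai.com/v1",
}


def resolve_judge_args(cli_overrides):
    """Merge defaults with overrides; command-line values win."""
    resolved = dict(DEFAULT_JUDGE_PIPELINE_ARGS)
    resolved.update(cli_overrides)
    return resolved


# Pointing the judge at a locally hosted model instead of the OpenAI API:
print(resolve_judge_args({"model": "openai/gpt-oss-120b", "server_type": "vllm"}))
```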
`nemo_skills/dataset/physics/prepare.py`

@@ -0,0 +1,69 @@
```python
# Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
from pathlib import Path

from datasets import load_dataset
from tqdm import tqdm


def strip_boxed(s):
    """Remove \\boxed{} if present"""
    if s.startswith("\\boxed{") and s.endswith("}"):
        return s[7:-1]
    return s


def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}"""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"
```
> **Contributor** (on lines 29-32): Add `strict=True` to the `zip` call on line 68: `zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True)`.
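As a quick illustration of the two answer helpers above, here is a standalone sanity check (the helpers are re-declared so the snippet runs on its own; the sample answer list is invented and only mimics the nested list-of-lists shape the script expects):

```python
def strip_boxed(s):
    """Remove \\boxed{} if present."""
    if s.startswith("\\boxed{") and s.endswith("}"):
        return s[7:-1]
    return s


def process_answer(answer):
    """Flatten a list of lists of answers and wrap them in a single \\boxed{}."""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"


# A made-up "answer" field: a list of lists, where some items are already boxed.
sample = [["\\boxed{3.2 m/s}"], ["42"]]
print(process_answer(sample))  # -> \boxed{3.2 m/s, 42}
```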
```python
def format_entry(entry):
    return {
        "problem": entry["question"],
        "expected_answer": process_answer(entry["answer"]),
        "solution": entry["solution"],
        "answer_type": entry["answer_type"],
        "subset_for_metrics": entry["domain"],
        "difficulty": entry["difficulty"],
        "language": entry["language"],
    }


def write_data_to_file(output_file, data):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
            json.dump(format_entry(entry), fout)
            fout.write("\n")


def save_data(split_data, split_name):
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    output_file = data_dir / f"{split_name}.jsonl"

    write_data_to_file(output_file, split_data)


if __name__ == "__main__":
    dataset = load_dataset("desimfj/PHYSICS")["test"]
    eng_data = [entry for entry in dataset if entry["language"] == "en"]
    ch_data = [entry for entry in dataset if entry["language"] == "zh"]
    full_data = eng_data + ch_data

    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
        save_data(split_data, split_name)
```

> **Contributor** (on line 68): Ruff B905 requires an explicit `strict=` argument for `zip`; suggested: `zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True)`.
> **Contributor** (on lines 63-69): EN/ZH split filenames swapped.
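To illustrate the one-JSON-object-per-line layout that `write_data_to_file` produces, here is a standalone round-trip with a single made-up record (all field values are invented; only the shape mirrors `format_entry`'s output):

```python
import json
import tempfile
from pathlib import Path

# One made-up record in the same shape that format_entry emits.
entry = {
    "problem": "A ball is dropped from 20 m. How long until it lands?",
    "expected_answer": "\\boxed{2.02 s}",
    "solution": "t = sqrt(2h/g)",
    "answer_type": "numeric",
    "subset_for_metrics": "Mechanics",
    "difficulty": "easy",
    "language": "en",
}

with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "test.jsonl"
    # One JSON object per line, exactly like write_data_to_file.
    with open(output_file, "wt", encoding="utf-8") as fout:
        json.dump(entry, fout)
        fout.write("\n")

    # Read it back: each line parses independently.
    loaded = [json.loads(line) for line in output_file.read_text(encoding="utf-8").splitlines()]

print(loaded[0]["expected_answer"])  # -> \boxed{2.02 s}
```

This line-oriented format is what makes chunked processing (e.g. `num_chunks` in the eval pipeline) straightforward: a file can be split on newlines without parsing the whole document.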