Add FrontierScience-Olympiad to benchmark #1165
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # settings that define how evaluation should be done by default (all can be changed from cmdline) | ||
| DATASET_GROUP = "math" | ||
| METRICS_TYPE = "frontierscience-olympiad" | ||
| GENERATION_ARGS = "++prompt_config=generic/default ++eval_type=math" | ||
| EVAL_SPLIT = "all" | ||
|
|
||
|
|
||
| JUDGE_PIPELINE_ARGS = { | ||
| "model": "o3-mini-2025-01-31", | ||
| "server_type": "openai", | ||
| "server_address": "https://api.openai.com/v1", | ||
| } | ||
| JUDGE_ARGS = "++prompt_config=judge/frontierscience-olympiad ++generation_key=judgement ++add_generation_stats=False" |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,103 @@ | ||||||||||||||||||||
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | ||||||||||||||||||||
| # | ||||||||||||||||||||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||||||||||||||||||
| # you may not use this file except in compliance with the License. | ||||||||||||||||||||
| # You may obtain a copy of the License at | ||||||||||||||||||||
| # | ||||||||||||||||||||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||||||||||||||||||||
| # | ||||||||||||||||||||
| # Unless required by applicable law or agreed to in writing, software | ||||||||||||||||||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||||||||||||||||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||||||||||||||||
| # See the License for the specific language governing permissions and | ||||||||||||||||||||
| # limitations under the License. | ||||||||||||||||||||
|
|
||||||||||||||||||||
| import argparse | ||||||||||||||||||||
| import json | ||||||||||||||||||||
| import re | ||||||||||||||||||||
| from pathlib import Path | ||||||||||||||||||||
|
|
||||||||||||||||||||
| import requests | ||||||||||||||||||||
| from tqdm import tqdm | ||||||||||||||||||||
|
|
||||||||||||||||||||
| OLYMPIAD_URL = "https://huggingface.co/datasets/openai/frontierscience/resolve/main/olympiad/test.jsonl" | ||||||||||||||||||||
|
|
||||||||||||||||||||
| # Map of available subjects | ||||||||||||||||||||
| SUBJECTS = ["chemistry", "biology", "physics"] | ||||||||||||||||||||
|
|
||||||||||||||||||||
|
|
||||||||||||||||||||
| def format_entry(entry, problem_index): | ||||||||||||||||||||
| """Format entry for nemo-skills from FrontierScience Olympiad dataset.""" | ||||||||||||||||||||
| answer = entry.get("answer", "") | ||||||||||||||||||||
| # Remove surrounding backticks (handles `, ``, ```, etc.) | ||||||||||||||||||||
| answer = re.sub(r"^`+|`+$", "", answer).strip() | ||||||||||||||||||||
|
|
||||||||||||||||||||
| formatted = { | ||||||||||||||||||||
| "id": f"olympiad-{problem_index}", | ||||||||||||||||||||
| "question": entry.get("problem", ""), | ||||||||||||||||||||
| "expected_answer": answer, | ||||||||||||||||||||
| "subset_for_metrics": entry.get("subject", ""), | ||||||||||||||||||||
| "task_group_id": entry.get("task_group_id", ""), | ||||||||||||||||||||
| } | ||||||||||||||||||||
|
|
||||||||||||||||||||
| return formatted | ||||||||||||||||||||
|
|
||||||||||||||||||||
|
|
||||||||||||||||||||
| def write_data_to_file(output_file, data, subject_filter=None): | ||||||||||||||||||||
| """Write formatted data to JSONL file.""" | ||||||||||||||||||||
| count = 0 | ||||||||||||||||||||
| with open(output_file, "wt", encoding="utf-8") as fout: | ||||||||||||||||||||
| for idx, entry in enumerate(tqdm(data, desc=f"Writing {output_file.name}")): | ||||||||||||||||||||
| # Filter by subject if specified | ||||||||||||||||||||
| if subject_filter and entry.get("subject", "").lower() != subject_filter: | ||||||||||||||||||||
| continue | ||||||||||||||||||||
| formatted_entry = format_entry(entry, idx) | ||||||||||||||||||||
Comment on lines +50 to +54
Contributor
[P2] When filtering by subject, pass `count` rather than `idx` to `format_entry`.
Suggested change: `formatted_entry = format_entry(entry, count)`
Analysis chain (scripts executed against repository NVIDIA-NeMo/Skills):
First, look at the complete prepare.py file to understand the context: `cat -n nemo_skills/dataset/frontierscience-olympiad/prepare.py` (length of output: 4279)
Check other dataset prepare scripts for similar indexing patterns: `rg -n "enumerate|format_entry|subject_filter" --type=py nemo_skills/dataset/ -A2 -B2 | head -80` (length of output: 5932)
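To make the difference behind this review comment concrete, here is a minimal sketch comparing the two numbering schemes. The helper `assign_ids` and the three-entry sample are hypothetical and exist only for illustration; only the filter condition and the `olympiad-{index}` id format are taken from the diff.

```python
def assign_ids(entries, subject_filter=None, use_running_count=False):
    """Return the ids a filtered write would assign (illustrative helper, not part of the PR)."""
    ids = []
    count = 0
    for idx, entry in enumerate(entries):
        # Same filter condition as write_data_to_file in the diff
        if subject_filter and entry.get("subject", "").lower() != subject_filter:
            continue
        # idx is the position in the full dataset; count is the position in the filtered output
        index = count if use_running_count else idx
        ids.append(f"olympiad-{index}")
        count += 1
    return ids


# Hypothetical sample data, not taken from the real dataset
sample = [
    {"subject": "physics"},
    {"subject": "chemistry"},
    {"subject": "physics"},
]

ids_by_position = assign_ids(sample, "physics")
ids_by_count = assign_ids(sample, "physics", use_running_count=True)
print(ids_by_position)
print(ids_by_count)
```

With `idx`, ids in a per-subject file have gaps but match the ids the same problems would get when no filter is applied. With `count`, ids are contiguous within each file but no longer match the unfiltered numbering. Which property matters more is a design choice for the PR authors.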
| json.dump(formatted_entry, fout) | ||||||||||||||||||||
| fout.write("\n") | ||||||||||||||||||||
| count += 1 | ||||||||||||||||||||
| return count | ||||||||||||||||||||
|
|
||||||||||||||||||||
|
|
||||||||||||||||||||
| if __name__ == "__main__": | ||||||||||||||||||||
| parser = argparse.ArgumentParser() | ||||||||||||||||||||
| parser.add_argument( | ||||||||||||||||||||
| "--split", | ||||||||||||||||||||
| default="all", | ||||||||||||||||||||
| choices=["all"] + SUBJECTS, | ||||||||||||||||||||
| help="Dataset split to process (all/chemistry/biology/physics).", | ||||||||||||||||||||
| ) | ||||||||||||||||||||
| args = parser.parse_args() | ||||||||||||||||||||
|
|
||||||||||||||||||||
| # Load the FrontierScience olympiad dataset directly from HuggingFace | ||||||||||||||||||||
| print(f"Downloading FrontierScience olympiad dataset from {OLYMPIAD_URL}...") | ||||||||||||||||||||
|
|
||||||||||||||||||||
| try: | ||||||||||||||||||||
| response = requests.get(OLYMPIAD_URL, timeout=30) | ||||||||||||||||||||
| except Exception as e: | ||||||||||||||||||||
| raise RuntimeError(f"Error downloading dataset from {OLYMPIAD_URL}: {e}") | ||||||||||||||||||||
Comment on lines +74 to +77
Contributor
logic: Missing HTTP status code check. If the server returns 404 or 500, the code proceeds to parse the error page as JSONL, causing cryptic errors.
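One way to address this comment is to call `raise_for_status()` on the response inside the `try` block, so that a 4xx or 5xx reply surfaces as the same `RuntimeError` as a network failure. The sketch below is an assumption about how that could look, not the change that was merged. The function `fetch_text`, the `FakeResponse` stand-in, and the `example.invalid` URL are hypothetical; the `get` callable is injected so the example runs offline, and in the real script it would be `requests.get`.

```python
def fetch_text(get, url, timeout=30):
    """Fetch url with the supplied `get` callable (e.g. requests.get) and fail fast on HTTP errors."""
    try:
        response = get(url, timeout=timeout)
        # requests.Response.raise_for_status() raises on 4xx/5xx status codes
        response.raise_for_status()
    except Exception as e:
        raise RuntimeError(f"Error downloading dataset from {url}: {e}") from e
    return response.text


class FakeResponse:
    """Stand-in for requests.Response, used only to exercise the sketch offline."""

    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self):
        if self.status_code >= 400:
            raise ValueError(f"HTTP {self.status_code}")


def fake_get_ok(url, timeout):
    return FakeResponse(200, '{"problem": "p"}\n')


def fake_get_missing(url, timeout):
    return FakeResponse(404, "<html>Not Found</html>")


ok_text = fetch_text(fake_get_ok, "https://example.invalid/test.jsonl")

try:
    fetch_text(fake_get_missing, "https://example.invalid/test.jsonl")
    error_message = None
except RuntimeError as e:
    error_message = str(e)
```

With the check in place, the HTML error page never reaches `json.loads`, and the failure message names the URL and the status problem.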
|
|
||||||||||||||||||||
| # Parse JSONL data | ||||||||||||||||||||
| olympiad_data = [] | ||||||||||||||||||||
| for line in response.text.strip().split("\n"): | ||||||||||||||||||||
| if line: | ||||||||||||||||||||
| olympiad_data.append(json.loads(line)) | ||||||||||||||||||||
|
|
||||||||||||||||||||
| print(f"Loaded {len(olympiad_data)} olympiad problems") | ||||||||||||||||||||
|
|
||||||||||||||||||||
| data_dir = Path(__file__).absolute().parent | ||||||||||||||||||||
| data_dir.mkdir(exist_ok=True) | ||||||||||||||||||||
|
|
||||||||||||||||||||
| if args.split == "all": | ||||||||||||||||||||
| applied_subjects = SUBJECTS | ||||||||||||||||||||
| else: | ||||||||||||||||||||
| applied_subjects = [args.split] | ||||||||||||||||||||
| # Process all subjects separately | ||||||||||||||||||||
| for subject in applied_subjects: | ||||||||||||||||||||
| output_file = data_dir / f"{subject}.jsonl" | ||||||||||||||||||||
| count = write_data_to_file(output_file, olympiad_data, subject_filter=subject) | ||||||||||||||||||||
| print(f"Saved {count} {subject} entries to {output_file}") | ||||||||||||||||||||
| if args.split == "all": | ||||||||||||||||||||
| # Also create a combined all.jsonl with all problems | ||||||||||||||||||||
| output_file = data_dir / "all.jsonl" | ||||||||||||||||||||
| count = write_data_to_file(output_file, olympiad_data) | ||||||||||||||||||||
| print(f"Saved {count} total entries to {output_file}") | ||||||||||||||||||||
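The backtick handling in `format_entry` is easy to misread, so here is a small sketch of what the regular expression from the diff does on a few inputs. The wrapper `strip_backticks` and the sample strings are hypothetical; the pattern and the trailing `.strip()` are copied from the diff.

```python
import re


def strip_backticks(answer):
    """Apply the same cleanup as format_entry: drop leading/trailing backtick runs, then trim whitespace."""
    return re.sub(r"^`+|`+$", "", answer).strip()


# Hypothetical sample answers, not taken from the dataset
examples = ["`42`", "```H2O```", "  `x`  ", "a`b"]
cleaned = [strip_backticks(a) for a in examples]
print(cleaned)
```

Backticks at the very start or end of the string are removed and interior ones are kept. Because whitespace is trimmed only after the substitution, an answer whose backticks are themselves wrapped in spaces (the third sample) keeps them; whether the dataset contains such answers is not something this diff shows.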
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # Source: https://cdn.openai.com/pdf/2fcd284c-b468-4c21-8ee0-7a783933efcc/frontierscience-paper.pdf Page 13 | ||
| user: |- | ||
| You are grading an attempted answer to a science olympiad problem. You will be given the | ||
| problem, attempted answer, and reference answer. Evaluate the solution against the provided | ||
| reference solution, ensuring it is complete and matches the reference solution. Pay close | ||
| attention to detail and grade it strictly, but fairly. | ||
| The reference answer is either a single number or expression in latex formatting, a chemical | ||
| formula, a compound name, or a phrase referring to a specific name, entity, or method. | ||
| Mark the attempted answer as correct if it fully matches the reference answer or is otherwise | ||
| equivalent (e.g., an equivalent algebraic expression, a numerical number within 1 decimal | ||
| place rounding of the reference answer (e.g., 6.69 ≈ 6.7), an equivalent name for a | ||
| compound/formula, equivalent when accounting for units, etc.). Mark it as incorrect if it is | ||
| not equivalent to the reference answer. | ||
| *** | ||
| The problem: {question} | ||
| *** | ||
| The reference answer: {expected_answer} | ||
| *** | ||
| The attempted answer: {generation} | ||
| *** | ||
| First, think step-by-step about whether the attempted answer matches the reference answer. | ||
| If the attempted answer is correct, write "Judgement: YES" in the last line of your | ||
| response, with no other text or formatting. If it is incorrect, write "Judgement: NO". | ||
Contributor
syntax: Missing closing The
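The prompt asks the judge to end its response with `Judgement: YES` or `Judgement: NO` on the last line. As a rough illustration of how such output could be consumed, the sketch below reads the verdict from the last non-empty line. The function `parse_judgement` is hypothetical and is not the metric code used by nemo-skills for `METRICS_TYPE = "frontierscience-olympiad"`; it only assumes the output format stated in the prompt.

```python
import re


def parse_judgement(judge_output):
    """Return True for YES, False for NO, or None if the last non-empty line has no verdict."""
    lines = [line.strip() for line in judge_output.strip().splitlines() if line.strip()]
    if not lines:
        return None
    # The prompt requires the verdict alone on the last line, with no other text or formatting
    match = re.fullmatch(r"Judgement:\s*(YES|NO)", lines[-1])
    if match is None:
        return None
    return match.group(1) == "YES"


# Hypothetical judge outputs for illustration
verdict_yes = parse_judgement("6.69 rounds to 6.7, which matches.\nJudgement: YES")
verdict_no = parse_judgement("The units differ.\nJudgement: NO\n")
verdict_missing = parse_judgement("The answers look equivalent.")
```

Returning `None` for a missing or malformed verdict keeps unparseable judge responses distinguishable from explicit `NO` verdicts, which may be useful when auditing judge behaviour.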