Add FrontierScience-Olympiad to benchmark#1165
Conversation
**Greptile Summary**

This PR adds support for the FrontierScience-Olympiad benchmark, integrating 100 olympiad-level science problems across physics, chemistry, and biology. The implementation follows established patterns in the codebase and includes comprehensive documentation.
Issues found:
Confidence Score: 3/5
Important Files Changed
**Sequence Diagram**

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant Dataset as Dataset Files
    participant Eval as Evaluation Pipeline
    participant Judge as Judge Model
    User->>PrepareScript: Run with split parameter
    PrepareScript->>HF: Download olympiad dataset
    HF-->>PrepareScript: Return 100 problems
    PrepareScript->>PrepareScript: Parse JSONL and format entries
    PrepareScript->>PrepareScript: Filter by subject if specified
    PrepareScript->>Dataset: Write subject-specific files
    PrepareScript->>Dataset: Write combined all.jsonl
    User->>Eval: Run evaluation
    Eval->>Dataset: Load questions from split
    Eval->>Eval: Generate answers using model
    Eval->>Judge: Send question, generation, expected_answer
    Judge->>Judge: Apply judge prompt template
    Judge-->>Eval: Return judgement
    Eval->>Eval: Calculate metrics using MathMetrics
    Eval-->>User: Return evaluation scores
```
```python
for idx, entry in enumerate(tqdm(data, desc=f"Writing {output_file.name}")):
    # Filter by subject if specified
    if subject_filter and entry.get("subject", "").lower() != subject_filter:
        continue
    formatted_entry = format_entry(entry, idx)
```
[P2] When filtering by subject, idx from enumerate includes skipped entries, causing non-sequential IDs. For example, if physics problems are at indices 0, 5, 10, the IDs will be olympiad-0, olympiad-5, olympiad-10 instead of olympiad-0, olympiad-1, olympiad-2. Use count instead of idx for sequential IDs within each subject file.
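The counter-based fix can be sketched in a few lines (`format_entry` and `write_filtered` are hypothetical stand-ins for the helpers in `prepare.py`): increment a running counter only for entries that survive the subject filter, so the IDs written to each subject file stay contiguous.

```python
def format_entry(entry, idx):
    # Hypothetical formatter: attaches a sequential "olympiad-<n>" id
    return {**entry, "id": f"olympiad-{idx}"}

def write_filtered(data, subject_filter=None):
    formatted = []
    count = 0  # running counter, incremented only for kept entries
    for entry in data:
        # Filter by subject if specified
        if subject_filter and entry.get("subject", "").lower() != subject_filter:
            continue
        formatted.append(format_entry(entry, count))
        count += 1
    return formatted

data = [
    {"subject": "Physics"},
    {"subject": "Chemistry"},
    {"subject": "Physics"},
    {"subject": "Biology"},
]
print([e["id"] for e in write_filtered(data, "physics")])
# ['olympiad-0', 'olympiad-1'] rather than ['olympiad-0', 'olympiad-2']
```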
📝 **Walkthrough**

Introduces support for the FrontierScience Olympiad benchmark by adding dataset preparation scripts, metric registration, evaluation configuration, and documentation. The integration includes per-subject JSONL file generation, metric type registration with configurable parameters, and a judge prompt for grading science olympiad solutions.
**Estimated code review effort:** 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 **Pre-merge checks:** ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In @nemo_skills/dataset/frontierscience-olympiad/prepare.py:
- Around line 73-74: The requests.get call that fetches OLYMPIAD_URL currently
has no timeout and can hang; update the requests.get invocation to include a
reasonable timeout (e.g., timeout=10) and handle potential timeout/connection
errors by catching requests.Timeout or requests.RequestException around the call
before calling response.raise_for_status so the script fails fast and
logs/handles the error appropriately.
- Around line 50-54: The loop uses the full-dataset index variable idx when
calling format_entry, which yields non-sequential IDs after subject filtering;
change the call to pass the running counter variable count (used to generate
sequential IDs) instead of idx (i.e., call format_entry(entry, count)), and
ensure count is only incremented when an entry is not skipped by the
subject_filter so written files get contiguous IDs.
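The download hardening suggested above (timeout plus status handling) can be combined into one testable helper. This is a minimal sketch, not code from the PR: `fetch_jsonl`, `FakeResponse`, and the injected `getter` callable are hypothetical names, with the getter injected so the logic can be exercised without network access.

```python
import json

def fetch_jsonl(url, getter, timeout=30):
    # `getter` is any callable with a requests.get-like signature returning
    # an object exposing .status_code and .text
    try:
        response = getter(url, timeout=timeout)
    except Exception as e:
        raise RuntimeError(f"Error downloading dataset from {url}: {e}")
    # Fail fast on 404/500 instead of parsing an HTML error page as JSONL
    if response.status_code != 200:
        raise RuntimeError(f"Bad status {response.status_code} from {url}")
    return [json.loads(line) for line in response.text.splitlines() if line.strip()]

class FakeResponse:
    def __init__(self, status_code, text):
        self.status_code, self.text = status_code, text

rows = fetch_jsonl(
    "https://example.com/olympiad.jsonl",
    lambda url, timeout: FakeResponse(200, '{"subject": "physics"}\n{"subject": "biology"}\n'),
)
print(rows)  # [{'subject': 'physics'}, {'subject': 'biology'}]
```

With a real `requests.get` passed as the getter, this reproduces the behavior the review asks for: a bounded wait and a hard failure on non-200 responses.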
🧹 Nitpick comments (2)
docs/evaluation/scientific-knowledge.md (1)
**103-108:** Use descriptive link text instead of "here".

The link text on line 106 is non-descriptive. For accessibility and clarity, use meaningful link text that indicates the destination.
📝 Suggested improvement
```diff
 ### FrontierScience-Olympiad

 - Benchmark is defined in [`nemo_skills/dataset/frontierscience-olympiad/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/frontierscience-olympiad/__init__.py)
-- Original benchmark source is [here](https://huggingface.co/datasets/openai/frontierscience).
+- Original benchmark source is available on [HuggingFace](https://huggingface.co/datasets/openai/frontierscience).
 - Contains 100 short-answer questions crafted by international science olympiad medalists across physics, chemistry, and biology.
 - Available splits: `physics` (default), `chemistry`, `biology`, and `all` (all subjects combined).
```

nemo_skills/dataset/frontierscience-olympiad/prepare.py (1)
**63-68:** Consider using list unpacking for cleaner syntax.

Per Ruff RUF005, prefer unpacking over concatenation.
♻️ Optional improvement
```diff
 parser.add_argument(
     "--split",
     default="all",
-    choices=["all"] + SUBJECTS,
+    choices=["all", *SUBJECTS],
     help="Dataset split to process (all/chemistry/biology/physics).",
 )
```
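The RUF005 point in isolation: both forms below produce the same list, but unpacking avoids building and concatenating a temporary list and is the form Ruff suggests.

```python
SUBJECTS = ["physics", "chemistry", "biology"]

# Concatenation creates an intermediate list before joining
choices_concat = ["all"] + SUBJECTS
# Unpacking builds the result in a single literal (RUF005-preferred)
choices_unpack = ["all", *SUBJECTS]

print(choices_unpack)  # ['all', 'physics', 'chemistry', 'biology']
```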
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/evaluation/scientific-knowledge.md
- nemo_skills/dataset/frontierscience-olympiad/__init__.py
- nemo_skills/dataset/frontierscience-olympiad/prepare.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/math_metrics.py
- nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.
Applied to files:
nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml
🧬 Code graph analysis (2)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/math_metrics.py (1)
MathMetrics(25-145)
nemo_skills/dataset/frontierscience-olympiad/prepare.py (2)
nemo_skills/mcp/servers/tavily_search_tool.py (1)
answer (61-125)

nemo_skills/inference/chat_interface/core.py (1)
get(136-151)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/scientific-knowledge.md
106-106: Link text should be descriptive
(MD059, descriptive-link-text)
🪛 Ruff (0.14.10)
nemo_skills/dataset/frontierscience-olympiad/prepare.py
66-66: Consider ["all", *SUBJECTS] instead of concatenation
Replace with ["all", *SUBJECTS]
(RUF005)
73-73: Probable use of requests call without timeout
(S113)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (9)
nemo_skills/dataset/frontierscience-olympiad/prepare.py (2)
**29-43:** LGTM! The `format_entry` function correctly formats dataset entries, including stripping backticks from answers using regex.
**87-100:** LGTM! The split handling logic correctly processes individual subjects or all subjects with a combined `test.jsonl` output.

nemo_skills/evaluation/metrics/map_metrics.py (1)
**48-50:** LGTM! The metric configuration correctly aligns with the FrontierScience-Olympiad dataset structure, using `question_key="question"` to match the prepared data format and `answer_key="generation"` for judge-based evaluation.

nemo_skills/evaluation/metrics/math_metrics.py (2)
**28-33:** LGTM! Adding the configurable `question_key` parameter enables support for datasets with different field naming conventions while maintaining backward compatibility with the default `"problem"` value.
**114-123:** LGTM! The log statement correctly uses `self.question_key` to dynamically access the question field, ensuring accurate discrepancy logging across different dataset formats.

nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml (1)
**1-22:** The judge prompt's output format requires configuration verification.

The prompt outputs "Judgement: YES" or "Judgement: NO", but NeMo Skills' documented parsing methods expect either:

- Regex format: `IS_CORRECT: True/False`
- Structured JSON format: `{"is_correct_judgement": boolean}`

Confirm that the evaluation configuration includes appropriate parsing logic (regex pattern or custom handler) to convert this prompt's YES/NO format to the boolean values expected by the scoring pipeline.
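The conversion the reviewer asks about could be handled by a small regex hook. This is a hypothetical sketch — `parse_judgement` and `JUDGEMENT_RE` are illustrative names, not the actual NeMo Skills parsing API — showing how the prompt's "Judgement: YES/NO" line maps to the boolean the pipeline expects.

```python
import re

# Match "Judgement: YES" or "Judgement: NO" at the end of a line
JUDGEMENT_RE = re.compile(r"Judgement:\s*(YES|NO)\s*$", re.IGNORECASE | re.MULTILINE)

def parse_judgement(judge_output):
    match = JUDGEMENT_RE.search(judge_output)
    if match is None:
        return None  # unparseable judgement; let the caller decide how to score
    return match.group(1).upper() == "YES"

print(parse_judgement("The derivation checks out.\nJudgement: YES"))  # True
```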
nemo_skills/dataset/frontierscience-olympiad/__init__.py (3)
**1-14:** LGTM! Standard Apache 2.0 license header with appropriate copyright year.
**15-19:** LGTM! Configuration constants are well-structured. Defaulting `EVAL_SPLIT` to `"physics"` is reasonable since it can be overridden from the command line as noted.
**22-27:** Model identifier is valid and currently supported. The `o3-mini-2025-01-31` model is an active OpenAI snapshot available via the API with documented pricing and rate limits. Users must have OpenAI API access to run the judge evaluation.
```python
for idx, entry in enumerate(tqdm(data, desc=f"Writing {output_file.name}")):
    # Filter by subject if specified
    if subject_filter and entry.get("subject", "").lower() != subject_filter:
        continue
    formatted_entry = format_entry(entry, idx)
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's look at the complete prepare.py file to understand the context
cat -n nemo_skills/dataset/frontierscience-olympiad/prepare.py
```

Repository: NVIDIA-NeMo/Skills
Length of output: 4279
🏁 Script executed:
```shell
# Check other dataset prepare scripts for similar indexing patterns
rg -n "enumerate|format_entry|subject_filter" --type=py nemo_skills/dataset/ -A2 -B2 | head -80
```

Repository: NVIDIA-NeMo/Skills
Length of output: 5932
Use count instead of idx to generate sequential IDs in filtered subject files.
When filtering by subject, using idx from the full dataset enumeration creates non-sequential problem IDs. For example, if physics entries are at positions 0, 5, 10 in the full dataset, the physics.jsonl file will have IDs olympiad-0, olympiad-5, olympiad-10 instead of olympiad-0, olympiad-1, olympiad-2. Since count is already being tracked at line 48–57, pass count to format_entry() at line 54 instead of idx:
`formatted_entry = format_entry(entry, count)`

🤖 Prompt for AI Agents
In @nemo_skills/dataset/frontierscience-olympiad/prepare.py around lines 50 -
54, The loop uses the full-dataset index variable idx when calling format_entry,
which yields non-sequential IDs after subject filtering; change the call to pass
the running counter variable count (used to generate sequential IDs) instead of
idx (i.e., call format_entry(entry, count)), and ensure count is only
incremented when an entry is not skipped by the subject_filter so written files
get contiguous IDs.
```python
try:
    response = requests.get(OLYMPIAD_URL, timeout=30)
except Exception as e:
    raise RuntimeError(f"Error downloading dataset from {OLYMPIAD_URL}: {e}")
```
logic: Missing HTTP status code check. If the server returns 404/500, the code proceeds to parse the error page as JSONL, causing cryptic errors.
```diff
 try:
     response = requests.get(OLYMPIAD_URL, timeout=30)
+    response.raise_for_status()
 except Exception as e:
     raise RuntimeError(f"Error downloading dataset from {OLYMPIAD_URL}: {e}")
```
@activatedgeek added error handling and fixed a bug.
Greptile found no issues! From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR".
@jiacheng-xu could you fix DCO?
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
c1cad6c to ca0aa86 (Compare)
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
```python
print(f"Downloading FrontierScience olympiad dataset from {OLYMPIAD_URL}...")

try:
    response = requests.get(OLYMPIAD_URL, timeout=30)
```
logic: Missing HTTP status check. Add `response.raise_for_status()` after this line to ensure the server returned 200. Currently, 404/500 errors would attempt to parse HTML error pages as JSONL.
```diff
 response = requests.get(OLYMPIAD_URL, timeout=30)
+response.raise_for_status()
```
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
```
***
First, think step-by-step about whether the attempted answer matches the reference answer.
If the attempted answer is correct, write "Judgement: YES" in the last line of your
response, with no other text or formatting. If it is incorrect, write "Judgement: NO".
```
syntax: Missing closing `</output>` tag at the end of the YAML multiline string — won't parse correctly.

The `user: |-` block started on line 2 needs to be closed. Check other judge configs like `hle.yaml` for reference.
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
FrontierScience-related changes:
Implements support for the FrontierScience benchmark dataset (Olympiad).
FrontierScience-Olympiad (100 olympiad-level problems)
Changes
- `frontierscience-olympiad` dataset with `prepare.py` and `__init__.py`
- `physics`, `chemistry`, `biology` splits

Dataset Details
Metrics-related changes:
Updates the implementation of MathMetrics so users can customize the key used for questions.
Comment on why a "custom generation module" is not used:
FrontierScience has two types of problems, Olympiad and Research. They differ in everything from problem type to judge config to metrics type, and at the end of the day there is no aggregation of the two sets into a single number. I don't see much reason to combine them into a combined module like ArenaHard.
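The configurable question key described above can be sketched in miniature. This is a hypothetical illustration — `JudgeMetrics` and `question_of` are invented names, and the real `MathMetrics` class carries far more logic — showing only the field-lookup pattern the PR describes.

```python
class JudgeMetrics:
    def __init__(self, question_key="problem"):
        # Default "problem" preserves backward compatibility; datasets like
        # the olympiad split name the field "question" instead
        self.question_key = question_key

    def question_of(self, sample):
        # Look up the question field under the configured key
        return sample[self.question_key]

metrics = JudgeMetrics(question_key="question")
sample = {"question": "What is 2 + 2?", "expected_answer": "4"}
print(metrics.question_of(sample))  # What is 2 + 2?
```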