
Add FrontierScience-Olympiad to benchmark#1165

Merged
Kipok merged 11 commits into main from jcxu/frontierscience
Jan 21, 2026

Conversation

@jiacheng-xu (Collaborator) commented Jan 13, 2026

FrontierScience related changes:

Implements support for the FrontierScience benchmark dataset (Olympiad).

  • FrontierScience-Olympiad (100 olympiad-level problems)

Changes

  • Added the frontierscience-olympiad/ dataset with prepare.py and __init__.py
  • Both support splits by subject: physics, chemistry, biology
  • Added documentation in docs/evaluation/scientific-knowledge.md

Metrics related changes:

Updated the implementation of MathMetrics so users can customize the key used for questions.

Comment on why a "custom generation module" is not used:

FrontierScience has two types of problems, Olympiad and Research. From problem type to judge config to metrics type, they differ. At the end of the day, there is no aggregation of these two sets into a single number, so I don't see much reason to combine them into one module the way ArenaHard does.

greptile-apps bot commented Jan 13, 2026

Greptile Summary

This PR adds support for the FrontierScience-Olympiad benchmark, integrating 100 olympiad-level science problems across physics, chemistry, and biology. The implementation follows established patterns in the codebase and includes comprehensive documentation.

Key additions:

  • Dataset preparation script downloads and formats olympiad problems from HuggingFace
  • Custom judge prompt based on OpenAI's evaluation methodology
  • Metrics configuration using MathMetrics with customizable question_key parameter
  • Documentation with example configurations and benchmark results

Issues found:

  • YAML syntax error in judge config (missing closing tag)
  • Previous review comments about HTTP status checking and ID assignment remain unaddressed

Confidence Score: 3/5

  • This PR has one critical syntax issue that will prevent the judge from working correctly
  • The missing closing tag in the YAML file is a syntax error that will cause parsing failures. The metrics changes are solid and follow good patterns (adding flexibility via question_key). Previous review comments about HTTP error handling and ID sequencing remain unaddressed but are less critical.
  • Pay close attention to nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml which has a syntax error

Important Files Changed

  • nemo_skills/dataset/frontierscience-olympiad/prepare.py: Added dataset preparation script with HTTP download and JSONL formatting; ID assignment uses the enumerate index, causing non-sequential IDs when filtering
  • nemo_skills/evaluation/metrics/math_metrics.py: Added a configurable question_key parameter to support different field names across benchmarks
  • nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml: Added judge prompt based on the OpenAI paper; missing a closing output tag in the YAML

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant HF as HuggingFace
    participant Dataset as Dataset Files
    participant Eval as Evaluation Pipeline
    participant Judge as Judge Model

    User->>PrepareScript: Run with split parameter
    PrepareScript->>HF: Download olympiad dataset
    HF-->>PrepareScript: Return 100 problems
    PrepareScript->>PrepareScript: Parse JSONL and format entries
    PrepareScript->>PrepareScript: Filter by subject if specified
    PrepareScript->>Dataset: Write subject-specific files
    PrepareScript->>Dataset: Write combined all.jsonl
    
    User->>Eval: Run evaluation
    Eval->>Dataset: Load questions from split
    Eval->>Eval: Generate answers using model
    Eval->>Judge: Send question, generation, expected_answer
    Judge->>Judge: Apply judge prompt template
    Judge-->>Eval: Return judgement
    Eval->>Eval: Calculate metrics using MathMetrics
    Eval-->>User: Return evaluation scores

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Comment on lines +50 to +54
for idx, entry in enumerate(tqdm(data, desc=f"Writing {output_file.name}")):
    # Filter by subject if specified
    if subject_filter and entry.get("subject", "").lower() != subject_filter:
        continue
    formatted_entry = format_entry(entry, idx)

[P2] When filtering by subject, idx from enumerate includes skipped entries, causing non-sequential IDs. For example, if physics problems are at indices 0, 5, 10, the IDs will be olympiad-0, olympiad-5, olympiad-10 instead of olympiad-0, olympiad-1, olympiad-2. Use count instead of idx for sequential IDs within each subject file.
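The suggested fix can be sketched as follows; format_entry here is a simplified stand-in for the real helper in prepare.py, and the loop shows only the counter logic. A separate counter advances only for entries that survive the filter, so written IDs stay contiguous:

```python
# Sketch of the count-based fix; `format_entry` is a simplified stand-in
# for the helper in prepare.py, not the actual implementation.
def format_entry(entry, idx):
    return {"id": f"olympiad-{idx}", "subject": entry.get("subject", "")}

def write_filtered(data, subject_filter=None):
    formatted = []
    count = 0  # advances only for entries that pass the filter
    for entry in data:
        # Filter by subject if specified
        if subject_filter and entry.get("subject", "").lower() != subject_filter:
            continue
        formatted.append(format_entry(entry, count))  # count, not enumerate's idx
        count += 1
    return formatted

data = [{"subject": s} for s in ["physics", "chemistry", "physics", "biology", "physics"]]
ids = [e["id"] for e in write_filtered(data, "physics")]
print(ids)  # contiguous IDs even though physics sits at positions 0, 2, 4
```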

coderabbitai bot commented Jan 13, 2026

📝 Walkthrough

Introduces support for the FrontierScience Olympiad benchmark by adding dataset preparation scripts, metric registration, evaluation configuration, and documentation. The integration includes per-subject JSONL file generation, metric type registration with configurable parameters, and a judge prompt for grading science olympiad solutions.

Changes

  • Documentation (docs/evaluation/scientific-knowledge.md): Adds a FrontierScience Olympiad section under Supported benchmarks, documenting the dataset definition, source, content scope (100 short-answer questions in physics, chemistry, biology), and available splits. Note: content appears duplicated in the document.
  • Dataset Package Initialization (nemo_skills/dataset/frontierscience-olympiad/__init__.py): Introduces a new package with default evaluation constants: DATASET_GROUP ("math"), METRICS_TYPE ("frontierscience-olympiad"), GENERATION_ARGS, EVAL_SPLIT ("physics"), JUDGE_PIPELINE_ARGS (model, server configuration), and JUDGE_ARGS (prompt and evaluation settings).
  • Dataset Preparation (nemo_skills/dataset/frontierscience-olympiad/prepare.py): Adds a data preparation script that downloads the dataset from HuggingFace, formats entries with id/question/answer/subject fields, and writes per-subject JSONL files (chemistry, biology, physics) plus a combined test.jsonl. Supports a --split CLI option for subject filtering.
  • Metric Registration & Updates (nemo_skills/evaluation/metrics/map_metrics.py, nemo_skills/evaluation/metrics/math_metrics.py): Registers the "frontierscience-olympiad" metric type using MathMetrics with compute_no_answer=False and a custom question_key. Adds a question_key parameter to the MathMetrics initializer for flexible question field selection, replacing the hardcoded "problem" reference.
  • Judge Prompt Configuration (nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml): Adds a YAML prompt template for grading science olympiad solutions, instructing comparison of the attempted answer against the reference answer with allowances for equivalence (rounding, alternative naming, units), and requiring a final "Judgement: YES/NO" output.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Add compute eval #1158: Also adds new dataset integration with METRICS_MAP registration; both follow similar integration patterns for new benchmark support.

Suggested reviewers

  • gwarmstrong
  • Kipok
🚥 Pre-merge checks: 2 passed, 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 60.00%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly and concisely summarizes the main change, adding a new benchmark (FrontierScience-Olympiad) to the codebase, which is reflected across all modified files including documentation, dataset initialization, preparation script, metrics mapping, and configuration files.
  • Description check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In nemo_skills/dataset/frontierscience-olympiad/prepare.py:

  • Around lines 73-74: The requests.get call that fetches OLYMPIAD_URL currently has no timeout and can hang. Update the requests.get invocation to include a reasonable timeout (e.g., timeout=10) and handle potential timeout/connection errors by catching requests.Timeout or requests.RequestException around the call before calling response.raise_for_status, so the script fails fast and logs/handles the error appropriately.
  • Around lines 50-54: The loop uses the full-dataset index variable idx when calling format_entry, which yields non-sequential IDs after subject filtering. Change the call to pass the running counter variable count (used to generate sequential IDs) instead of idx (i.e., call format_entry(entry, count)), and ensure count is only incremented when an entry is not skipped by the subject_filter so written files get contiguous IDs.
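The timeout and status-check fixes discussed here combine into a fail-fast download pattern. The sketch below injects the HTTP getter so the pattern runs without a network; in prepare.py the getter would be requests.get, and FakeResponse is a hypothetical stand-in mimicking only the parts of a requests response this pattern touches:

```python
import json

# Fail-fast download sketch: bounded timeout plus raise_for_status() before
# the body is parsed as JSONL. The getter is injected for testability; in
# prepare.py it would be requests.get.
def fetch_jsonl(get, url, timeout=30):
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # surface 404/500 instead of parsing an error page
    except Exception as e:
        raise RuntimeError(f"Error downloading dataset from {url}: {e}") from e
    return [json.loads(line) for line in response.text.splitlines() if line.strip()]

# Minimal stand-in for a requests.Response (hypothetical, for illustration only).
class FakeResponse:
    def __init__(self, status_code, text=""):
        self.status_code, self.text = status_code, text

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

rows = fetch_jsonl(lambda url, timeout: FakeResponse(200, '{"id": 1}\n'),
                   "https://example.com/olympiad.jsonl")
print(rows)  # [{'id': 1}]
```

With this shape, a 404 error page raises RuntimeError immediately instead of producing cryptic JSONL parse failures downstream.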
🧹 Nitpick comments (2)
docs/evaluation/scientific-knowledge.md (1)

103-108: Use descriptive link text instead of "here".

The link text on line 106 is non-descriptive. For accessibility and clarity, use meaningful link text that indicates the destination.

📝 Suggested improvement
 ### FrontierScience-Olympiad
 
 - Benchmark is defined in [`nemo_skills/dataset/frontierscience-olympiad/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/frontierscience-olympiad/__init__.py)
-- Original benchmark source is [here](https://huggingface.co/datasets/openai/frontierscience).
+- Original benchmark source is available on [HuggingFace](https://huggingface.co/datasets/openai/frontierscience).
 - Contains 100 short-answer questions crafted by international science olympiad medalists across physics, chemistry, and biology.
 - Available splits: `physics` (default), `chemistry`, `biology`, and `all` (all subjects combined).
nemo_skills/dataset/frontierscience-olympiad/prepare.py (1)

63-68: Consider using list unpacking for cleaner syntax.

Per Ruff RUF005, prefer unpacking over concatenation.

♻️ Optional improvement
     parser.add_argument(
         "--split",
         default="all",
-        choices=["all"] + SUBJECTS,
+        choices=["all", *SUBJECTS],
         help="Dataset split to process (all/chemistry/biology/physics).",
     )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 079b106 and a3c2fe7.

📒 Files selected for processing (6)
  • docs/evaluation/scientific-knowledge.md
  • nemo_skills/dataset/frontierscience-olympiad/__init__.py
  • nemo_skills/dataset/frontierscience-olympiad/prepare.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/math_metrics.py
  • nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-12-12T16:09:53.870Z
Learnt from: Jorjeous
Repo: NVIDIA-NeMo/Skills PR: 1103
File: nemo_skills/prompt/config/judge/audiobench.yaml:15-28
Timestamp: 2025-12-12T16:09:53.870Z
Learning: In AudioBench judge prompt configuration (nemo_skills/prompt/config/judge/audiobench.yaml), having duplicate Score0 entries is intentional - one for "refusing to give concrete results" and another for "completely misaligned" answers. These should remain as separate entries rather than being combined.

Applied to files:

  • nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml
🧬 Code graph analysis (2)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/math_metrics.py (1)
  • MathMetrics (25-145)
nemo_skills/dataset/frontierscience-olympiad/prepare.py (2)
nemo_skills/mcp/servers/tavily_search_tool.py (1)
  • answer (61-125)
nemo_skills/inference/chat_interface/core.py (1)
  • get (136-151)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/scientific-knowledge.md

106-106: Link text should be descriptive

(MD059, descriptive-link-text)

🪛 Ruff (0.14.10)
nemo_skills/dataset/frontierscience-olympiad/prepare.py

66-66: Consider ["all", *SUBJECTS] instead of concatenation

Replace with ["all", *SUBJECTS]

(RUF005)


73-73: Probable use of requests call without timeout

(S113)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (9)
nemo_skills/dataset/frontierscience-olympiad/prepare.py (2)

29-43: LGTM!

The format_entry function correctly formats dataset entries, including stripping backticks from answers using regex.


87-100: LGTM!

The split handling logic correctly processes individual subjects or all subjects with a combined test.jsonl output.

nemo_skills/evaluation/metrics/map_metrics.py (1)

48-50: LGTM!

The metric configuration correctly aligns with the FrontierScience-Olympiad dataset structure, using question_key="question" to match the prepared data format and answer_key="generation" for judge-based evaluation.

nemo_skills/evaluation/metrics/math_metrics.py (2)

28-33: LGTM!

Adding the configurable question_key parameter enables support for datasets with different field naming conventions while maintaining backward compatibility with the default "problem" value.


114-123: LGTM!

The log statement correctly uses self.question_key to dynamically access the question field, ensuring accurate discrepancy logging across different dataset formats.
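The configurable-key pattern described in these comments can be sketched as follows; this is an illustrative simplification, not the actual MathMetrics class, which carries much more logic:

```python
# Illustrative sketch of a metrics class with a configurable question key;
# not the real MathMetrics, only the pattern its initializer change follows.
class MetricsSketch:
    def __init__(self, compute_no_answer=True, question_key="problem"):
        self.compute_no_answer = compute_no_answer
        # Field name holding the question text: "problem" preserves the old
        # hardcoded behavior, while a benchmark like FrontierScience-Olympiad
        # would pass question_key="question".
        self.question_key = question_key

    def question_of(self, sample):
        return sample[self.question_key]

default_metrics = MetricsSketch()
olympiad_metrics = MetricsSketch(compute_no_answer=False, question_key="question")
print(olympiad_metrics.question_of({"question": "Estimate the Reynolds number."}))
```

Defaulting question_key to "problem" keeps existing math benchmarks working unchanged while letting new registrations override the field name.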

nemo_skills/prompt/config/judge/frontierscience-olympiad.yaml (1)

1-22: The judge prompt's output format requires configuration verification.

The prompt outputs "Judgement: YES" or "Judgement: NO", but NeMo Skills' documented parsing methods expect either:

  • Regex format: IS_CORRECT: True/False
  • Structured JSON format: {"is_correct_judgement": boolean}

Confirm that the evaluation configuration includes appropriate parsing logic (regex pattern or custom handler) to convert this prompt's YES/NO format to the boolean values expected by the scoring pipeline.
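One way to bridge the two formats, sketched here as a hypothetical regex handler (this is not the NeMo Skills parser): extract the final "Judgement: YES/NO" line and map it to a boolean.

```python
import re

# Hypothetical handler (not actual NeMo Skills code): map the judge's final
# "Judgement: YES/NO" line to the boolean the scoring pipeline expects.
def parse_judgement(judge_output):
    match = re.search(r"Judgement:\s*(YES|NO)\s*$", judge_output.strip(), re.IGNORECASE)
    if match is None:
        return None  # unparsable verdict; the caller decides how to score it
    return match.group(1).upper() == "YES"

print(parse_judgement("The derivation checks out.\nJudgement: YES"))  # True
```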

nemo_skills/dataset/frontierscience-olympiad/__init__.py (3)

1-14: LGTM!

Standard Apache 2.0 license header with appropriate copyright year.


15-19: LGTM!

Configuration constants are well-structured. Defaulting EVAL_SPLIT to "physics" is reasonable since it can be overridden from the command line as noted.


22-27: Model identifier is valid and currently supported. The o3-mini-2025-01-31 model is an active OpenAI snapshot available via the API with documented pricing and rate limits. Users must have OpenAI API access to run the judge evaluation.

Comment on lines +50 to +54
for idx, entry in enumerate(tqdm(data, desc=f"Writing {output_file.name}")):
    # Filter by subject if specified
    if subject_filter and entry.get("subject", "").lower() != subject_filter:
        continue
    formatted_entry = format_entry(entry, idx)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the complete prepare.py file to understand the context
cat -n nemo_skills/dataset/frontierscience-olympiad/prepare.py



🏁 Script executed:

# Check other dataset prepare scripts for similar indexing patterns
rg -n "enumerate|format_entry|subject_filter" --type=py nemo_skills/dataset/ -A2 -B2 | head -80



Use count instead of idx to generate sequential IDs in filtered subject files.

When filtering by subject, using idx from the full dataset enumeration creates non-sequential problem IDs. For example, if physics entries are at positions 0, 5, 10 in the full dataset, the physics.jsonl file will have IDs olympiad-0, olympiad-5, olympiad-10 instead of olympiad-0, olympiad-1, olympiad-2. Since count is already being tracked at line 48–57, pass count to format_entry() at line 54 instead of idx:

formatted_entry = format_entry(entry, count)
🤖 Prompt for AI Agents
In nemo_skills/dataset/frontierscience-olympiad/prepare.py around lines 50-54: the loop uses the full-dataset index variable idx when calling format_entry, which yields non-sequential IDs after subject filtering. Change the call to pass the running counter variable count (used to generate sequential IDs) instead of idx (i.e., call format_entry(entry, count)), and ensure count is only incremented when an entry is not skipped by the subject_filter so written files get contiguous IDs.

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 2 comments

Comment on lines +74 to +77
try:
    response = requests.get(OLYMPIAD_URL, timeout=30)
except Exception as e:
    raise RuntimeError(f"Error downloading dataset from {OLYMPIAD_URL}: {e}")

logic: Missing HTTP status code check. If server returns 404/500, code proceeds to parse error page as JSONL, causing cryptic errors

Suggested change
- try:
-     response = requests.get(OLYMPIAD_URL, timeout=30)
- except Exception as e:
-     raise RuntimeError(f"Error downloading dataset from {OLYMPIAD_URL}: {e}")
+ try:
+     response = requests.get(OLYMPIAD_URL, timeout=30)
+     response.raise_for_status()
+ except Exception as e:
+     raise RuntimeError(f"Error downloading dataset from {OLYMPIAD_URL}: {e}")

@jiacheng-xu (Collaborator, Author) commented:

@activatedgeek added error handling and fixed a bug.

greptile-apps bot commented Jan 13, 2026

Greptile found no issues!



ekmb commented Jan 13, 2026

@jiacheng-xu could you fix DCO?

@jiacheng-xu jiacheng-xu requested a review from ekmb January 14, 2026 01:00
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
@greptile-apps greptile-apps bot left a comment

6 files reviewed, 1 comment

print(f"Downloading FrontierScience olympiad dataset from {OLYMPIAD_URL}...")

try:
    response = requests.get(OLYMPIAD_URL, timeout=30)

logic: Missing HTTP status check - add response.raise_for_status() after this line to ensure server returned 200. Currently, 404/500 errors would attempt to parse HTML error pages as JSONL.

Suggested change
- response = requests.get(OLYMPIAD_URL, timeout=30)
+ response = requests.get(OLYMPIAD_URL, timeout=30)
+ response.raise_for_status()

Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
@greptile-apps greptile-apps bot left a comment

6 files reviewed, 1 comment

***
First, think step-by-step about whether the attempted answer matches the reference answer.
If the attempted answer is correct, write "Judgement: YES" in the last line of your
response, with no other text or formatting. If it is incorrect, write "Judgement: NO".

syntax: Missing closing </output> tag at end of YAML multiline string - won't parse correctly

The user: |- block started on line 2 needs to be closed. Check other judge configs like hle.yaml for reference.
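A hedged sketch of the shape this comment asks for; the surrounding field name and elided content are inferred from the quoted prompt tail and may differ from the real template in the repository:

```yaml
# Illustrative only: the multiline user block should close the <output>
# tag it opened; the "..." stands for the rest of the template.
user: |-
  ...
  <output>
  First, think step-by-step about whether the attempted answer matches the
  reference answer. If the attempted answer is correct, write "Judgement: YES"
  in the last line of your response, with no other text or formatting. If it
  is incorrect, write "Judgement: NO".
  </output>
```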

@Kipok Kipok merged commit 058e7a6 into main Jan 21, 2026
7 checks passed
@Kipok Kipok deleted the jcxu/frontierscience branch January 21, 2026 17:48
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
