
Gnalbandyan/add physics #1214

Merged
gnalbandyan merged 6 commits into main from gnalbandyan/add_physics on Feb 11, 2026

Conversation

Collaborator

@gnalbandyan gnalbandyan commented Feb 5, 2026

Added PHYSICS benchmark, updated Scientific Knowledge documentation page

Summary by CodeRabbit

  • Documentation

    • Consolidated scientific-knowledge docs into a compact dataset overview table and updated evaluation examples to new parameter conventions.
  • New Features

    • Added physics evaluation support: dataset preparation and export, specialized scoring for physics answers, and new prompt templates for problem generation and judgement.

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
@gnalbandyan gnalbandyan requested review from Kipok, ekmb and jiacheng-xu and removed request for Kipok February 5, 2026 15:25
Contributor

@greptile-apps greptile-apps bot left a comment


2 files reviewed, 2 comments


Comment on lines +63 to +69
dataset = load_dataset("desimfj/PHYSICS")["test"]
eng_data = [entry for entry in dataset if entry["language"] == "en"]
ch_data = [entry for entry in dataset if entry["language"] == "zh"]
full_data = eng_data + ch_data

for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
    save_data(split_data, split_name)
Contributor

EN/ZH split filenames swapped
In the final loop, zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]) writes English examples to test.jsonl and Chinese examples to zh.jsonl, but then uses en_zh for the combined split. This makes test effectively EN-only and zh ZH-only, which seems fine, but contradicts the naming in the docs/config (EN default is called test). If test is intended to be the full test split, this is wrong; if test is intended to be EN-only, rename test to en (or update dataset defaults/docs) to avoid consumers accidentally evaluating the wrong language split.

@coderabbitai
Contributor

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

Adds a new Physics dataset package (data prep and config), physics-specific metrics and judge/prompt configs, and updates scientific-knowledge docs to a compact dataset table and revised evaluation examples.

Changes

| Cohort / File(s) | Summary |
|:---|:---|
| **Physics Dataset**<br>`nemo_skills/dataset/physics/__init__.py`, `nemo_skills/dataset/physics/prepare.py` | New dataset module with evaluation/config constants and data-preparation utilities (strip_boxed, process_answer, format_entry, write_data_to_file, save_data). Produces JSONL splits (test en, zh, en_zh) from DESIMFJ Physics. |
| **Metrics Registration & Implementation**<br>`nemo_skills/evaluation/metrics/map_metrics.py`, `nemo_skills/evaluation/metrics/physics_metrics.py`, `nemo_skills/evaluation/metrics/math_metrics.py` | Registers physics in METRICS_MAP; adds PhysicsMetrics subclass (is_correct_judgement, get_incorrect_sample) and exposes MathMetrics.is_correct_judgement wrapper used in scoring flow. |
| **Prompt Configs**<br>`nemo_skills/prompt/config/generic/physics.yaml`, `nemo_skills/prompt/config/judge/physics.yaml` | Adds generic problem prompt enforcing LaTeX and boxed answers, and judge prompt for producing [Correct]/[Incorrect] judgements with handling rules for multiple answers and equivalence. |
| **Docs**<br>`docs/evaluation/scientific-knowledge.md`, `docs/evaluation/index.md`, `docs/index.md` | Converts narrative benchmark sections into a consolidated dataset overview table and simplifies example listings; updates example evaluation snippets and parameter names. |
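The [Correct]/[Incorrect] judgement parsing mentioned in the metrics summary can be sketched roughly as follows (a hypothetical simplification; the actual PhysicsMetrics implementation may parse judge output differently):

```python
import re

def is_correct_judgement(judgement: str) -> bool:
    """Return True when the judge output's final verdict is [Correct].

    Hypothetical sketch; the real PhysicsMetrics method in nemo_skills
    may parse the judge output differently.
    """
    verdicts = re.findall(r"\[(Correct|Incorrect)\]", judgement)
    # Take the last bracketed verdict in case the labels are restated earlier.
    return bool(verdicts) and verdicts[-1] == "Correct"

print(is_correct_judgement("...equivalent answers. Judgement: [Correct]"))  # True
print(is_correct_judgement("...unit mismatch. Judgement: [Incorrect]"))     # False
```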

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Prep as DataPrep
    participant Storage as Dataset (JSONL)
    participant Evaluator as PhysicsMetrics
    participant Judge as JudgePrompt
    participant Model as JudgeModel

    User->>Prep: load_dataset(DESIMFJ Physics)
    Prep->>Prep: strip_boxed / process_answer / format_entry
    Prep->>Storage: write JSONL splits (en, zh, en_zh)
    User->>Evaluator: submit prediction(s)
    Evaluator->>Judge: craft judge prompt (problem, generation, expected_answer)
    Judge->>Model: send prompt to judge model/server
    Model-->>Judge: return judgement ([Correct]/[Incorrect])
    Judge-->>Evaluator: judgement text
    Evaluator-->>User: score dict (judge_correct, metrics)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • ekmb
  • Kipok
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|:---|:---|:---|:---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.36%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Gnalbandyan/add physics' is overly vague and uses a branch naming convention rather than describing the actual changes; it does not clearly convey the main purpose of adding a Physics benchmark. | Use a more descriptive title like 'Add Physics benchmark with dataset, metrics, and evaluation configuration' to clearly summarize the main changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|:---|:---|:---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🤖 Fix all issues with AI agents
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 10-13: Remove the invalid HTML break tags in the markdown table
and following code block by replacing all occurrences of "<br>" and "</br>" with
valid markdown-friendly breaks (e.g., use "<br/>" if HTML breaks are required,
or convert to plain newlines/Markdown line breaks) in the rows containing
"**GPQA**", "**SciCode**" and the surrounding table lines (also fix the similar
instances around lines 40-40 referenced). Ensure table cells remain properly
formatted after the change.
- Around line 5-17: Add a new "Physics benchmark" subsection below the "Dataset
Overview" table that targets the "Physics" row: include the example evaluation
command (CLI or script) to run the benchmark, the expected baseline
results/metrics to compare against, model-testing details (prompt format,
scoring/judging rules and any automated judge used), and dataset-specific notes
describing the EN/ZH splits and how to select the EN split for evaluation;
reference the "Physics" dataset name from the table and ensure the subsection
succinctly documents command, expected results, model testing, and
dataset-specific setup (EN/ZH selection and judge configuration).

In `@nemo_skills/dataset/physics/__init__.py`:
- Around line 15-18: The inline comment next to METRICS_TYPE is incorrect:
update the comment to reflect that METRICS_TYPE = "physics" uses the
PhysicsMetrics class (not MathMetrics) and still sets compute_no_answer=False;
modify the comment on the METRICS_TYPE line accordingly and ensure surrounding
constants DATASET_GROUP, METRICS_TYPE, and GENERATION_ARGS remain unchanged.

In `@nemo_skills/dataset/physics/prepare.py`:
- Line 68: Change the zip call to enforce that the two iterables have identical
lengths by adding strict=True to the zip invocation used in the loop over
eng_data, ch_data, full_data and ["test", "zh", "en_zh"], i.e., update the for
loop that binds split_data and split_name so zip(..., strict=True) is used
instead of a plain zip to ensure mismatched lengths raise an error.
- Around line 29-32: Add strict=True to the zip(...) invocation used to pair the
parallel lists in this module — locate the zip call in this file (the one that
pairs items when building examples, adjacent to process_answer) and change
zip(a, b) to zip(a, b, strict=True) so mismatched lengths raise immediately;
ensure the call site where the pairing logic is implemented is updated (the zip
used inside the example-building function in this file).

In `@nemo_skills/prompt/config/generic/physics.yaml`:
- Around line 2-5: The YAML prompt contains a typo in the third rule: change the
word "seperated" to the correct spelling "separate" in the instruction that
reads "If there are multiple final answers, please seperated them by commas in
\\boxed{{}}"; update that sentence so it reads "If there are multiple final
answers, please separate them by commas in \\boxed{{}}", keeping the surrounding
LaTeX guidance and formatting intact (file:
nemo_skills/prompt/config/generic/physics.yaml; locate the rule text containing
"seperated").

In `@nemo_skills/prompt/config/judge/physics.yaml`:
- Around line 16-17: The prompt string fragment "Question: {problem}, Output
sentence: {generation}, Correct answer: {expected_answer}, Judge- ment:" is
broken by a hyphenated line break; fix it by merging the split token into a
single word and line so it reads "Judgement:" (i.e., replace "Judge- ment:" with
"Judgement:") ensuring the entire prompt line is contiguous: "Question:
{problem}, Output sentence: {generation}, Correct answer: {expected_answer},
Judgement:".

Comment on lines +5 to +17
## Dataset Overview

### hle

- Benchmark is defined in [`nemo_skills/dataset/hle/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hle/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/cais/hle).
- The `text` split includes all non-image examples. It is further divided into `eng`, `chem`, `bio`, `cs`, `phy`, `math`, `human`, `other`. Currently, **all** of these splits contain only text data.

### SimpleQA

- Benchmark is defined in [`nemo_skills/dataset/simpleqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/simpleqa/__init__.py)
- Original benchmark source code for SimpleQA (OpenAI) is [here](https://github.com/openai/simple-evals/) and the leaderboard is [here](https://www.kaggle.com/benchmarks/openai/simpleqa). An improved version with 1,000 examples from Google, SimpleQA-verified, is [here](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified).
- To use the SimpleQA-verified, set `split=verified`. To use the original version of SimpleQA, please set `split=test`.

In the below configurations, we also use `gpt-oss-120b` as the judge model.

#### Configuration: `gpt-oss-120b` with builtin tool (python)
| <div style="width:80px; display:inline-block; text-align:center">Dataset</div> | <div style="width:110px; display:inline-block; text-align:center">Questions</div> | <div style="width:90px; display:inline-block; text-align:center">Types</div> | <div style="width:150px; display:inline-block; text-align:center">Domain</div> | <div style="width:70px; display:inline-block; text-align:center">Images?</div> | <div style="width:70px; display:inline-block; text-align:center">NS default</div> | <div style="width:50px; display:inline-block; text-align:center">Link</div> |
|:---|:---:|:---:|:---|:---:|:---:|:---:|
| **HLE** | 2500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only | [HF](https://huggingface.co/datasets/cais/hle) |
| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
| **SuperGPQA** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test | [HF](https://huggingface.co/datasets/m-a-p/SuperGPQA) |
| **MMLU-Pro** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test | [HF](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) |
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
| **FrontierScience** | 100 | Short-answer | Physics, Chemistry, Biology | No | all | [HF](https://huggingface.co/datasets/openai/frontierscience) |
| **Physics** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN | [HF](https://huggingface.co/datasets/desimfj/PHYSICS) |
| **MMLU** | 14,042 | MCQ (4) | Multiple Subjects | No | test | [HF](https://huggingface.co/datasets/cais/mmlu) |
| **SimpleQa** | 4,326 (test), 1,000 (verified) | Open ended | Factuality, Parametric knowledge| No | verified | [HF](https://github.com/openai/simple-evals/) |
Contributor

⚠️ Potential issue | 🟠 Major

Add Physics benchmark details (command, expected results, model testing, dataset-specific notes).
The table introduces Physics but there’s no physics-specific example command, expected results, or dataset notes (e.g., EN/ZH splits, judge setup). Please add a short subsection covering these items.
As per coding guidelines: When adding new benchmarks, add documentation with example commands, expected results, model testing details, and dataset-specific information.

🤖 Prompt for AI Agents
In `@docs/evaluation/scientific-knowledge.md` around lines 5 - 17, Add a new
"Physics benchmark" subsection below the "Dataset Overview" table that targets
the "Physics" row: include the example evaluation command (CLI or script) to run
the benchmark, the expected baseline results/metrics to compare against,
model-testing details (prompt format, scoring/judging rules and any automated
judge used), and dataset-specific notes describing the EN/ZH splits and how to
select the EN split for evaluation; reference the "Physics" dataset name from
the table and ensure the subsection succinctly documents command, expected
results, model testing, and dataset-specific setup (EN/ZH selection and judge
configuration).

Comment on lines +10 to +13
| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
| **SuperGPQA** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test | [HF](https://huggingface.co/datasets/m-a-p/SuperGPQA) |
| **MMLU-Pro** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test | [HF](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) |
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
Contributor

⚠️ Potential issue | 🟡 Minor

Fix invalid <br> tags in the table and after the code block.

🛠️ Suggested fix
-| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
+| **GPQA** | 448 (main)<br>198 (diamond)<br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
-| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
+| **SciCode** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
-</br>
+<br>

Also applies to: 40-40

🤖 Prompt for AI Agents
In `@docs/evaluation/scientific-knowledge.md` around lines 10 - 13, Remove the
invalid HTML break tags in the markdown table and following code block by
replacing all occurrences of "<br>" and "</br>" with valid markdown-friendly
breaks (e.g., use "<br/>" if HTML breaks are required, or convert to plain
newlines/Markdown line breaks) in the rows containing "**GPQA**", "**SciCode**"
and the surrounding table lines (also fix the similar instances around lines
40-40 referenced). Ensure table cells remain properly formatted after the
change.

Comment on lines +15 to +18
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics" # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
Contributor

⚠️ Potential issue | 🟡 Minor

Comment mismatch: METRICS_TYPE uses PhysicsMetrics, not MathMetrics.

🛠️ Suggested fix
-METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
+METRICS_TYPE = "physics"  # Uses PhysicsMetrics (compute_no_answer defaults to False)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
-METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
+METRICS_TYPE = "physics"  # Uses PhysicsMetrics (compute_no_answer defaults to False)
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/physics/__init__.py` around lines 15 - 18, The inline
comment next to METRICS_TYPE is incorrect: update the comment to reflect that
METRICS_TYPE = "physics" uses the PhysicsMetrics class (not MathMetrics) and
still sets compute_no_answer=False; modify the comment on the METRICS_TYPE line
accordingly and ensure surrounding constants DATASET_GROUP, METRICS_TYPE, and
GENERATION_ARGS remain unchanged.

Comment on lines +29 to +32
def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}"""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's view the full prepare.py file to understand the context
find . -type f -name "prepare.py" -path "*/physics/*" | head -5



🏁 Script executed:

# List files in the physics dataset directory
fd "prepare.py" --path "*physics*"



🏁 Script executed:

# Search for the exact file
find . -type f -name "prepare.py" | grep physics



🌐 Web query:

HuggingFace dataset desimfj/PHYSICS schema answer field format

💡 Result:

In desimfj/PHYSICS, the answer field is a nested list: List[List[str]] (HF viewer shows it as a sequence of sequences). Each outer list element corresponds to one sub-question, and the inner list contains one or more accepted answer strings (typically LaTeX, often wrapped with \\boxed{...}).[1]

Example patterns from the dataset:[1]

  • Single sub-question:
    answer = [[ "\\boxed{3}" ]]
  • Two sub-questions:
    answer = [[ "\\boxed{-60}" ], [ "\\boxed{4}" ]]
  • MCQ:
    answer = [[ "\\boxed{B}" ]]

🏁 Script executed:

# Search for references to this dataset or answer processing
rg "desimfj/PHYSICS" --type py



🏁 Script executed:

# Look for load_dataset calls related to physics
rg "load_dataset.*physics" --type py -i



🏁 Script executed:

# Now let's read the full prepare.py file to see the context and line 68
cat -n ./nemo_skills/dataset/physics/prepare.py



🏁 Script executed:

# Check for Python version requirements
find . -name "setup.py" -o -name "pyproject.toml" -o -name ".python-version" | xargs cat 2>/dev/null | head -50



🏁 Script executed:

# Also check for requirements files
find . -name "requirements*.txt" | xargs head -5 2>/dev/null | head -30



🏁 Script executed:

# Search for python_requires in setup files
rg "python_requires|python =" --type py --type toml | head -10



🏁 Script executed:

# Verify zip behavior and confirm both lists have same length
python3 << 'EOF'
# Simulating the actual data
eng_data = [1, 2, 3]  # placeholder for actual data
ch_data = [1, 2, 3]
full_data = [1, 2, 3]

split_names = ["test", "zh", "en_zh"]
split_datas = [eng_data, ch_data, full_data]

# Check lengths match
print(f"Length of split_datas: {len(split_datas)}")
print(f"Length of split_names: {len(split_names)}")
print(f"Balanced: {len(split_datas) == len(split_names)}")

# Test that zip(strict=True) would work (Python 3.10+)
result = list(zip(split_datas, split_names, strict=True))
print(f"zip(strict=True) works: {len(result)} pairs")
EOF



Add strict=True to the zip() call at line 68 for safety.

The nested comprehension in process_answer() is correct—the dataset schema confirms answer is List[List[str]], and the code handles it properly with no risk of data corruption. However, line 68 should use zip(..., strict=True) to catch potential future bugs if the list lengths ever become unequal.

Fix for line 68
-    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
🤖 Prompt for AI Agents
In `@nemo_skills/dataset/physics/prepare.py` around lines 29 - 32, Add strict=True
to the zip(...) invocation used to pair the parallel lists in this module —
locate the zip call in this file (the one that pairs items when building
examples, adjacent to process_answer) and change zip(a, b) to zip(a, b,
strict=True) so mismatched lengths raise immediately; ensure the call site where
the pairing logic is implemented is updated (the zip used inside the
example-building function in this file).

ch_data = [entry for entry in dataset if entry["language"] == "zh"]
full_data = eng_data + ch_data

for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
Contributor

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check Python version requirements
echo "=== Python version requirements ==="
rg -n "python_requires|requires-python|python_version" -S -i
echo ""
echo "=== pyproject.toml content (relevant sections) ==="
if [ -f pyproject.toml ]; then
  head -50 pyproject.toml
fi



🏁 Script executed:

#!/bin/bash
# Check for Ruff configuration and B905 rule
echo "=== Looking for Ruff configuration ==="
find . -maxdepth 2 -type f \( -name "ruff.toml" -o -name ".ruff.toml" -o -name "pyproject.toml" \) -exec echo "File: {}" \; -exec cat {} \;



🏁 Script executed:

#!/bin/bash
# Verify the actual zip call and surrounding context
echo "=== Context around line 68 in prepare.py ==="
sed -n '60,75p' nemo_skills/dataset/physics/prepare.py



Reconsider the strict= parameter value for alignment with project philosophy.

Ruff B905 requires explicit strict= parameter in zip(). However, the suggested fix of strict=False contradicts the project's coding guidelines about explicit error handling and letting errors surface rather than silently failing. Since both iterables are guaranteed to have exactly 3 elements each, use strict=True instead to enforce that expectation:

-    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 68-68: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
In `@nemo_skills/dataset/physics/prepare.py` at line 68, Change the zip call to
enforce that the two iterables have identical lengths by adding strict=True to
the zip invocation used in the loop over eng_data, ch_data, full_data and
["test", "zh", "en_zh"], i.e., update the for loop that binds split_data and
split_name so zip(..., strict=True) is used instead of a plain zip to ensure
mismatched lengths raise an error.

Collaborator

Source of the prompt?

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments


Comment on lines +18 to +19
Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judge-
ment:
Contributor

Incomplete prompt: the final token is broken across a line break as "Judge-ment:" instead of reading "Judgement:"

Suggested change

-Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judge-
-ment:
+Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}. Judgement:

DATASET_GROUP = "math"
METRICS_TYPE = "physics" # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
EVAL_SPLIT = "test"
Contributor

EVAL_SPLIT = "test" creates naming confusion - per prepare.py:68, test contains only EN examples (1000), but doc table (line 15) says default is "EN" with 1,000 examples, suggesting alignment. However, the file labeling is confusing: test.jsonl = EN-only, zh.jsonl = ZH-only, en_zh.jsonl = combined. Consider renaming test to en for clarity or update docs to explicitly state that "test" = "EN-only split"


Collaborator

@jiacheng-xu jiacheng-xu left a comment

lgtm

@@ -1,214 +1,92 @@
-# Scientific knowledge
+# Scientific Knowledge
Collaborator

please fix these issues reported by mkdocs

INFO    -  Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#hle'.
INFO    -  Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#scicode'.
INFO    -  Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#gpqa'.
INFO    -  Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#hle'.
INFO    -  Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#scicode'.
INFO    -  Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md'
           does not contain an anchor '#gpqa'.

Collaborator

also the table is a bit too wide - have to scroll through. Maybe we can reorganize to reduce number of columns? E.g. link can just be fused into the first column. And if we also remove images (can just add a footnote maybe for hle), then it's going to fit

Collaborator Author

Fixed these. For the images column: we plan to add multimodal data soon, that's why it's there.

"++parse_reasoning=True "
'\'++end_reasoning_string="<|start|>assistant<|channel|>final<|message|>"\''
"++inference.temperature=1.0 ++inference.top_p=1.0 "
"++inference.tokens_to_generate=131072 ++inference.extra_body.skip_special_tokens=false "
Collaborator

do we need ++inference.extra_body.skip_special_tokens=false ?

| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
| **FrontierScience** | 100 | Short-answer | Physics, Chemistry, Biology | No | all | [HF](https://huggingface.co/datasets/openai/frontierscience) |
| **Physics** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN | [HF](https://huggingface.co/datasets/desimfj/PHYSICS) |
| **MMLU** | 14,042 | MCQ (4) | Multiple Subjects | No | test | [HF](https://huggingface.co/datasets/cais/mmlu) |
Collaborator

the table seems to have fewer datasets than the original docs. E.g. mmlu-redux is missing? Also the scicode section had some useful extra details which would be good to keep?

Collaborator Author

Added mmlu-redux; the SciCode note was about gpt-oss, which is now removed.

server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
benchmarks="gpqa:4",
output_dir="/workspace/Nano_V3_evals"
Collaborator

Can we add expected results to all of these commands? You can use mkdocs dropdowns / tabs to make it use fewer space, e.g. can have a toggle per benchmark / evaluation mode or something. But having reference numbers is useful

user: |-
Below is an open-ended problem in Physics. Please answer this problem adhering to the following rules:
1. Please use LaTeX format to represent the variables and formulas used in the solution process and results.
2. Please put the final answer(s) in \\boxed{{}}, note that the unit of the answer should not be included in \\boxed{{}}.
Collaborator

You most likely want this to be \boxed, not \\boxed. With the |- syntax, YAML doesn't need \ escaping, so this will render as \\.
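The escaping behavior is easy to verify without parsing any YAML: a `|-` block scalar copies backslashes through literally, so the rendered prompt keeps both characters. A small illustrative check in plain Python (not the repo's code):

```python
# In a YAML |- block scalar, backslashes are literal characters, so the
# file's "\\boxed{{}}" reaches the model with two backslashes.
as_written = r"\\boxed{{}}"   # what the YAML source currently contains
intended = r"\boxed{{}}"      # what the LaTeX instruction should say

print(as_written)  # \\boxed{{}}
print(intended)    # \boxed{{}}
assert as_written == "\\" + intended  # exactly one extra backslash up front
```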

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment


| **[MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| **[SciCode](https://huggingface.co/datasets/SciCode1/SciCode)** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val |
| **[FrontierScience](https://huggingface.co/datasets/openai/frontierscience)** | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| **[Physics](https://huggingface.co/datasets/desimfj/PHYSICS)** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
Contributor

Documentation table states default is "EN", but __init__.py:19 uses EVAL_SPLIT = "test" which maps to EN-only split per prepare.py:68. While technically aligned (both refer to 1,000 EN examples), consider clarifying by either updating table from "EN" to "test" for consistency with code, or renaming test.jsonl to en.jsonl in prepare.py:68 and updating EVAL_SPLIT = "en" for better semantic clarity. Current naming creates confusion since test typically implies the full test set, not a language-specific subset.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
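The naming the bot suggests can be made explicit at preparation time. A minimal sketch, with split names chosen for clarity; the `split_by_language` helper and the inline data are hypothetical stand-ins, not the PR's actual `prepare.py`:

```python
# Hedged sketch: name language splits explicitly instead of reusing "test"
# for the English-only subset. Entries are dicts with a "language" field,
# as in the HF PHYSICS dataset.

def split_by_language(dataset):
    eng = [e for e in dataset if e["language"] == "en"]
    zh = [e for e in dataset if e["language"] == "zh"]
    # Explicit names avoid "test" silently meaning "English only".
    return {"en": eng, "zh": zh, "en_zh": eng + zh}

data = [
    {"language": "en", "problem": "p1"},
    {"language": "zh", "problem": "p2"},
]
splits = split_by_language(data)
print(sorted(splits))        # ['en', 'en_zh', 'zh']
print(len(splits["en_zh"]))  # 2
```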

@gnalbandyan gnalbandyan requested a review from Kipok February 9, 2026 14:30
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment


2. Mathematical Problems: If the formats differ but the answers are mathematically equivalent, respond with [Correct].
3. Explicit Options: If the question provides explicit candidate answers, the output will be considered correct if it clearly indicates the correct option’s code or the correct option’s content.
4. No Explicit Options: If the question does not provide explicit options, the output must align with the correct answer in content and meaning to be considered [Correct].
Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
Contributor

Missing space after "Judgement:" - judge will append its response directly without separation

Suggested change (the only visible difference is a trailing space after "Judgement:")
Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:

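To see why the trailing space matters: a judge that appends its verdict verbatim fuses it onto the label when the prompt ends flush with "Judgement:". A minimal sketch (hypothetical template, mirroring but not copied from the PR's YAML):

```python
# Sketch: a completion-style judge appends its answer directly after the
# prompt, so a missing trailing space glues the verdict to the label.
template = ("Question: {problem}, Output sentence: {generation}, "
            "Correct answer: {expected_answer}, Judgement:")

prompt = template.format(problem="q", generation="g", expected_answer="a")
reply = "[Correct]"

print(prompt + reply)        # ...Judgement:[Correct]  (fused)
print(prompt + " " + reply)  # ...Judgement: [Correct] (separated)
```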



class PhysicsMetrics(MathMetrics):
def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
Collaborator

you can skip the init if it's identical to parent class

Collaborator Author

@Kipok the defaults are different. I could remove it and use a partial in map_metrics.py, but having the defaults here looks more explicit and straightforward to me.

Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Contributor

@greptile-apps greptile-apps bot left a comment

10 files reviewed, 4 comments


Comment on lines +25 to +27
def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
super().__init__(compute_no_answer=compute_no_answer)
self.answer_key = answer_key
Contributor

Incorrect super().__init__ args
PhysicsMetrics.__init__ accepts answer_key but doesn’t pass it to MathMetrics.__init__, so MathMetrics.question_key/answer_key stay at their defaults (problem/predicted_answer). This will break evaluation when predictions use the expected generation key (e.g., pass@k/majority@k will look up predicted_answer and raise KeyError). Pass answer_key (and any non-default question_key if needed) through to super().__init__ instead of only setting self.answer_key.
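A minimal sketch of the fix the bot describes, forwarding answer_key to the parent instead of only setting the attribute locally. The class bodies here are simplified stand-ins, not the repository's real implementations:

```python
# Hedged sketch: pass answer_key through to the parent __init__ so any
# base-class logic that reads self.answer_key sees the physics default.

class MathMetrics:
    def __init__(self, compute_no_answer: bool = True, answer_key: str = "predicted_answer"):
        self.compute_no_answer = compute_no_answer
        self.answer_key = answer_key

class PhysicsMetrics(MathMetrics):
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
        # Forward both arguments; don't re-set answer_key after the fact.
        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)

m = PhysicsMetrics()
print(m.answer_key)         # generation
print(m.compute_no_answer)  # False
```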

Comment on lines +15 to +19
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics" # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
EVAL_SPLIT = "test"
Contributor

Wrong dataset group/type
DATASET_GROUP = "math" and GENERATION_ARGS sets ++eval_type=math, but this PR introduces a physics-specific prompt/metrics. Using the math group/type here can route PHYSICS runs through the wrong dataset category/config defaults and can select the wrong evaluation pipeline settings.

If this benchmark is meant to show up under scientific knowledge (per docs) and be evaluated with the physics metrics, the dataset metadata should be consistent with that (group + eval_type).

Comment on lines +3 to +10
user: |-
You are a diligent and precise assistant tasked with evaluating the correctness of responses. You will receive a question, an output sentence, and the correct answer. Your task is to determine if the output sentence accurately answers the question based on the provided correct answer. Respond with either [Correct] or [Incorrect].
Special considerations:
1. Multiple Answers: If the output contains multiple answers, evaluate whether later answers modify or correct earlier ones. In such cases, compare the final answer with the correct answer. If the final answer is unclear or incorrect, respond with [Incorrect].
2. Mathematical Problems: If the formats differ but the answers are mathematically equivalent, respond with [Correct].
3. Explicit Options: If the question provides explicit candidate answers, the output will be considered correct if it clearly indicates the correct option’s code or the correct option’s content.
4. No Explicit Options: If the question does not provide explicit options, the output must align with the correct answer in content and meaning to be considered [Correct].
Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
Contributor

Judge prompt unparseable
The prompt ends with a stray "] after Judgement:. This will be included in the model input and makes the expected output format ambiguous; it also looks like an accidental truncation/quoting artifact. Remove the extra characters so the judge sees a clean instruction ending at Judgement: (with an appropriate trailing space/newline).

Comment on lines +3 to +7
user: |-
Below is an open-ended problem in Physics. Please answer this problem adhering to the following rules:
1. Please use LaTeX format to represent the variables and formulas used in the solution process and results.
2. Please put the final answer(s) in \boxed{{}}, note that the unit of the answer should not be included in \boxed{{}}.
3. If there are multiple final answers, please seperated them by commas in \boxed{{}}, e.g., \boxed{{answer 1, answer 2}}.
Contributor

Prompt typo affects instruction
Rule 3 says “please seperated them by commas” (typo). This gets copied into the model’s instructions and can hurt prompt clarity / consistency across benchmarks. Fix to “separate”.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 77-82: The multiline string passed into ctx=wrap_arguments(...)
contains an unintended blank line (a stray newline) between the
"++parse_reasoning=True " and "++tool_modules=..." lines which triggers the
markdownlint indented-code-block warning; edit the argument to wrap_arguments
(the ctx=wrap_arguments(...) call) and remove the blank line so the
configuration lines are contiguous within the string (no extra empty line),
preserving existing spacing and quotes.

In `@nemo_skills/evaluation/metrics/physics_metrics.py`:
- Around line 29-39: The return type hint for is_correct_judgement is incorrect
because it can return None when return_none=True or the judgement format is
unrecognized; update the signature of is_correct_judgement to reflect
Optional[bool] (or Union[bool, None]) and add the necessary typing import (e.g.,
Optional) to the module so the annotation matches behavior and aligns with
utils.is_correct_judgement.
🧹 Nitpick comments (3)
nemo_skills/evaluation/metrics/math_metrics.py (1)

84-86: Signature mismatch with subclass override.

PhysicsMetrics.is_correct_judgement adds a return_none parameter that this base method doesn't accept. While not breaking today (callers don't pass return_none), this inconsistency could cause TypeError if someone calls the method polymorphically with return_none=True on a MathMetrics instance. Consider adding return_none: bool = False here too for a consistent interface.

Suggested fix
-    def is_correct_judgement(self, judgement: str) -> bool:
+    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool:
         """Check if the judgement is correct."""
-        return is_correct_judgement(judgement)
+        return is_correct_judgement(judgement, return_none=return_none)
nemo_skills/evaluation/metrics/physics_metrics.py (1)

25-27: Pass answer_key through to super().__init__ instead of overriding after the fact.

MathMetrics.__init__ already accepts answer_key. Passing it through avoids the redundant set-then-override pattern:

Suggested fix
     def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
-        super().__init__(compute_no_answer=compute_no_answer)
-        self.answer_key = answer_key
+        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)
docs/index.md (1)

20-20: Consider mentioning physics in the example list and adding anchor links for consistency.

This PR adds the Physics benchmark, but the example list here says "hle, scicode, gpqa" without mentioning physics. Also, every other category line links individual benchmarks to their doc anchors, while this one uses plain text.

Comment on lines +77 to 82
"++inference.temperature=0.6 ++inference.top_p=0.95 "
"++inference.tokens_to_generate=131072 "
"++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
"++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "

),
Contributor

⚠️ Potential issue | 🟡 Minor

Stray blank line inside function call.

Line 81 is blank inside the ctx=wrap_arguments(...) string, which causes the markdownlint indented-code-block warning and looks unintentional in the example.

Suggested fix
         "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
         "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
-
     ),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"++inference.temperature=0.6 ++inference.top_p=0.95 "
"++inference.tokens_to_generate=131072 "
"++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
"++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
),
"++inference.temperature=0.6 ++inference.top_p=0.95 "
"++inference.tokens_to_generate=131072 "
"++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
"++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
),
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 79-79: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

🤖 Prompt for AI Agents
In `@docs/evaluation/scientific-knowledge.md` around lines 77 - 82, The multiline
string passed into ctx=wrap_arguments(...) contains an unintended blank line (a
stray newline) between the "++parse_reasoning=True " and "++tool_modules=..."
lines which triggers the markdownlint indented-code-block warning; edit the
argument to wrap_arguments (the ctx=wrap_arguments(...) call) and remove the
blank line so the configuration lines are contiguous within the string (no extra
empty line), preserving existing spacing and quotes.

Comment on lines +29 to +39
def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool:
"""Parse physics judgement that returns [Correct] or [Incorrect]."""
if judgement:
# Look for [Correct] or [Incorrect] patterns (case insensitive)
if re.search(r"\[correct\]", judgement, re.IGNORECASE):
return True
elif re.search(r"\[incorrect\]", judgement, re.IGNORECASE):
return False

# improper judgement format, so have to judge as false
return None if return_none else False
Contributor

⚠️ Potential issue | 🟡 Minor

Return type hint -> bool is inaccurate — method can return None.

When return_none=True and the judgement format is unrecognized, this returns None. The hint should reflect that:

-    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool:
+    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool | None:

Note: the same inaccuracy exists in utils.is_correct_judgement (which uses Union[bool, None] correctly in its signature), so this would bring them into alignment.

🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/metrics/physics_metrics.py` around lines 29 - 39, The
return type hint for is_correct_judgement is incorrect because it can return
None when return_none=True or the judgement format is unrecognized; update the
signature of is_correct_judgement to reflect Optional[bool] (or Union[bool,
None]) and add the necessary typing import (e.g., Optional) to the module so the
annotation matches behavior and aligns with utils.is_correct_judgement.
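Putting the review points together, a standalone sketch of the parser with an accurate return annotation. This mirrors the quoted method's behavior but is not the repository file itself:

```python
import re
from typing import Optional

def is_correct_judgement(judgement: str, return_none: bool = False) -> Optional[bool]:
    """Parse a judge reply containing [Correct] or [Incorrect] (case-insensitive)."""
    if judgement:
        if re.search(r"\[correct\]", judgement, re.IGNORECASE):
            return True
        if re.search(r"\[incorrect\]", judgement, re.IGNORECASE):
            return False
    # Unrecognized format: None signals "no verdict" only when requested.
    return None if return_none else False

print(is_correct_judgement("Final verdict: [Correct]"))           # True
print(is_correct_judgement("[incorrect], wrong units"))           # False
print(is_correct_judgement("no verdict here", return_none=True))  # None
```

Note that the `[correct]` pattern cannot false-match inside `[incorrect]` because the literal opening bracket must immediately precede "correct".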

@gnalbandyan gnalbandyan merged commit ef0a890 into main Feb 11, 2026
5 checks passed
@gnalbandyan gnalbandyan deleted the gnalbandyan/add_physics branch February 11, 2026 08:03
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

    Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>