Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```python
dataset = load_dataset("desimfj/PHYSICS")["test"]
eng_data = [entry for entry in dataset if entry["language"] == "en"]
ch_data = [entry for entry in dataset if entry["language"] == "zh"]
full_data = eng_data + ch_data

for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
    save_data(split_data, split_name)
```
**EN/ZH split filenames swapped**

In the final loop, `zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"])` writes English examples to `test.jsonl`, Chinese examples to `zh.jsonl`, and the combined set to `en_zh.jsonl`. That makes `test` effectively EN-only, which contradicts the naming in the docs/config (the EN default is called `test`). If `test` is intended to be the full test split, this is wrong; if it is intended to be EN-only, rename `test` to `en` (or update the dataset defaults/docs) to avoid consumers accidentally evaluating the wrong language split.
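One way to make the naming unambiguous — a minimal sketch, assuming `save_data(data, name)` writes `<name>.jsonl` as in `prepare.py` (entries and the `save_data` stub here are placeholders for illustration):

```python
# Hypothetical rename: key each split by language so "test" cannot be
# mistaken for the full test set. save_data is stubbed for illustration.
def save_data(split_data, split_name):
    print(f"{split_name}.jsonl: {len(split_data)} entries")

eng_data = [{"language": "en"}]   # placeholder entries
ch_data = [{"language": "zh"}]
full_data = eng_data + ch_data

splits = {"en": eng_data, "zh": ch_data, "en_zh": full_data}
for split_name, split_data in splits.items():
    save_data(split_data, split_name)
```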
📝 Walkthrough

Adds a new Physics dataset package (data prep and config), physics-specific metrics and judge/prompt configs, and updates scientific-knowledge docs to a compact dataset table and revised evaluation examples.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Prep as DataPrep
    participant Storage as Dataset (JSONL)
    participant Evaluator as PhysicsMetrics
    participant Judge as JudgePrompt
    participant Model as JudgeModel
    User->>Prep: load_dataset(DESIMFJ Physics)
    Prep->>Prep: strip_boxed / process_answer / format_entry
    Prep->>Storage: write JSONL splits (en, zh, en_zh)
    User->>Evaluator: submit prediction(s)
    Evaluator->>Judge: craft judge prompt (problem, generation, expected_answer)
    Judge->>Model: send prompt to judge model/server
    Model-->>Judge: return judgement ([Correct]/[Incorrect])
    Judge-->>Evaluator: judgement text
    Evaluator-->>User: score dict (judge_correct, metrics)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 7
🤖 Fix all issues with AI agents
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 10-13: Remove the invalid HTML break tags in the markdown table
and following code block by replacing all occurrences of "<br>" and "</br>" with
valid markdown-friendly breaks (e.g., use "<br/>" if HTML breaks are required,
or convert to plain newlines/Markdown line breaks) in the rows containing
"**GPQA**", "**SciCode**" and the surrounding table lines (also fix the similar
instances around lines 40-40 referenced). Ensure table cells remain properly
formatted after the change.
- Around line 5-17: Add a new "Physics benchmark" subsection below the "Dataset
Overview" table that targets the "Physics" row: include the example evaluation
command (CLI or script) to run the benchmark, the expected baseline
results/metrics to compare against, model-testing details (prompt format,
scoring/judging rules and any automated judge used), and dataset-specific notes
describing the EN/ZH splits and how to select the EN split for evaluation;
reference the "Physics" dataset name from the table and ensure the subsection
succinctly documents command, expected results, model testing, and
dataset-specific setup (EN/ZH selection and judge configuration).
In `@nemo_skills/dataset/physics/__init__.py`:
- Around line 15-18: The inline comment next to METRICS_TYPE is incorrect:
update the comment to reflect that METRICS_TYPE = "physics" uses the
PhysicsMetrics class (not MathMetrics) and still sets compute_no_answer=False;
modify the comment on the METRICS_TYPE line accordingly and ensure surrounding
constants DATASET_GROUP, METRICS_TYPE, and GENERATION_ARGS remain unchanged.
In `@nemo_skills/dataset/physics/prepare.py`:
- Line 68: Change the zip call to enforce that the two iterables have identical
lengths by adding strict=True to the zip invocation used in the loop over
eng_data, ch_data, full_data and ["test", "zh", "en_zh"], i.e., update the for
loop that binds split_data and split_name so zip(..., strict=True) is used
instead of a plain zip to ensure mismatched lengths raise an error.
- Around line 29-32: Add strict=True to the zip(...) invocation used to pair the
parallel lists in this module — locate the zip call in this file (the one that
pairs items when building examples, adjacent to process_answer) and change
zip(a, b) to zip(a, b, strict=True) so mismatched lengths raise immediately;
ensure the call site where the pairing logic is implemented is updated (the zip
used inside the example-building function in this file).
In `@nemo_skills/prompt/config/generic/physics.yaml`:
- Around line 2-5: The YAML prompt contains a typo in the third rule: change the
word "seperated" to the correct spelling "separate" in the instruction that
reads "If there are multiple final answers, please seperated them by commas in
\\boxed{{}}"; update that sentence so it reads "If there are multiple final
answers, please separate them by commas in \\boxed{{}}", keeping the surrounding
LaTeX guidance and formatting intact (file:
nemo_skills/prompt/config/generic/physics.yaml; locate the rule text containing
"seperated").
In `@nemo_skills/prompt/config/judge/physics.yaml`:
- Around line 16-17: The prompt string fragment "Question: {problem}, Output
sentence: {generation}, Correct answer: {expected_answer}, Judge- ment:" is
broken by a hyphenated line break; fix it by merging the split token into a
single word and line so it reads "Judgement:" (i.e., replace "Judge- ment:" with
"Judgement:") ensuring the entire prompt line is contiguous: "Question:
{problem}, Output sentence: {generation}, Correct answer: {expected_answer},
Judgement:".
```markdown
## Dataset Overview

### hle

- Benchmark is defined in [`nemo_skills/dataset/hle/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/hle/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/cais/hle).
- The `text` split includes all non-image examples. It is further divided into `eng`, `chem`, `bio`, `cs`, `phy`, `math`, `human`, `other`. Currently, **all** of these splits contain only text data.

### SimpleQA

- Benchmark is defined in [`nemo_skills/dataset/simpleqa/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/simpleqa/__init__.py)
- Original benchmark source code for SimpleQA (OpenAI) is [here](https://github.com/openai/simple-evals/) and the leaderboard is [here](https://www.kaggle.com/benchmarks/openai/simpleqa). An improved version with 1,000 examples from Google, SimpleQA-verified, is [here](https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified).
- To use SimpleQA-verified, set `split=verified`. To use the original version of SimpleQA, please set `split=test`.

In the below configurations, we also use `gpt-oss-120b` as the judge model.

#### Configuration: `gpt-oss-120b` with builtin tool (python)

| Dataset | Questions | Types | Domain | Images? | NS default | Link |
|:---|:---:|:---:|:---|:---:|:---:|:---:|
| **HLE** | 2500 | Open ended, MCQ | Engineering, Physics, Chemistry, Bio, etc. | Yes | text only | [HF](https://huggingface.co/datasets/cais/hle) |
| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
| **SuperGPQA** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test | [HF](https://huggingface.co/datasets/m-a-p/SuperGPQA) |
| **MMLU-Pro** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test | [HF](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) |
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
| **FrontierScience** | 100 | Short-answer | Physics, Chemistry, Biology | No | all | [HF](https://huggingface.co/datasets/openai/frontierscience) |
| **Physics** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN | [HF](https://huggingface.co/datasets/desimfj/PHYSICS) |
| **MMLU** | 14,042 | MCQ (4) | Multiple Subjects | No | test | [HF](https://huggingface.co/datasets/cais/mmlu) |
| **SimpleQa** | 4,326 (test), 1,000 (verified) | Open ended | Factuality, Parametric knowledge | No | verified | [HF](https://github.com/openai/simple-evals/) |
```
Add Physics benchmark details (command, expected results, model testing, dataset-specific notes).
The table introduces Physics but there’s no physics-specific example command, expected results, or dataset notes (e.g., EN/ZH splits, judge setup). Please add a short subsection covering these items.
As per coding guidelines: When adding new benchmarks, add documentation with example commands, expected results, model testing details, and dataset-specific information.
```markdown
| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
| **SuperGPQA** | 26,529 | MCQ (≤ 10) | Science, Eng, Humanities, etc. | No | test | [HF](https://huggingface.co/datasets/m-a-p/SuperGPQA) |
| **MMLU-Pro** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test | [HF](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) |
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
```
**Fix invalid `<br>` tags in the table and after the code block.**

🛠️ Suggested fix

```diff
-| **GPQA** | 448 (main)<br>198 (diamond)</br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
+| **GPQA** | 448 (main)<br>198 (diamond)<br>546 (ext.) | MCQ (4) | Physics, Chemistry, Biology | No | diamond | [HF](https://huggingface.co/datasets/Idavidrein/gpqa) |
-| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
+| **SciCode** | 80<br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
-</br>
+<br>
```
Also applies to: 40-40
```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
```
Comment mismatch: METRICS_TYPE uses PhysicsMetrics, not MathMetrics.
🛠️ Suggested fix

```diff
-METRICS_TYPE = "physics" # This uses the MathMetrics class, but with compute_no_answer=False
+METRICS_TYPE = "physics" # Uses PhysicsMetrics (compute_no_answer defaults to False)
```
```python
def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}"""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"
```
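For reference, a self-contained sketch of the flattening this excerpt performs; `strip_boxed` below is a simplified stand-in for the real helper in `prepare.py`:

```python
def strip_boxed(s: str) -> str:
    # simplified stand-in: unwrap a single \boxed{...} wrapper if present
    prefix, suffix = "\\boxed{", "}"
    if s.startswith(prefix) and s.endswith(suffix):
        return s[len(prefix):-len(suffix)]
    return s

def process_answer(answer):
    """Flatten all answers and wrap in a single \\boxed{}"""
    all_answers = [strip_boxed(item) for sublist in answer for item in sublist]
    return f"\\boxed{{{', '.join(all_answers)}}}"

# Schema is List[List[str]]: one inner list per sub-question.
print(process_answer([["\\boxed{-60}"], ["\\boxed{4}"]]))  # \boxed{-60, 4}
```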
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, let's view the full prepare.py file to understand the context
find . -type f -name "prepare.py" -path "*/physics/*" | head -5

# List files in the physics dataset directory
fd "prepare.py" --path "*physics*"

# Search for the exact file
find . -type f -name "prepare.py" | grep physics
```
🌐 Web query: HuggingFace dataset desimfj/PHYSICS schema answer field format

💡 Result: In desimfj/PHYSICS, the `answer` field is a nested list: `List[List[str]]` (the HF viewer shows it as a sequence of sequences). Each outer list element corresponds to one sub-question, and the inner list contains one or more accepted answer strings (typically LaTeX, often wrapped with `\\boxed{...}`).[1]

Example patterns from the dataset:[1]

- Single sub-question: `answer = [["\\boxed{3}"]]`
- Two sub-questions: `answer = [["\\boxed{-60}"], ["\\boxed{4}"]]`
- MCQ: `answer = [["\\boxed{B}"]]`
🏁 Scripts executed:

```shell
# Search for references to this dataset or answer processing
rg "desimfj/PHYSICS" --type py

# Look for load_dataset calls related to physics
rg "load_dataset.*physics" --type py -i

# Read the full prepare.py file to see the context around line 68
cat -n ./nemo_skills/dataset/physics/prepare.py

# Check for Python version requirements
find . -name "setup.py" -o -name "pyproject.toml" -o -name ".python-version" | xargs cat 2>/dev/null | head -50

# Also check for requirements files
find . -name "requirements*.txt" | xargs head -5 2>/dev/null | head -30

# Search for python_requires in setup files
rg "python_requires|python =" --type py --type toml | head -10
```
🏁 Script executed:

```shell
# Verify zip behavior and confirm both lists have the same length
python3 << 'EOF'
# Simulating the actual data
eng_data = [1, 2, 3]  # placeholder for actual data
ch_data = [1, 2, 3]
full_data = [1, 2, 3]
split_names = ["test", "zh", "en_zh"]
split_datas = [eng_data, ch_data, full_data]

# Check lengths match
print(f"Length of split_datas: {len(split_datas)}")
print(f"Length of split_names: {len(split_names)}")
print(f"Balanced: {len(split_datas) == len(split_names)}")

# Test that zip(strict=True) would work (Python 3.10+)
result = list(zip(split_datas, split_names, strict=True))
print(f"zip(strict=True) works: {len(result)} pairs")
EOF
```
**Add `strict=True` to the `zip()` call at line 68 for safety.**

The nested comprehension in `process_answer()` is correct—the dataset schema confirms `answer` is `List[List[str]]`, and the code handles it properly with no risk of data corruption. However, line 68 should use `zip(..., strict=True)` to catch potential future bugs if the list lengths ever become unequal.

Fix for line 68:

```diff
-    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
```
```python
ch_data = [entry for entry in dataset if entry["language"] == "zh"]
full_data = eng_data + ch_data

for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# Check Python version requirements
rg -n "python_requires|requires-python|python_version" -S -i
head -50 pyproject.toml

# Check for Ruff configuration and the B905 rule
find . -maxdepth 2 -type f \( -name "ruff.toml" -o -name ".ruff.toml" -o -name "pyproject.toml" \) -exec echo "File: {}" \; -exec cat {} \;

# Verify the actual zip call and surrounding context
sed -n '60,75p' nemo_skills/dataset/physics/prepare.py
```
**Reconsider the `strict=` parameter value for alignment with project philosophy.**

Ruff B905 requires an explicit `strict=` parameter in `zip()`. However, the suggested fix of `strict=False` contradicts the project's coding guidelines about explicit error handling and letting errors surface rather than silently failing. Since both iterables are guaranteed to have exactly 3 elements each, use `strict=True` instead to enforce that expectation:

```diff
-    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"]):
+    for split_data, split_name in zip([eng_data, ch_data, full_data], ["test", "zh", "en_zh"], strict=True):
```
🧰 Tools: 🪛 Ruff (0.14.14)

> [warning] 68-68: `zip()` without an explicit `strict=` parameter. Add an explicit value for parameter `strict=` (B905)
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```
Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judge-
ment:
```
Incomplete prompt: the text is cut off at "Judge-ment:" instead of a proper instruction.

```diff
-Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judge-
-ment:
+Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}. Judgement:
```
```python
DATASET_GROUP = "math"
METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
EVAL_SPLIT = "test"
```
EVAL_SPLIT = "test" creates naming confusion - per prepare.py:68, test contains only EN examples (1000), but doc table (line 15) says default is "EN" with 1,000 examples, suggesting alignment. However, the file labeling is confusing: test.jsonl = EN-only, zh.jsonl = ZH-only, en_zh.jsonl = combined. Consider renaming test to en for clarity or update docs to explicitly state that "test" = "EN-only split"
```diff
@@ -1,214 +1,92 @@
-# Scientific knowledge
+# Scientific Knowledge
```
please fix these issues reported by mkdocs

```
INFO - Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#hle'.
INFO - Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#scicode'.
INFO - Doc file 'index.md' contains a link './evaluation/scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#gpqa'.
INFO - Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#hle', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#hle'.
INFO - Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#scicode', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#scicode'.
INFO - Doc file 'evaluation/index.md' contains a link './scientific-knowledge.md#gpqa', but the doc 'evaluation/scientific-knowledge.md' does not contain an anchor '#gpqa'.
```
also the table is a bit too wide - have to scroll through. Maybe we can reorganize to reduce number of columns? E.g. link can just be fused into the first column. And if we also remove images (can just add a footnote maybe for hle), then it's going to fit
Fixed these. For the images column: we plan to add multimodal data soon, that's why it's there.
```python
"++parse_reasoning=True "
'\'++end_reasoning_string="<|start|>assistant<|channel|>final<|message|>"\''
"++inference.temperature=1.0 ++inference.top_p=1.0 "
"++inference.tokens_to_generate=131072 ++inference.extra_body.skip_special_tokens=false "
```
do we need ++inference.extra_body.skip_special_tokens=false ?
```markdown
| **SciCode** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val | [HF](https://huggingface.co/datasets/SciCode1/SciCode) |
| **FrontierScience** | 100 | Short-answer | Physics, Chemistry, Biology | No | all | [HF](https://huggingface.co/datasets/openai/frontierscience) |
| **Physics** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN | [HF](https://huggingface.co/datasets/desimfj/PHYSICS) |
| **MMLU** | 14,042 | MCQ (4) | Multiple Subjects | No | test | [HF](https://huggingface.co/datasets/cais/mmlu) |
```
the table seems to have fewer datasets than in the original docs. E.g. mmlu-redux is missing? Also scicode section had some useful extra details which are good to keep?
Added mmlu-redux; the SciCode note was about gpt-oss, which is now removed.
```python
server_args="--no-enable-prefix-caching --mamba_ssm_cache_dtype float32",
model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
benchmarks="gpqa:4",
output_dir="/workspace/Nano_V3_evals"
```
Can we add expected results to all of these commands? You can use mkdocs dropdowns / tabs to make it use fewer space, e.g. can have a toggle per benchmark / evaluation mode or something. But having reference numbers is useful
```yaml
user: |-
  Below is an open-ended problem in Physics. Please answer this problem adhering to the following rules:
  1. Please use LaTeX format to represent the variables and formulas used in the solution process and results.
  2. Please put the final answer(s) in \\boxed{{}}, note that the unit of the answer should not be included in \\boxed{{}}.
```
you most likely want this to be \boxed, not \\boxed. With |- syntax yaml doesn't need \ escaping, so this will render as \\
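A small check of that escaping point, simulated with plain Python strings (no YAML parser needed), under the assumption that the template is later passed through `str.format` — which is why the config doubles the braces:

```python
# In a YAML |- literal block, backslashes pass through untouched, so
# "\\boxed{{}}" in the file reaches Python as two backslash characters.
# After str.format collapses {{}} to {}, the model would see \\boxed{}.
yaml_literal = r"\\boxed{{}}"   # what the config currently contains
intended = r"\boxed{}"          # what the prompt should contain

rendered = yaml_literal.format()
print(rendered)  # \\boxed{} — the doubled backslash leaks into the prompt
```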
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```markdown
| **[MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)** | 12,032 | MCQ (≤ 10) | Multiple subjects | No | test |
| **[SciCode](https://huggingface.co/datasets/SciCode1/SciCode)** | 80</br>(338 subtasks) | Code gen | Scientific computing | No | test+val |
| **[FrontierScience](https://huggingface.co/datasets/openai/frontierscience)** | 100 | Short-answer | Physics, Chemistry, Biology | No | all |
| **[Physics](https://huggingface.co/datasets/desimfj/PHYSICS)** | 1,000 (EN), 1,000 (ZH) | Open-ended | Physics | No | EN |
```
Documentation table states default is "EN", but __init__.py:19 uses EVAL_SPLIT = "test" which maps to EN-only split per prepare.py:68. While technically aligned (both refer to 1,000 EN examples), consider clarifying by either updating table from "EN" to "test" for consistency with code, or renaming test.jsonl to en.jsonl in prepare.py:68 and updating EVAL_SPLIT = "en" for better semantic clarity. Current naming creates confusion since test typically implies the full test set, not a language-specific subset.
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
```yaml
  2. Mathematical Problems: If the formats differ but the answers are mathematically equivalent, respond with [Correct].
  3. Explicit Options: If the question provides explicit candidate answers, the output will be considered correct if it clearly indicates the correct option's code or the correct option's content.
  4. No Explicit Options: If the question does not provide explicit options, the output must align with the correct answer in content and meaning to be considered [Correct].
  Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
```
Missing space after "Judgement:" - judge will append its response directly without separation
```diff
-Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
+Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement: 
```

(The only change is the trailing space after "Judgement:".)
```python
class PhysicsMetrics(MathMetrics):
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
```
you can skip the init if it's identical to parent class
@Kipok the defaults are different. I could remove it and use a partial in map_metrics.py, but it looks more explicit and straightforward to me having the defaults here.
Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
```python
def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
    super().__init__(compute_no_answer=compute_no_answer)
    self.answer_key = answer_key
```
Incorrect super().__init__ args
PhysicsMetrics.__init__ accepts answer_key but doesn’t pass it to MathMetrics.__init__, so MathMetrics.question_key/answer_key stay at their defaults (problem/predicted_answer). This will break evaluation when predictions use the expected generation key (e.g., pass@k/majority@k will look up predicted_answer and raise KeyError). Pass answer_key (and any non-default question_key if needed) through to super().__init__ instead of only setting self.answer_key.
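A minimal sketch of the suggested fix, with `MathMetrics` stubbed in here as a stand-in for the real nemo_skills class (only the relevant defaults are modeled):

```python
# Stand-in for the real MathMetrics, keeping only the relevant defaults.
class MathMetrics:
    def __init__(self, compute_no_answer=True, answer_key="predicted_answer"):
        self.compute_no_answer = compute_no_answer
        self.answer_key = answer_key

class PhysicsMetrics(MathMetrics):
    def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
        # forward answer_key so parent lookups use the right field
        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)

m = PhysicsMetrics()
print(m.answer_key)  # generation
```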
```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = "math"
METRICS_TYPE = "physics"  # This uses the MathMetrics class, but with compute_no_answer=False
GENERATION_ARGS = "++prompt_config=generic/physics ++eval_type=math"
EVAL_SPLIT = "test"
```
Wrong dataset group/type
DATASET_GROUP = "math" and GENERATION_ARGS sets ++eval_type=math, but this PR introduces a physics-specific prompt/metrics. Using the math group/type here can route PHYSICS runs through the wrong dataset category/config defaults and can select the wrong evaluation pipeline settings.
If this benchmark is meant to show up under scientific knowledge (per docs) and be evaluated with the physics metrics, the dataset metadata should be consistent with that (group + eval_type).
```yaml
user: |-
  You are a diligent and precise assistant tasked with evaluating the correctness of responses. You will receive a question, an output sentence, and the correct answer. Your task is to determine if the output sentence accurately answers the question based on the provided correct answer. Respond with either [Correct] or [Incorrect].
  Special considerations:
  1. Multiple Answers: If the output contains multiple answers, evaluate whether later answers modify or correct earlier ones. In such cases, compare the final answer with the correct answer. If the final answer is unclear or incorrect, respond with [Incorrect].
  2. Mathematical Problems: If the formats differ but the answers are mathematically equivalent, respond with [Correct].
  3. Explicit Options: If the question provides explicit candidate answers, the output will be considered correct if it clearly indicates the correct option's code or the correct option's content.
  4. No Explicit Options: If the question does not provide explicit options, the output must align with the correct answer in content and meaning to be considered [Correct].
  Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
```
Judge prompt unparseable
The prompt ends with a stray `"]` after `Judgement:`. This will be included in the model input and makes the expected output format ambiguous; it also looks like an accidental truncation/quoting artifact. Remove the extra characters so the judge sees a clean instruction ending at `Judgement:` (with an appropriate trailing space/newline).
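As a sketch, the cleaned-up tail of the judge prompt (assuming the surrounding YAML stays as quoted above) would simply end at `Judgement:` with nothing after it:

```yaml
user: |-
  ...
  Question: {problem}, Output sentence: {generation}, Correct answer: {expected_answer}, Judgement:
```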
| user: |- | ||
| Below is an open-ended problem in Physics. Please answer this problem adhering to the following rules: | ||
| 1. Please use LaTeX format to represent the variables and formulas used in the solution process and results. | ||
| 2. Please put the final answer(s) in \boxed{{}}, note that the unit of the answer should not be included in \boxed{{}}. | ||
| 3. If there are multiple final answers, please seperated them by commas in \boxed{{}}, e.g., \boxed{{answer 1, answer 2}}. |
Prompt typo affects instruction
Rule 3 says “please seperated them by commas” (typo). This gets copied into the model’s instructions and can hurt prompt clarity / consistency across benchmarks. Fix to “separate”.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/evaluation/scientific-knowledge.md`:
- Around line 77-82: The multiline string passed into ctx=wrap_arguments(...)
contains an unintended blank line (a stray newline) between the
"++parse_reasoning=True " and "++tool_modules=..." lines which triggers the
markdownlint indented-code-block warning; edit the argument to wrap_arguments
(the ctx=wrap_arguments(...) call) and remove the blank line so the
configuration lines are contiguous within the string (no extra empty line),
preserving existing spacing and quotes.
In `@nemo_skills/evaluation/metrics/physics_metrics.py`:
- Around line 29-39: The return type hint for is_correct_judgement is incorrect
because it can return None when return_none=True or the judgement format is
unrecognized; update the signature of is_correct_judgement to reflect
Optional[bool] (or Union[bool, None]) and add the necessary typing import (e.g.,
Optional) to the module so the annotation matches behavior and aligns with
utils.is_correct_judgement.
🧹 Nitpick comments (3)
nemo_skills/evaluation/metrics/math_metrics.py (1)
84-86: Signature mismatch with subclass override.
`PhysicsMetrics.is_correct_judgement` adds a `return_none` parameter that this base method doesn't accept. While not breaking today (callers don't pass `return_none`), this inconsistency could cause a `TypeError` if someone calls the method polymorphically with `return_none=True` on a `MathMetrics` instance. Consider adding `return_none: bool = False` here too for a consistent interface.

Suggested fix:

```diff
-    def is_correct_judgement(self, judgement: str) -> bool:
+    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool:
         """Check if the judgement is correct."""
-        return is_correct_judgement(judgement)
+        return is_correct_judgement(judgement, return_none=return_none)
```

nemo_skills/evaluation/metrics/physics_metrics.py (1)
25-27: Pass `answer_key` through to `super().__init__` instead of overriding after the fact.

`MathMetrics.__init__` already accepts `answer_key`. Passing it through avoids the redundant set-then-override pattern:

Suggested fix:

```diff
     def __init__(self, compute_no_answer: bool = False, answer_key: str = "generation"):
-        super().__init__(compute_no_answer=compute_no_answer)
-        self.answer_key = answer_key
+        super().__init__(compute_no_answer=compute_no_answer, answer_key=answer_key)
```

docs/index.md (1)
20-20: Consider mentioning `physics` in the example list and adding anchor links for consistency.

This PR adds the Physics benchmark, but the example list here says "hle, scicode, gpqa" without mentioning physics. Also, every other category line links individual benchmarks to their doc anchors, while this one uses plain text.
| "++inference.temperature=0.6 ++inference.top_p=0.95 " | ||
| "++inference.tokens_to_generate=131072 " | ||
| "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True " | ||
| "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] " | ||
| |
| ), |
Stray blank line inside function call.
Line 81 is blank inside the `ctx=wrap_arguments(...)` string, which causes the markdownlint indented-code-block warning and looks unintentional in the example.
Suggested fix:

```diff
             "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True "
             "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] "
-
         ),
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "++inference.temperature=0.6 ++inference.top_p=0.95 " | |
| "++inference.tokens_to_generate=131072 " | |
| "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True " | |
| "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] " | |
| ), | |
| "++inference.temperature=0.6 ++inference.top_p=0.95 " | |
| "++inference.tokens_to_generate=131072 " | |
| "++chat_template_kwargs.enable_thinking=true ++parse_reasoning=True " | |
| "++tool_modules=[nemo_skills.mcp.servers.python_tool::PythonTool] " | |
| ), |
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 79-79: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
🤖 Prompt for AI Agents
In `@docs/evaluation/scientific-knowledge.md` around lines 77 - 82, The multiline
string passed into ctx=wrap_arguments(...) contains an unintended blank line (a
stray newline) between the "++parse_reasoning=True " and "++tool_modules=..."
lines which triggers the markdownlint indented-code-block warning; edit the
argument to wrap_arguments (the ctx=wrap_arguments(...) call) and remove the
blank line so the configuration lines are contiguous within the string (no extra
empty line), preserving existing spacing and quotes.
| def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool: | ||
| """Parse physics judgement that returns [Correct] or [Incorrect].""" | ||
| if judgement: | ||
| # Look for [Correct] or [Incorrect] patterns (case insensitive) | ||
| if re.search(r"\[correct\]", judgement, re.IGNORECASE): | ||
| return True | ||
| elif re.search(r"\[incorrect\]", judgement, re.IGNORECASE): | ||
| return False | ||
| |
| # improper judgement format, so have to judge as false | ||
| return None if return_none else False |
Return type hint `-> bool` is inaccurate: the method can return `None`.

When `return_none=True` and the judgement format is unrecognized, this returns `None`. The hint should reflect that:
```diff
-    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool:
+    def is_correct_judgement(self, judgement: str, return_none: bool = False) -> bool | None:
```
Note: the same inaccuracy exists in `utils.is_correct_judgement` (which uses `Union[bool, None]` correctly in its signature), so this would bring them into alignment.
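For reference, a self-contained sketch of the parser with the corrected annotation, mirroring the snippet quoted above (`Optional[bool]` is equivalent to `bool | None`):

```python
import re
from typing import Optional


def is_correct_judgement(judgement: str, return_none: bool = False) -> Optional[bool]:
    """Parse a physics judgement that should contain [Correct] or [Incorrect]."""
    if judgement:
        # Look for [Correct] or [Incorrect] patterns (case insensitive).
        if re.search(r"\[correct\]", judgement, re.IGNORECASE):
            return True
        if re.search(r"\[incorrect\]", judgement, re.IGNORECASE):
            return False
    # Improper judgement format: optionally surface it as None instead of False.
    return None if return_none else False
```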
🤖 Prompt for AI Agents
In `@nemo_skills/evaluation/metrics/physics_metrics.py` around lines 29 - 39, The
return type hint for is_correct_judgement is incorrect because it can return
None when return_none=True or the judgement format is unrecognized; update the
signature of is_correct_judgement to reflect Optional[bool] (or Union[bool,
None]) and add the necessary typing import (e.g., Optional) to the module so the
annotation matches behavior and aligns with utils.is_correct_judgement.
- a5da597 Revert "Eval kit support (#1239)" (#1294) (Igor Gitman, Mar 6, 2026)
- b237e33 Eval kit support (#1239) (George, Mar 6, 2026)
- dc28bbf Python direct tool calling without MCP (#1286) (George Armstrong, Mar 5, 2026)
- 12454dd Allow het servers for nemo-rl jobs (#1223) (Sadegh Mahdavi, Mar 4, 2026)
- 8884a68 Support source_lang param for translation recipe (#1290) (Prasoon Varshney, Mar 4, 2026)
- 4618b19 Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285) (Meriem B., Mar 4, 2026)
- 5ac8609 Add SPEED-Bench (within repo) (#1279) (Talor Abramovich, Mar 4, 2026)
- c31eec5 Fix os.getlogin() crash in ns setup (#1289) (George Armstrong, Mar 3, 2026)
- c228e66 Fix streaming TypeError when delta.content is None (#1267) (#1288) (George Armstrong, Mar 3, 2026)
- aa47923 Add LibTrace recipe for generating domain-specific reasoning data (#1224) (Matvei Novikov, Mar 2, 2026)
- 313cad7 fix: clean parse-failure retries in prover (#1284) (Stephen Ge, Mar 2, 2026)
- 813cfa3 tst: rollback inference-api to integrate (#1287) (George Armstrong, Mar 2, 2026)
- 31735f9 Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250) (Valentin Mendelev, Mar 2, 2026)
- d4ef8c0 Update promt_config to working with openai format + inline setup (#1210) (George, Feb 27, 2026)
- e879cbc Update noc tutorial (#1282) (George Armstrong, Feb 27, 2026)
- f6e3505 Add noc reasoning tutorial (#1278) (George Armstrong, Feb 27, 2026)
- fc2072a CritPt generation add prompt_format=None (#1280) (Jiacheng Xu, Feb 27, 2026)
- c8abe5d New slurm customization parameters (account, containers) (#1209) (Igor Gitman, Feb 27, 2026)
- 2b38cce Add nemo-skills-core subpackage for lightweight installs (#1229) (George Armstrong, Feb 25, 2026)
- 9fa8e83 feat: add custom judge type support for external repo integration (#1274) (Dheeraj Peri, Feb 25, 2026)
- 8a32b13 Exclude numb3rs form test_eval.py (#1275) (Igor Gitman, Feb 24, 2026)
- 6da2219 Numb3rs ds addition (#1174) (George, Feb 23, 2026)
- ad034b5 Add DSBench-DA evaluation (#1254) (Suriya Gunasekar, Feb 22, 2026)
- 7593ab3 Add CritPt benchmark (#1200) (Jiacheng Xu, Feb 20, 2026)
- 58c31b2 Fix no_answer metric overcounting in _compute_pass_at_k (#1245) (Suriya Gunasekar, Feb 20, 2026)
- 1f1a2e7 Fix incorrect prompt tokens count due to HF api update (#1264) (Igor Gitman, Feb 20, 2026)
- 8ebc6f5 Remove deprecated dataset group (#1263) (Igor Gitman, Feb 20, 2026)
- ea4177f fix deps (#1258) (Yongqiang Wang, Feb 19, 2026)
- 60905a7 Add aime26 (#1256) (Minho Ryu, Feb 20, 2026)
- b28afc5 Rename custom -> external benchmarks (#1262) (Igor Gitman, Feb 19, 2026)
- 6cc9c45 Add reference to internal benchmarks repo (#1261) (Igor Gitman, Feb 19, 2026)
- 5202af6 Remove incorrect presence-penalty setting (#1259) (Igor Gitman, Feb 19, 2026)
- 144c70b Adding an option to store benchmarks in external repo (#1240) (Igor Gitman, Feb 19, 2026)
- 10e6e39 update vllm miltimodal for api calls convenience (#1213) (George, Feb 19, 2026)
- 1ba4219 Fix --server_container not being applied to dependent jobs (#1244) (Nick Ludwig, Feb 18, 2026)
- 9517614 Support mini-swe-agent as agent harness (#1212) (Wasi Ahmad, Feb 16, 2026)
- a3d44dc Add --installation_command support to prepare_data (#1243) (Suriya Gunasekar, Feb 13, 2026)
- e80d524 Fix CI disk space for Docker image builds (#1241) (George Armstrong, Feb 12, 2026)
- d22236c Fix answerbench prompt parsing (#1235) (Sadegh Mahdavi, Feb 11, 2026)
- 2401628 feat: add lockfiles for reproducible sandbox builds (#1233) (George Armstrong, Feb 11, 2026)
- 5a0a84d removing datasets version restriction for LCB eval (#1230) (Wasi Ahmad, Feb 11, 2026)
- ef0a890 Gnalbandyan/add physics (#1214) (gnalbandyan, Feb 11, 2026)
- bd9d30c LCB generic prompting (#1215) (Wasi Ahmad, Feb 10, 2026)
- 7d6c49a Add support for different variations of nemo-rl (#1220) (Sadegh Mahdavi, Feb 7, 2026)
- b19ba96 Add multi-node sandbox support for SLURM clusters (#1218) (George Armstrong, Feb 6, 2026)
- 8950bb0 support structured outputs in hle judge for optional AA compatibility (#1186) (anowaczynski-nvidia, Feb 7, 2026)
- b84f7a2 A small update on running tests docs (#1219) (Igor Gitman, Feb 6, 2026)
- 8e838e1 feat: add flag to disable sandbox replay (#1217) (George Armstrong, Feb 5, 2026)
- 5fd9085 Add an option to limit number of tool calls (#1216) (Igor Gitman, Feb 5, 2026)
- d820200 Add arena-hard v2 (#1205) (Igor Gitman, Feb 3, 2026)
- a30920e Fix mkdocs warnings (#1204) (Igor Gitman, Feb 2, 2026)
- 19d7788 Fix infinite wait in sandbox.wait_for_sandbox (#1206) (Ivan, Feb 2, 2026)
- 3e65fbf Improve tts (#1203) (Sadegh Mahdavi, Jan 30, 2026)
- 250c862 SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202) (Nick Ludwig, Jan 30, 2026)
- 7ded756 Add proper token counting to code execution model (#1184) (Ivan, Jan 30, 2026)
- b986304 Upgrade containers (#1198) (Igor Gitman, Jan 29, 2026)
- 3b44f02 Fix incorrect string format (#1199) (Dan Lord, Jan 29, 2026)
- c4854b8 Update nemo-rl to latest (#1087) (Sadegh Mahdavi, Jan 29, 2026)
Added PHYSICS benchmark, updated Scientific Knowledge documentation page
Summary by CodeRabbit
Documentation
New Features