bd0fab5 to
63188c9
Compare
📝 Walkthrough
Adds an AIME 2026 dataset package with module-level evaluation/generation constants and a CLI-able prepare script that loads the MathArena/aime_2026 dataset and exports split-specific JSONL manifests.
Sequence Diagram(s)
sequenceDiagram
actor User
participant CLI
participant Prepare as prepare.py
participant HF as "HuggingFace Dataset\n(MathArena/aime_2026)"
participant FS as "Local Filesystem"
User->>CLI: run prepare.py --split <split>
CLI->>Prepare: invoke main(args)
Prepare->>HF: load_dataset(..., split=split)
HF-->>Prepare: dataset iterator
loop for each entry
Prepare->>Prepare: format_entry(entry)
Prepare->>FS: append JSONL line
FS-->>Prepare: write ACK
end
Prepare->>User: exit (file written)
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
nemo_skills/dataset/aime26/prepare.py (1)
23-28: Missing type hints on functions.
As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code."
Suggested type hints
-def format_entry(entry):
-    return {
+def format_entry(entry: dict) -> dict:
+    return {
         "id": f"aime26-{entry['problem_idx']}",
         "problem": entry["problem"],
         "expected_answer": str(entry["answer"]),
     }

-def write_data_to_file(output_file, data):
+def write_data_to_file(output_file: Path, data: list) -> None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/aime26/prepare.py` around lines 23 - 28, The function format_entry lacks type hints; update its signature to accept a typed parameter (e.g., entry: Dict[str, Any] or Mapping[str, Any]) and declare the return type (e.g., Dict[str, str]) and add the necessary typing imports (from typing import Any, Dict or Mapping) so format_entry(entry: Dict[str, Any]) -> Dict[str, str] is explicit; ensure the keys "id", "problem", and "expected_answer" are reflected in the return type annotation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/aime26/__init__.py`:
- Line 1: Update the copyright year in the module header of
nemo_skills/dataset/aime26/__init__.py from 2024 to 2026 to match prepare.py;
open the file, locate the top-of-file copyright comment (the string "Copyright
(c) 2024, NVIDIA CORPORATION. All rights reserved."), and change the year to
2026 so both files use the same copyright year.
In `@nemo_skills/dataset/aime26/prepare.py`:
- Around line 38-43: In main(), the call load_dataset("MathArena/aime_2026",
split="train") is hardcoded to "train" while the output file is named using
args.split (output_file = data_dir / f"{args.split}.jsonl"); add a brief
clarifying comment above the load_dataset line (inside the main function)
explaining that the HuggingFace dataset only exposes a single "train" split but
the benchmark uses that data as the target split (e.g., test/validation), so we
intentionally load "train" regardless of args.split to produce the appropriately
named output file.
- Line 48: The parser.add_argument call for "--split" uses choices=("test")
which is a string, causing argparse to check membership against characters;
change it to a proper iterable like a single-element tuple or list (e.g.,
choices=("test",) or choices=["test"]) in the parser.add_argument call to ensure
only "test" is accepted; update the choices parameter in the
parser.add_argument("--split", ...) invocation and mirror the style used by
other dataset prepare.py files.
---
Nitpick comments:
In `@nemo_skills/dataset/aime26/prepare.py`:
- Around line 23-28: The function format_entry lacks type hints; update its
signature to accept a typed parameter (e.g., entry: Dict[str, Any] or
Mapping[str, Any]) and declare the return type (e.g., Dict[str, str]) and add
the necessary typing imports (from typing import Any, Dict or Mapping) so
format_entry(entry: Dict[str, Any]) -> Dict[str, str] is explicit; ensure the
keys "id", "problem", and "expected_answer" are reflected in the return type
annotation.
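The `--split` finding above can be reproduced with a small standalone sketch. It is not taken from `prepare.py`: the helper names `build_parser` and `accepts` are hypothetical and exist only for illustration. Because `("test")` is just the string `"test"`, argparse checks membership against the string rather than against a one-element collection, so a single character such as `t` passes validation. How the full word is treated by the string form may differ across Python versions, so the sketch only exercises the single-character case.

```python
import argparse


def build_parser(choices) -> argparse.ArgumentParser:
    # Hypothetical helper: builds a parser with the given `choices` value.
    parser = argparse.ArgumentParser()
    parser.add_argument("--split", choices=choices)
    return parser


def accepts(choices, value: str) -> bool:
    # argparse reports an invalid choice by raising SystemExit.
    try:
        build_parser(choices).parse_args(["--split", value])
        return True
    except SystemExit:
        return False


# ("test") is a plain string, so the single character "t" is accepted.
string_choices_accept_t = accepts(("test"), "t")

# ("test",) is a real one-element tuple, so only "test" is accepted.
tuple_choices_accept_t = accepts(("test",), "t")
tuple_choices_accept_test = accepts(("test",), "test")
```

A list (`choices=["test"]`) behaves the same as the tuple form and avoids the easy-to-miss trailing comma.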
@@ -0,0 +1,18 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Copyright year mismatch with prepare.py.
This file says Copyright (c) 2024 while prepare.py says Copyright (c) 2026. Since this is a new file for a 2026 benchmark, the year should likely be 2026 to be consistent.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/aime26/__init__.py` at line 1, Update the copyright year
in the module header of nemo_skills/dataset/aime26/__init__.py from 2024 to 2026
to match prepare.py; open the file, locate the top-of-file copyright comment
(the string "Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved."),
and change the year to 2026 so both files use the same copyright year.
def main(args):
    dataset = load_dataset("MathArena/aime_2026", split="train")
    data_dir = Path(__file__).absolute().parent
    data_dir.mkdir(exist_ok=True)
    output_file = data_dir / f"{args.split}.jsonl"
    write_data_to_file(output_file, dataset)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if other prepare.py scripts in this repo also load a different split than the output name
rg -n 'load_dataset.*split=' --glob 'nemo_skills/dataset/*/prepare.py' -A2 -B2
Repository: NVIDIA-NeMo/Skills
Length of output: 10510
Add a clarifying comment explaining the hardcoded split.
The dataset is loaded with split="train" (line 39) regardless of args.split, but the output file is named {args.split}.jsonl. This pattern—where the HuggingFace split differs from the output filename—appears elsewhere in the codebase (e.g., simpleqa), suggesting it's intentional because the HuggingFace dataset only exposes a train split but the benchmark treats it as a test set. A brief comment explaining this would prevent future confusion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/aime26/prepare.py` around lines 38 - 43, In main(), the
call load_dataset("MathArena/aime_2026", split="train") is hardcoded to "train"
while the output file is named using args.split (output_file = data_dir /
f"{args.split}.jsonl"); add a brief clarifying comment above the load_dataset
line (inside the main function) explaining that the HuggingFace dataset only
exposes a single "train" split but the benchmark uses that data as the target
split (e.g., test/validation), so we intentionally load "train" regardless of
args.split to produce the appropriately named output file.
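One way the requested clarifying comment could look is sketched below. This is an illustration under assumptions, not the code in the PR: the real script calls `load_dataset` directly inside `main(args)`, whereas the hypothetical `prepare` function here takes the loader as a `load` parameter so the sketch runs without network access. The claim that the Hub dataset exposes only a `train` split comes from the review comment and is not verified here.

```python
from pathlib import Path
from typing import Callable

# Assumption from the review comment: the Hub dataset exposes only "train".
HF_SPLIT = "train"


def prepare(split: str, data_dir: Path, load: Callable) -> tuple:
    # The Hugging Face dataset only exposes a "train" split, while the
    # benchmark treats that data as its evaluation split. Load "train"
    # regardless of `split`; `split` is used only to name the output file.
    dataset = load("MathArena/aime_2026", split=HF_SPLIT)
    output_file = data_dir / f"{split}.jsonl"
    return output_file, dataset


# Hypothetical stand-in for datasets.load_dataset that records its arguments.
calls = []


def fake_load(name: str, split: str) -> list:
    calls.append((name, split))
    return []


output_file, dataset = prepare("test", Path("aime26"), fake_load)
```

Keeping the Hub split in a named constant next to the comment makes the mismatch between the loaded split and the output filename visibly deliberate.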
Signed-off-by: bzantium <ryumin93@gmail.com>
63188c9 to
ef883de
Compare
🧹 Nitpick comments (1)
nemo_skills/dataset/aime26/prepare.py (1)
31-35: Missing type hints; consider pre-computing entries before opening the output file.
Two points:
- Type hints: same guideline violation as format_entry.
♻️ Proposed fix for type hints
-def write_data_to_file(output_file, data):
+def write_data_to_file(output_file: Path, data) -> None:
- Computation / I/O ordering: format_entry is called inside the open() block. A KeyError on a malformed entry mid-stream would leave a partially-written JSONL file on disk with no indication. Pre-materializing the formatted list before opening the file costs negligible memory for an AIME dataset (30 problems) and eliminates partial-output risk.
♻️ Proposed fix for computation ordering
 def write_data_to_file(output_file: Path, data) -> None:
+    formatted = [format_entry(entry) for entry in tqdm(data, desc="Processing")]
     with open(output_file, "wt", encoding="utf-8") as fout:
-        for entry in tqdm(data, desc=f"Writing {output_file.name}"):
-            json.dump(format_entry(entry), fout, ensure_ascii=False)
+        for entry in formatted:
+            json.dump(entry, fout, ensure_ascii=False)
             fout.write("\n")
As per coding guidelines, "Use type hints for simple types" and "perform all computations before re-opening files for writing to avoid accidental data loss if code fails during execution."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/aime26/prepare.py` around lines 31 - 35, Add type hints to write_data_to_file (e.g., def write_data_to_file(output_file: Path, data: Iterable[dict]) -> None) and precompute formatted entries before opening the file to avoid partial writes: build a list (e.g., formatted = [format_entry(entry) for entry in data]) then open the output_file and iterate over formatted with tqdm, json.dump each item and write a newline; reference the existing write_data_to_file and format_entry symbols when making the change.
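The ordering concern in this comment can be checked with a self-contained sketch. It reuses the `format_entry` and `write_data_to_file` names and the field names from the diff above, but drops `tqdm` so it runs with the standard library only; the sample entries and the `demo` helper are made up for illustration. With formatting done first, a malformed entry raises `KeyError` before the output file is ever opened, so no partial JSONL is left behind.

```python
import json
import tempfile
from pathlib import Path


def format_entry(entry: dict) -> dict:
    return {
        "id": f"aime26-{entry['problem_idx']}",
        "problem": entry["problem"],
        "expected_answer": str(entry["answer"]),
    }


def write_data_to_file(output_file: Path, data) -> None:
    # Format everything first; a bad entry fails here, before any file I/O.
    formatted = [format_entry(entry) for entry in data]
    with open(output_file, "wt", encoding="utf-8") as fout:
        for entry in formatted:
            json.dump(entry, fout, ensure_ascii=False)
            fout.write("\n")


def demo() -> tuple:
    # Made-up entries: the second one in `bad` lacks the "answer" key.
    good = [{"problem_idx": 1, "problem": "1+1?", "answer": 2}]
    bad = good + [{"problem_idx": 2, "problem": "missing answer"}]
    with tempfile.TemporaryDirectory() as tmp:
        ok_path = Path(tmp) / "ok.jsonl"
        write_data_to_file(ok_path, good)
        lines = ok_path.read_text(encoding="utf-8").splitlines()

        bad_path = Path(tmp) / "bad.jsonl"
        try:
            write_data_to_file(bad_path, bad)
        except KeyError:
            pass
        return lines, bad_path.exists()


written_lines, bad_file_exists = demo()
```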
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/dataset/aime26/prepare.py`:
- Around line 38-43: The call in main that uses
load_dataset("MathArena/aime_2026", split="train") is hardcoded and should be
annotated so future readers understand this is intentional; update the main
function (around load_dataset, dataset variable) to add a concise clarifying
comment noting that the HF dataset only exposes a "train" split and therefore
args.split is not used for loading (but still used for output_file), so the
hardcoded split should remain; ensure the comment references args.split and
write_data_to_file to make the intent obvious.
---
Nitpick comments:
In `@nemo_skills/dataset/aime26/prepare.py`:
- Around line 31-35: Add type hints to write_data_to_file (e.g., def
write_data_to_file(output_file: Path, data: Iterable[dict]) -> None) and
precompute formatted entries before opening the file to avoid partial writes:
build a list (e.g., formatted = [format_entry(entry) for entry in data]) then
open the output_file and iterate over formatted with tqdm, json.dump each item
and write a newline; reference the existing write_data_to_file and format_entry
symbols when making the change.
@Kipok I fixed the bug above and attached an evaluation.
Model Evaluation: Qwen3-30B-A3B-Thinking-2507
The following table summarizes the performance scores across different AIME (American Invitational Mathematics Examination) datasets.
Click the arrow below to expand the full technical breakdown, including
View JSON Execution Data
{
"aime26": {
"pass@1": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 87.66666666666667,
"no_answer": 0.0
},
"pass@2": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 91.25925925925927,
"no_answer": 0.0
},
"pass@1[avg-of-2]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 91.66666666666667,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 17334.65,
"std_dev_across_runs": 215.1725935150659,
"avg_sample_std_dev": 2548.012145553645,
"std_err_across_runs": 152.14999999999964
},
"symbolic_correct_statistics": {
"avg": 0.9166666666666666,
"std_dev_across_runs": 0.02357022603955158,
"avg_sample_std_dev": 0.023570226039551587,
"std_err_across_runs": 0.016666666666666663
}
},
"pass@3": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 92.25,
"no_answer": 0.0
},
"pass@1[avg-of-3]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 87.77777777777777,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 17662.333333333332,
"std_dev_across_runs": 587.6042233794223,
"avg_sample_std_dev": 3435.5203326507794,
"std_err_across_runs": 339.25345654507043
},
"symbolic_correct_statistics": {
"avg": 0.8777777777777778,
"std_dev_across_runs": 0.06938886664887108,
"avg_sample_std_dev": 0.07698003589195011,
"std_err_across_runs": 0.04006168083848876
}
},
"pass@4": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 92.76190476190476,
"no_answer": 0.0
},
"pass@1[avg-of-4]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 86.66666666666667,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 18331.816666666666,
"std_dev_across_runs": 1422.3282145833991,
"avg_sample_std_dev": 4074.4175305478043,
"std_err_across_runs": 711.1641072916996
},
"symbolic_correct_statistics": {
"avg": 0.8666666666666667,
"std_dev_across_runs": 0.06085806194501844,
"avg_sample_std_dev": 0.0859116756396542,
"std_err_across_runs": 0.03042903097250922
}
},
"pass@5": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.05555555555557,
"no_answer": 0.0
},
"pass@1[avg-of-5]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 86.66666666666667,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 18328.713333333333,
"std_dev_across_runs": 1231.791912576507,
"avg_sample_std_dev": 4199.473728063009,
"std_err_across_runs": 550.8740901311095
},
"symbolic_correct_statistics": {
"avg": 0.8666666666666667,
"std_dev_across_runs": 0.05270462766947298,
"avg_sample_std_dev": 0.08123619671700687,
"std_err_across_runs": 0.02357022603955158
}
},
"pass@6": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.22222222222221,
"no_answer": 0.0
},
"pass@1[avg-of-6]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 86.66666666666669,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 18596.200000000004,
"std_dev_across_runs": 1281.8516894624663,
"avg_sample_std_dev": 4426.99811124109,
"std_err_across_runs": 523.3137608512666
},
"symbolic_correct_statistics": {
"avg": 0.8666666666666668,
"std_dev_across_runs": 0.04714045207910316,
"avg_sample_std_dev": 0.08885962402813656,
"std_err_across_runs": 0.019245008972987525
}
},
"pass@7": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.30555555555556,
"no_answer": 0.0
},
"pass@1[avg-of-7]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 87.14285714285714,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 18356.109523809522,
"std_dev_across_runs": 1331.4617977182336,
"avg_sample_std_dev": 4374.646122160459,
"std_err_across_runs": 503.2452567064904
},
"symbolic_correct_statistics": {
"avg": 0.8714285714285714,
"std_dev_across_runs": 0.044839513942303275,
"avg_sample_std_dev": 0.08814386586003546,
"std_err_across_runs": 0.016947743257192553
}
},
"pass@8": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.33333333333333,
"no_answer": 0.0
},
"pass@1[avg-of-8]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 87.5,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 18257.5375,
"std_dev_across_runs": 1263.8300049487398,
"avg_sample_std_dev": 4177.041148241753,
"std_err_across_runs": 446.83138338314086
},
"symbolic_correct_statistics": {
"avg": 0.875,
"std_dev_across_runs": 0.04272466296411038,
"avg_sample_std_dev": 0.08346764803530463,
"std_err_across_runs": 0.015105449452916093
}
},
"pass@9": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.33333333333333,
"no_answer": 0.0
},
"pass@1[avg-of-9]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 87.03703703703702,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 18251.462962962964,
"std_dev_across_runs": 1182.345167975312,
"avg_sample_std_dev": 4085.528060506555,
"std_err_across_runs": 394.1150559917707
},
"symbolic_correct_statistics": {
"avg": 0.8703703703703703,
"std_dev_across_runs": 0.042309850588132816,
"avg_sample_std_dev": 0.0914098904572087,
"std_err_across_runs": 0.014103283529377605
}
},
"pass@10": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.33333333333333,
"no_answer": 0.0
},
"pass@1[avg-of-10]": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 87.66666666666667,
"no_answer": 0.0,
"reasoning_tokens_statistics": {
"avg": 0.0,
"std_dev_across_runs": 0.0,
"avg_sample_std_dev": 0.0,
"std_err_across_runs": 0.0
},
"answer_tokens_statistics": {
"avg": 18102.21,
"std_dev_across_runs": 1210.5279378880143,
"avg_sample_std_dev": 4033.7664918246287,
"std_err_across_runs": 382.8025454992963
},
"symbolic_correct_statistics": {
"avg": 0.8766666666666667,
"std_dev_across_runs": 0.044583116994975275,
"avg_sample_std_dev": 0.08953305854857022,
"std_err_across_runs": 0.01409841948938835
}
},
"majority@2": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 91.66666666666667,
"no_answer": 0.0
},
"majority@3": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 91.1111111111111,
"no_answer": 0.0
},
"majority@4": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 90.83333333333333,
"no_answer": 0.0
},
"majority@5": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 90.66666666666667,
"no_answer": 0.0
},
"majority@6": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 90.55555555555557,
"no_answer": 0.0
},
"majority@7": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.33333333333333,
"no_answer": 0.0
},
"majority@8": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.33333333333333,
"no_answer": 0.0
},
"majority@9": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.33333333333333,
"no_answer": 0.0
},
"majority@10": {
"num_entries": 30,
"avg_tokens": 18102,
"gen_seconds": 6223,
"symbolic_correct": 93.33333333333333,
"no_answer": 0.0
}
}
}
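As a reading aid for the `*_statistics` fields above, the sketch below checks how `std_err_across_runs` relates to `std_dev_across_runs` for the `pass@1[avg-of-2]` entry. The per-run scores of 27/30 and 28/30 are an assumption: they are not reported in the JSON, but they are consistent with the reported average of 0.9167 over 30 problems. Under that assumption, the reported values match the sample standard deviation (n - 1 in the denominator) and the standard error obtained by dividing it by the square root of the number of runs.

```python
import math
import statistics

# Assumed per-run accuracies (inferred, not reported in the JSON): 27/30 and 28/30.
run_scores = [27 / 30, 28 / 30]

avg = statistics.mean(run_scores)
# Sample standard deviation across runs (divides by n - 1).
std_dev = statistics.stdev(run_scores)
# Standard error of the mean across runs.
std_err = std_dev / math.sqrt(len(run_scores))

# Values reported under "pass@1[avg-of-2]" -> "symbolic_correct_statistics".
reported_avg = 0.9166666666666666
reported_std_dev = 0.02357022603955158
reported_std_err = 0.016666666666666663
```

The same division by the square root of the run count also matches the other `avg-of-k` entries, for example 0.06938886664887108 / sqrt(3) ≈ 0.0400617 for `pass@1[avg-of-3]`.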
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date: Fri Feb 20 09:39:39 2026 +0900
Add aime26 (#1256)
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
resolved: #1255
Summary by CodeRabbit