Super v3 math tool calling slurm test by gwarmstrong · Pull Request #1303 · NVIDIA-NeMo/Skills

gwarmstrong · 2026-03-12T16:56:49Z

Summary by CodeRabbit

Tests
- Added an automated SLURM test suite that validates tool calling on the nano_30B model using the AIME24 and AIME25 math benchmarks.
- Checks tool usage counts, execution timeout counts, and `symbolic_correct` accuracy against expected ranges.
- Schedules both the evaluation and the result check as cluster jobs.

Signed-off-by: George Armstrong <georgea@nvidia.com>

coderabbitai · 2026-03-12T16:58:46Z

📝 Walkthrough


Introduces two new SLURM-based test scripts for evaluating MCP tool-calling functionality with a Nano 30B model on clusters. The orchestration script prepares data and launches evaluation jobs for aime24/aime25 benchmarks, while the validation script checks tool usage frequency, timeout counts, and math metric correctness against defined ranges.
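As a rough illustration of the timeout check described above, the sketch below counts JSONL rows whose content matches a timeout pattern. The pattern, the `generation` field, and the `count_timeouts` helper are assumptions made for this example; the actual regex and row layout used by `check_results.py` are not shown in this thread.

```python
import json
import re

# Hypothetical pattern; the real script's regex is not shown in this thread.
TIMEOUT_PATTERN = re.compile(r"timed out|timeout", re.IGNORECASE)


def count_timeouts(jsonl_lines):
    """Count rows whose serialized content matches the timeout pattern."""
    count = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the output file
        row = json.loads(line)
        # Search the whole serialized row so nested tool outputs are covered.
        if TIMEOUT_PATTERN.search(json.dumps(row)):
            count += 1
    return count


# Illustrative rows only; the field names are assumptions.
sample = [
    '{"generation": "ran code", "num_tool_calls": 2}',
    '{"generation": "Execution timed out after 10s", "num_tool_calls": 1}',
    '{"generation": "TimeoutError: timeout", "num_tool_calls": 1}',
]
timeout_rows = count_timeouts(sample)
```

A check of this kind would then compare the count against an allowed maximum rather than require zero, since occasional sandbox timeouts may be expected.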

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| MCP Tool-Calling Evaluation Tests `tests/slurm-tests/nano_30b_tool_calling/run_test.py`, `tests/slurm-tests/nano_30b_tool_calling/check_results.py` | Introduces test orchestration and validation scripts. `run_test.py` defines MODEL and SERVER_ARGS constants, implements `eval_math_tool_calling()` to schedule evaluation jobs on SLURM, and `main()` to prepare data and schedule result-checking. `check_results.py` implements three validators: `check_tool_usage()` counting tool calls from JSONL output, `check_timeouts()` detecting timeout indicators via regex, and `check_math_tool_calling()` validating symbolic_correct percentages against metric ranges. Uses soft assertions to aggregate validation failures. |
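The soft-assertion approach mentioned in the summary can be sketched as follows. This is a minimal illustration, not the repository's implementation: the `failures` list, the `METRIC_RANGES` values, and the `check_ranges` helper are hypothetical, and only the names `soft_assert` and `symbolic_correct` come from the review text.

```python
failures = []


def soft_assert(condition, message):
    """Record a failure instead of raising, so every check still runs."""
    if not condition:
        failures.append(message)


# Hypothetical ranges for illustration; the real thresholds are not shown here.
METRIC_RANGES = {"aime24": (80.0, 100.0), "aime25": (75.0, 100.0)}


def check_ranges(metrics):
    """Soft-assert that each benchmark's symbolic_correct lies in its range."""
    for benchmark, (low, high) in METRIC_RANGES.items():
        value = metrics[benchmark]["symbolic_correct"]
        soft_assert(
            low <= value <= high,
            f"{benchmark}: symbolic_correct={value} outside [{low}, {high}]",
        )


# Made-up metrics: aime24 is in range, aime25 is not.
check_ranges(
    {
        "aime24": {"symbolic_correct": 90.0},
        "aime25": {"symbolic_correct": 50.0},
    }
)
```

Collecting failures this way lets a single run report every out-of-range metric; a final step would typically raise if `failures` is non-empty.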

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title references 'Super v3' and 'math tool calling slurm test', which partially aligns with the changeset that introduces SLURM-based test scripts for math tool calling, but 'Super v3' is vague and doesn't clearly convey what the changes accomplish. | Consider a more descriptive title that clearly explains the primary change, such as 'Add SLURM test suite for nano-30B math tool calling evaluation' or similar. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Signed-off-by: George Armstrong <georgea@nvidia.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

tests/slurm-tests/super_120b_tool_calling/check_results.py (3)
136-146: Consider using Path consistently for path operations.

Line 139 uses os.path.join while the other check functions use Path (lines 73, 110). Using Path consistently would improve readability.
Suggested change
 def check_math_tool_calling(eval_dir: str):
     """Check accuracy metrics for math benchmarks with tool calling."""
     for benchmark in MATH_BENCHMARKS:
-        f = os.path.join(eval_dir, "eval-results", benchmark, "metrics.json")
+        f = Path(eval_dir) / "eval-results" / benchmark / "metrics.json"
         data = load_json(f)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/super_120b_tool_calling/check_results.py` around lines 136
- 146, The function check_math_tool_calling builds file paths with os.path.join;
change it to use pathlib.Path consistently (e.g., create p = Path(eval_dir) and
compute p / "eval-results" / benchmark / "metrics.json") to match other checks,
then pass either the Path or str(p) into load_json as used elsewhere; update the
reference in load_json call and anywhere the file path is used so
MATH_BENCHMARKS, eval_dir, check_math_tool_calling, and soft_assert remain
unchanged.
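To make the suggestion concrete, the short sketch below shows that the `os.path.join` form and the `Path` form produce the same location, and that wrapping an existing `Path` in `Path()` is harmless. The directory value is a made-up example.

```python
import os
from pathlib import Path

# Made-up directory for illustration only.
eval_dir = "/tmp/workspace/math_tool_calling"
benchmark = "aime24"

# Original style: string joining.
joined = os.path.join(eval_dir, "eval-results", benchmark, "metrics.json")

# Suggested style: pathlib's "/" operator.
pathed = Path(eval_dir) / "eval-results" / benchmark / "metrics.json"

# Both spellings refer to the same file.
same_location = Path(joined) == pathed

# Path() accepts an existing Path, so callers may pass either type.
rewrapped = Path(Path(eval_dir)) == Path(eval_dir)
```

Because `open()` accepts path-like objects, a loader that simply opens the file should work with either form; whether `load_json` does so is not shown in this thread.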
67-97: Consider using direct access for num_tool_calls since this test exercises the tool-calling path.

Line 84 uses .get("num_tool_calls", 0). Since this test specifically uses the tool-calling wrapper (per run_test.py), the num_tool_calls field should always be present in output. Using direct access row["num_tool_calls"] would surface unexpected issues where the field is missing.
Suggested change
-                    if row.get("num_tool_calls", 0) > 0:
+                    if row["num_tool_calls"] > 0:
As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/super_120b_tool_calling/check_results.py` around lines 67 -
97, In check_tool_usage, replace the permissive lookup row.get("num_tool_calls",
0) with direct indexing row["num_tool_calls"] so missing fields raise an error
and surface unexpected output shape; update any local logic that relied on the
default (e.g., the if that checks > 0 remains the same) and let the test fail
with a KeyError if the tool-calling wrapper omitted the field.
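The difference between the two lookups can be seen in a small sketch. The `count_rows_with_tool_calls` helper and the sample rows are hypothetical; only the `num_tool_calls` field name comes from the review.

```python
import json


def count_rows_with_tool_calls(jsonl_lines):
    """Return (rows that used a tool, total rows), failing on malformed rows."""
    used = 0
    total = 0
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        total += 1
        # Direct indexing: a missing field raises KeyError instead of
        # silently counting the row as "no tool calls".
        if row["num_tool_calls"] > 0:
            used += 1
    return used, total


good = ['{"num_tool_calls": 3}', '{"num_tool_calls": 0}']
bad = ['{"generation": "no counter here"}']

good_counts = count_rows_with_tool_calls(good)

try:
    count_rows_with_tool_calls(bad)
    missing_field_detected = False
except KeyError:
    missing_field_detected = True
```

With `.get("num_tool_calls", 0)`, the `bad` input would be reported as zero tool usage, which is indistinguishable from a model that never called a tool.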
148-163: Consider updating type hints to str | Path for consistency.

The function signatures declare eval_dir: str, but main() passes Path objects (e.g., eval_root / "math_tool_calling"). The code works because the functions either wrap with Path() or use os.path.join(), but the type hints don't reflect this.
Suggested change for function signatures
+from typing import Union
+
-def check_tool_usage(eval_dir: str):
+def check_tool_usage(eval_dir: Union[str, Path]):

-def check_timeouts(eval_dir: str):
+def check_timeouts(eval_dir: Union[str, Path]):

-def check_math_tool_calling(eval_dir: str):
+def check_math_tool_calling(eval_dir: Union[str, Path]):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/super_120b_tool_calling/check_results.py` around lines 148
- 163, The type hints for the functions that accept evaluation directories
(check_tool_usage, check_timeouts, check_math_tool_calling) are misleading
because main() passes Path objects; update the function signatures from
eval_dir: str to eval_dir: str | Path (or Path | str) and import typing as
needed so these functions accept both str and Path, and adjust any internal
Path() wrapping if present to avoid double-wrapping.
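A signature that accepts both types might look like the sketch below. The `metrics_path` helper is hypothetical and uses `Union[str, Path]` as in the suggested change; on Python 3.10 and later the same annotation can be written `str | Path`.

```python
from pathlib import Path
from typing import Union


def metrics_path(eval_dir: Union[str, Path], benchmark: str) -> Path:
    """Build the metrics file path from either a str or a Path directory."""
    # Normalizing once at the top keeps the rest of the function type-stable.
    return Path(eval_dir) / "eval-results" / benchmark / "metrics.json"


from_str = metrics_path("workspace/math_tool_calling", "aime25")
from_path = metrics_path(Path("workspace") / "math_tool_calling", "aime25")
```

Both calls return equal `Path` objects, so callers such as `main()` can pass whichever type they already hold.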



📥 Commits

Reviewing files that changed from the base of the PR and between 2b0a84d and 86301b1.

📒 Files selected for processing (2)

tests/slurm-tests/super_120b_tool_calling/check_results.py
tests/slurm-tests/super_120b_tool_calling/run_test.py

coderabbitai · 2026-03-12T17:11:06Z

tests/slurm-tests/super_120b_tool_calling/run_test.py

+def eval_math_tool_calling(workspace, cluster, expname_prefix, wandb_project, partition):
+    """Run AIME24 and AIME25 with MCP tool calling."""
+    eval(
+        ctx=wrap_arguments(COMMON_PARAMS),
+        cluster=cluster,
+        model=MODEL,
+        server_type="vllm",
+        server_gpus=8,
+        server_args=SERVER_ARGS,
+        output_dir=f"{workspace}/math_tool_calling",
+        benchmarks="aime24:16,aime25:16",
+        with_sandbox=True,
+        num_jobs=1,
+        partition=partition,
+        expname=f"{expname_prefix}-math-tool-calling",
+        wandb_project=wandb_project,
+        wandb_name=f"{expname_prefix}-math-tool-calling",
+    )
+
+    return [f"{expname_prefix}-math-tool-calling"]


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if other slurm tests pass wandb parameters with or without log_samples
rg -n "wandb_project|wandb_name|log_samples" tests/slurm-tests/ --type py -C 2

Repository: NVIDIA-NeMo/Skills

Length of output: 19619

🏁 Script executed:

fd -t f "*.py" | xargs rg -l "def eval\(" | head -5

Repository: NVIDIA-NeMo/Skills

Length of output: 431

🏁 Script executed:

rg -n "def eval\(" nemo_skills/pipeline/eval.py -A 30

Repository: NVIDIA-NeMo/Skills

Length of output: 1879

🏁 Script executed:

rg -n "def eval\(" nemo_skills/pipeline/eval.py -A 80 | head -100

Repository: NVIDIA-NeMo/Skills

Length of output: 5050

🏁 Script executed:

rg -n "wandb_project|wandb_name|log_samples" nemo_skills/pipeline/eval.py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 1914

Add log_samples=True or remove unused wandb parameters.

The eval() function only uses wandb_project and wandb_name when log_samples=True. Since log_samples defaults to False, these parameters are ignored. Either enable logging with log_samples=True, or remove the unused parameters.
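The behaviour described here, where arguments are accepted but only take effect when a flag is set, can be illustrated with a toy function. `schedule_eval` below is a hypothetical stand-in written for this example; it is not the `nemo_skills` `eval()` signature or implementation.

```python
def schedule_eval(benchmarks, log_samples=False, wandb_project=None, wandb_name=None):
    """Toy stand-in: return the settings a job would be launched with."""
    job = {"benchmarks": benchmarks}
    # The wandb arguments are only consulted when sample logging is enabled.
    if log_samples:
        job["wandb"] = {"project": wandb_project, "name": wandb_name}
    return job


# wandb arguments passed, but the flag keeps its default: they have no effect.
ignored = schedule_eval("aime24:16", wandb_project="demo", wandb_name="run-1")

# Same arguments with the flag enabled: they are picked up.
used = schedule_eval(
    "aime24:16", log_samples=True, wandb_project="demo", wandb_name="run-1"
)
```

In this toy, `ignored` carries no wandb settings at all, which mirrors why the reviewer flagged the call.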

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/slurm-tests/super_120b_tool_calling/run_test.py` around lines 44 - 63, The eval(...) call in eval_math_tool_calling passes wandb_project and wandb_name but log_samples defaults to False so those params are ignored; update the eval call in eval_math_tool_calling to enable logging by adding log_samples=True (or alternatively remove wandb_project and wandb_name if sample logging is not desired) so wandb_project/wandb_name are actually used; look for the eval invocation inside eval_math_tool_calling and modify its arguments (eval, ctx=wrap_arguments(COMMON_PARAMS), model=MODEL, etc.) to include log_samples=True or drop the wandb_* args.

the other slurm tests have this same pattern

@gwarmstrong, understood — thanks for the clarification! I'll note this pattern for future reviews.

(∪ ･ ω ･)∪

🧠 Learnings used

Learnt from: CR
Repo: NVIDIA-NeMo/Skills
PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2026-02-26T02:00:21.772Z
Learning: When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation

Signed-off-by: George Armstrong <georgea@nvidia.com>

Validates the ToolCallingWrapper / PythonTool code path on SLURM, covering tool usage, accuracy, and timeout regression for aime24/aime25.

Signed-off-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com>

coderabbitai

🧹 Nitpick comments (2)

tests/slurm-tests/nano_30b_tool_calling/check_results.py (1)
140-149: Minor path handling inconsistency.

This function uses os.path.join while check_tool_usage and check_timeouts use Path(). Not a functional issue, but could be unified for consistency.
♻️ Suggestion to use Path() consistently
 def check_math_tool_calling(eval_dir: str):
     """Check accuracy metrics for math benchmarks with tool calling."""
     for benchmark in MATH_BENCHMARKS:
-        f = os.path.join(eval_dir, "eval-results", benchmark, "metrics.json")
+        f = Path(eval_dir) / "eval-results" / benchmark / "metrics.json"
         data = load_json(f)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_tool_calling/check_results.py` around lines 140 -
149, Unify path handling in check_math_tool_calling by replacing os.path.join
usage with pathlib.Path like the other helpers; build the metrics path with
Path(eval_dir) / "eval-results" / benchmark / "metrics.json" (and call load_json
with either the Path or str(path) depending on load_json's expected type) so
check_math_tool_calling uses the same Path-based approach as check_tool_usage
and check_timeouts.
tests/slurm-tests/nano_30b_tool_calling/run_test.py (1)
27-29: Acknowledge: eval import shadows Python builtin.

The static analysis tool flagged that importing eval from nemo_skills.pipeline.cli shadows the Python builtin. This appears intentional since eval is the CLI command name, and the builtin eval() is not used in this script. Consider using an alias if this causes confusion in the future.
♻️ Optional: Use alias to avoid shadowing
-from nemo_skills.pipeline.cli import eval, prepare_data, run_cmd, wrap_arguments
+from nemo_skills.pipeline.cli import eval as run_eval, prepare_data, run_cmd, wrap_arguments
Then update the call site accordingly. However, this is optional if the pattern is consistent with other test scripts in this directory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/slurm-tests/nano_30b_tool_calling/run_test.py` around lines 27 - 29,
The import of eval from nemo_skills.pipeline.cli shadows the Python builtin;
change the import to alias the CLI command (e.g., import eval as cli_eval) and
update all call sites in this file that currently call eval(...) to use the new
alias (cli_eval(...)); specifically modify the import line that references eval
and any subsequent uses of eval in run_test.py (other imported symbols:
prepare_data, run_cmd, wrap_arguments remain unchanged).



📥 Commits

Reviewing files that changed from the base of the PR and between 86301b1 and be08678.

📒 Files selected for processing (2)

tests/slurm-tests/nano_30b_tool_calling/check_results.py
tests/slurm-tests/nano_30b_tool_calling/run_test.py

Signed-off-by: George Armstrong <georgea@nvidia.com>

Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>

commit f5c0c53 Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com> Date: Mon Mar 16 16:45:33 2026 +0400 Add Global PIQA benchmark (#1299) Signed-off-by: naymaraq <dkaramyan@nvidia.com> Co-authored-by: naymaraq <dkaramyan@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 86071c1 Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Thu Mar 12 21:16:32 2026 -0700 fixing sandbox use for livecodebench (#1304) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit 4928ef5 Author: George Armstrong <georgea@nvidia.com> Date: Thu Mar 12 15:28:41 2026 -0700 nano v3 math tool calling slurm test (#1303) Signed-off-by: George Armstrong <georgea@nvidia.com> commit d4e4450 Author: George Armstrong <georgea@nvidia.com> Date: Thu Mar 12 14:17:03 2026 -0700 fix: restore SIGINT handler in sandbox shell worker to prevent session resets (#1302) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 2b0a84d Author: Mahan <25934206+MahanFathi@users.noreply.github.com> Date: Thu Mar 12 00:07:49 2026 -0400 Add HotpotQA multi-hop QA benchmark (#1292) Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com> Signed-off-by: Mahan Fathi <mfathi@nvidia.com> Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com> Co-authored-by: Meriem B. 
<113170426+ka00ri@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Prasoon Varshney <prasoon1995@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 75314b6 Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Date: Thu Mar 12 08:06:51 2026 +0400 Gnalbandyan/ugph hle verified (#1293) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 8bbf387 Author: George Armstrong <georgea@nvidia.com> Date: Wed Mar 11 15:48:21 2026 -0700 build: fix gpu ci (#1301) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 005cd03 Author: Vahid Noroozi <VahidooX@users.noreply.github.com> Date: Tue Mar 10 12:52:27 2026 -0700 Fix 1-hour client timeout in long-running generation jobs (#1297) Signed-off-by: vahidoox <vnoroozi@nvidia.com> commit 596b888 Author: anowaczynski-nvidia <anowaczynski@nvidia.com> Date: Tue Mar 10 19:11:26 2026 +0100 skip output-rs*_submissions.jsonl files when summarizing critpt (#1300) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> commit fe92aec Author: anowaczynski-nvidia <anowaczynski@nvidia.com> Date: Tue Mar 10 00:00:57 2026 +0100 use output-rs prefix when detecting sampling results (#1296) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit f6f7041 Author: Dav Karamyan <47416614+naymaraq@users.noreply.github.com> Date: Tue Mar 10 02:40:06 2026 +0400 Add MMMLU benchmark (#1281) Signed-off-by: naymaraq <dkaramyan@nvidia.com> Signed-off-by: Igor Gitman <igor.a.gitman@gmail.com> Co-authored-by: naymaraq <dkaramyan@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit a5da597 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Mar 6 12:13:36 2026 -0800 Revert "Eval kit support (#1239)" (#1294) Signed-off-by: Igor Gitman 
<igitman@nvidia.com> commit b237e33 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Fri Mar 6 20:25:37 2026 +0400 Eval kit support (#1239) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> commit dc28bbf Author: George Armstrong <georgea@nvidia.com> Date: Thu Mar 5 10:17:44 2026 -0800 Python direct tool calling without MCP (#1286) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 12454dd Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Wed Mar 4 13:06:21 2026 -0800 Allow het servers for nemo-rl jobs (#1223) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 8884a68 Author: Prasoon Varshney <prasoon1995@gmail.com> Date: Wed Mar 4 10:24:02 2026 -0800 Support source_lang param for translation recipe (#1290) Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit 4618b19 Author: Meriem B. 
<113170426+ka00ri@users.noreply.github.com> Date: Wed Mar 4 18:59:28 2026 +0100 Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285) Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit 5ac8609 Author: Talor Abramovich <talor19@gmail.com> Date: Wed Mar 4 02:30:06 2026 +0200 Add SPEED-Bench (within repo) (#1279) Signed-off-by: Talor Abramovich <talora@nvidia.com> Signed-off-by: talora <talora@nvidia.com> Signed-off-by: Talor Abramovich <talor19@gmail.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com> commit c31eec5 Author: George Armstrong <georgea@nvidia.com> Date: Tue Mar 3 12:18:15 2026 -0800 Fix os.getlogin() crash in ns setup (#1289) Signed-off-by: George Armstrong <georgea@nvidia.com> commit c228e66 Author: George Armstrong <georgea@nvidia.com> Date: Tue Mar 3 11:04:54 2026 -0800 Fix streaming TypeError when delta.content is None (#1267) (#1288) Signed-off-by: George Armstrong <georgea@nvidia.com> commit aa47923 Author: Matvei Novikov <mnovikov@nvidia.com> Date: Mon Mar 2 16:28:41 2026 -0800 Add LibTrace recipe for generating domain-specific reasoning data (#1224) Signed-off-by: jubick1337 <mnovikov@nvidia.com> Signed-off-by: mnovikov <mnovikov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit 313cad7 Author: Stephen Ge <stepheng@nvidia.com> Date: Mon Mar 2 18:28:49 2026 -0500 fix: clean parse-failure retries in prover (#1284) Signed-off-by: Stephen Ge <stepheng@nvidia.com> commit 813cfa3 Author: George Armstrong <georgea@nvidia.com> Date: Mon Mar 2 15:10:08 2026 -0800 tst: rollback inference-api to integrate (#1287) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 31735f9 Author: Valentin Mendelev <vmendelev@nvidia.com> Date: Mon Mar 2 23:11:25 2026 
+0100 Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250) Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> commit d4ef8c0 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Fri Feb 27 23:58:54 2026 +0400 Update promt_config to working with openai format + inline setup (#1210) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit e879cbc Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 27 10:41:23 2026 -0800 Update noc tutorial (#1282) Signed-off-by: George Armstrong <georgea@nvidia.com> commit f6e3505 Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 27 10:17:33 2026 -0800 Add noc reasoning tutorial (#1278) Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com> Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com> Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com> Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com> commit fc2072a Author: Jiacheng Xu <jcxu@utexas.edu> Date: Fri Feb 27 10:10:25 2026 -0800 CritPt generation add prompt_format=None (#1280) Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit c8abe5d Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 27 09:31:26 2026 -0800 New slurm customization parameters (account, containers) (#1209) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit 2b38cce Author: George Armstrong <georgea@nvidia.com> Date: Wed Feb 25 17:59:52 2026 
-0800 Add nemo-skills-core subpackage for lightweight installs (#1229) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 9fa8e83 Author: Dheeraj Peri <peri.dheeraj@gmail.com> Date: Wed Feb 25 12:56:35 2026 -0800 feat: add custom judge type support for external repo integration (#1274) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: suriya <sgunasekar@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com> Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> commit 8a32b13 Author: Igor Gitman <igitman@nvidia.com> Date: Tue Feb 24 15:24:42 2026 -0800 Exclude numb3rs form test_eval.py (#1275) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 6da2219 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Mon Feb 23 18:37:46 2026 +0400 Numb3rs ds addition (#1174) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> commit ad034b5 Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Date: Sun Feb 22 11:55:24 2026 -0800 Add DSBench-DA evaluation (#1254) Squash merge of changes during code-review. 
Signed-off-by: suriya <sgunasekar@nvidia.com> commit 7593ab3 Author: Jiacheng Xu <jcxu@utexas.edu> Date: Fri Feb 20 16:42:01 2026 -0800 Add CritPt benchmark (#1200) Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 58c31b2 Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Date: Fri Feb 20 16:19:22 2026 -0800 Fix no_answer metric overcounting in _compute_pass_at_k (#1245) Signed-off-by: suriya <sgunasekar@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 1f1a2e7 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 20 15:58:40 2026 -0800 Fix incorrect prompt tokens count due to HF api update (#1264) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 8ebc6f5 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 20 09:05:33 2026 -0800 Remove deprecated dataset group (#1263) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit ea4177f Author: Yongqiang Wang <yongqiang.seagull@gmail.com> Date: Thu Feb 19 19:57:25 2026 -0500 fix deps (#1258) commit 60905a7 Author: Minho Ryu <ryumin93@gmail.com> Date: Fri Feb 20 09:39:39 2026 +0900 Add aime26 (#1256) Signed-off-by: bzantium <ryumin93@gmail.com> commit b28afc5 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 16:18:25 2026 -0800 Rename custom -> external benchmarks (#1262) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 6cc9c45 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 16:10:33 2026 -0800 Add reference to internal benchmarks repo (#1261) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 5202af6 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 16:08:05 2026 -0800 Remove incorrect presence-penalty setting (#1259) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 144c70b Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 19 15:26:33 2026 -0800 Adding an option to store benchmarks in external repo (#1240) Signed-off-by: 
Igor Gitman <igitman@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> commit 10e6e39 Author: George <37293288+Jorjeous@users.noreply.github.com> Date: Thu Feb 19 19:57:21 2026 +0400 update vllm miltimodal for api calls convenience (#1213) Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com> commit 1ba4219 Author: Nick Ludwig <nliudvig@nvidia.com> Date: Wed Feb 18 03:28:23 2026 +0400 Fix --server_container not being applied to dependent jobs (#1244) Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit 9517614 Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Mon Feb 16 11:13:24 2026 -0800 Support mini-swe-agent as agent harness (#1212) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: i-vainn <imoshkov@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Ivan <imoshkov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Charlie Truong <chtruong@nvidia.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> Co-authored-by: Wojciech 
Prazuch <wojciechprazuch3@gmail.com>
Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
Co-authored-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
Co-authored-by: Wei Du <wedu@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date: Fri Feb 13 22:32:15 2026 -0800
Add --installation_command support to prepare_data (#1243)
Signed-off-by: suriya <sgunasekar@nvidia.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date: Thu Feb 12 17:26:00 2026 -0800
Fix CI disk space for Docker image builds (#1241)
Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date: Wed Feb 11 17:55:00 2026 -0800
Fix answerbench prompt parsing (#1235)
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date: Wed Feb 11 14:56:43 2026 -0800
feat: add lockfiles for reproducible sandbox builds (#1233)
Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date: Wed Feb 11 13:30:03 2026 -0800
removing datasets version restriction for LCB eval (#1230)
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date: Wed Feb 11 12:03:16 2026 +0400
Gnalbandyan/add physics (#1214)
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date: Tue Feb 10 15:13:27 2026 -0800
LCB generic prompting (#1215)
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date: Sat Feb 7 08:45:46 2026 -0800
Add support for different variations of nemo-rl (#1220)
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date: Fri Feb 6 21:40:56 2026 -0800
Add multi-node sandbox support for SLURM clusters (#1218)
Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date: Sat Feb 7 01:38:00 2026 +0100
support structured outputs in hle judge for optional AA compatibility (#1186)
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date: Fri Feb 6 14:51:02 2026 -0800
A small update on running tests docs (#1219)
Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date: Thu Feb 5 18:01:35 2026 -0800
feat: add flag to disable sandbox replay (#1217)
Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date: Thu Feb 5 15:57:01 2026 -0800
Add an option to limit number of tool calls (#1216)
Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date: Tue Feb 3 10:43:55 2026 -0800
Add arena-hard v2 (#1205)
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date: Mon Feb 2 10:53:55 2026 -0800
Fix mkdocs warnings (#1204)
Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date: Mon Feb 2 23:25:13 2026 +0500
Fix infinite wait in sandbox.wait_for_sandbox (#1206)
Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date: Fri Jan 30 19:38:38 2026 -0800
Improve tts (#1203)
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date: Fri Jan 30 22:12:29 2026 +0400
SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date: Fri Jan 30 09:57:41 2026 +0500
Add proper token counting to code execution model (#1184)
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date: Thu Jan 29 17:57:07 2026 -0800
Upgrade containers (#1198)
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date: Thu Jan 29 16:40:47 2026 -0800
Fix incorrect string format (#1199)
Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date: Thu Jan 29 13:43:36 2026 -0800
Update nemo-rl to latest (#1087)
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>

gwarmstrong added 3 commits March 11, 2026 14:28

tst: initial draft of super_120b-tool_calling test

3cf4b47

Signed-off-by: George Armstrong <georgea@nvidia.com>

tst: intial draft of slurm test for aime with tool calling

20db2b3

Signed-off-by: George Armstrong <georgea@nvidia.com>

maint: restructure for consistency

a73ac98

Signed-off-by: George Armstrong <georgea@nvidia.com>

gwarmstrong requested a review from Kipok March 12, 2026 16:56

Merge branch 'main' into georgea/super-tool-calling-slurm-test

9288110

make partition an arg

86301b1

Signed-off-by: George Armstrong <georgea@nvidia.com>

coderabbitai bot reviewed Mar 12, 2026

gwarmstrong added 5 commits March 12, 2026 11:07

update test call structure for logging

bc9b3cb

Signed-off-by: George Armstrong <georgea@nvidia.com>

move super to nano

1bfc7c1

Signed-off-by: George Armstrong <georgea@nvidia.com>

Add nano_30b_tool_calling SLURM test for PythonTool MCP pipeline

8729901

Validates the ToolCallingWrapper / PythonTool code path on SLURM, covering tool usage, accuracy, and timeout regression for aime24/aime25. Signed-off-by: George Armstrong <georgea@nvidia.com>
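The three checks this commit describes (tool usage, accuracy, and timeout regression) could be sketched roughly as follows. This is a minimal illustration and not the code from `check_results.py`: the `num_tool_calls` field, the timeout regex, the function names, and every threshold are assumptions made for the example.

```python
import json
import re

# Hypothetical pattern; the real test may look for different timeout markers.
TIMEOUT_PATTERN = re.compile(r"timed out|timeout", re.IGNORECASE)


def count_tool_calls(jsonl_lines):
    """Return (samples_with_tool_calls, total_samples) for JSONL records.

    Assumes each record carries a hypothetical `num_tool_calls` field.
    """
    with_tools = 0
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        total += 1
        if record.get("num_tool_calls", 0) > 0:
            with_tools += 1
    return with_tools, total


def count_timeouts(jsonl_lines):
    """Count raw lines that contain a timeout indicator."""
    return sum(1 for line in jsonl_lines if TIMEOUT_PATTERN.search(line))


def run_checks(jsonl_lines, symbolic_correct, accuracy_range,
               max_timeouts, min_tool_fraction):
    """Soft assertions: collect every failure instead of stopping at the first."""
    failures = []

    with_tools, total = count_tool_calls(jsonl_lines)
    if total == 0 or with_tools / total < min_tool_fraction:
        failures.append(f"tool usage too low: {with_tools}/{total}")

    timeouts = count_timeouts(jsonl_lines)
    if timeouts > max_timeouts:
        failures.append(f"too many timeouts: {timeouts} > {max_timeouts}")

    low, high = accuracy_range
    if not low <= symbolic_correct <= high:
        failures.append(
            f"symbolic_correct {symbolic_correct} outside [{low}, {high}]"
        )
    return failures
```

Collecting failures in a list rather than raising on the first one mirrors the soft-assertion approach mentioned in the walkthrough, so a single run reports every metric that is out of range.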

maint: make default jobs 1

be08678

Signed-off-by: George Armstrong <georgea@nvidia.com>

Merge branch 'main' into georgea/super-tool-calling-slurm-test

712cb1c

coderabbitai bot reviewed Mar 12, 2026

Kipok approved these changes Mar 12, 2026

gwarmstrong merged commit 4928ef5 into main Mar 12, 2026
5 checks passed

gwarmstrong deleted the georgea/super-tool-calling-slurm-test branch March 12, 2026 22:28

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

nano v3 math tool calling slurm test (#1303)

9d63c9c

Signed-off-by: George Armstrong <georgea@nvidia.com>

dgtm777 pushed a commit that referenced this pull request Mar 18, 2026

nano v3 math tool calling slurm test (#1303)

e1c003c

Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>

gwarmstrong merged 10 commits into main from georgea/super-tool-calling-slurm-test
