Support mini-swe-agent as agent harness #1212

Merged
wasiahmad merged 104 commits into main from mini_swe_agent
Feb 16, 2026

@wasiahmad
Collaborator

@wasiahmad wasiahmad commented Feb 4, 2026

Summary by CodeRabbit

  • New Features

    • Added mini-SWE-agent as a supported agent framework for SWE-bench evaluation, including per-datapoint execution, containerized run support, result generation in SWE-bench format, and integration into the processing flow.
    • Added three mini-SWE-agent prompt/config templates (standard, backticks, XML) to drive interactive, single-command iterative agent runs.
  • Documentation

    • Updated SWE-bench docs to reflect support and defaults for three agent frameworks: SWE-agent, mini-SWE-agent, and OpenHands.

wasiahmad and others added 30 commits November 29, 2025 15:55
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
…ize_robustness generic for more benchmarks, update docstrings. (#1079)

Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

No files reviewed, no comments

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 1 comment

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 6 comments

@greptile-apps
Contributor

greptile-apps bot commented Feb 7, 2026

Additional Comments (4)

nemo_skills/inference/eval/swebench.py
extra_body type mismatch

SweBenchInferenceConfig.extra_body is declared as a plain dict, but later code treats it as an OmegaConf object (OmegaConf.to_container(self.cfg.inference.extra_body, ...)). If extra_body remains a regular dict (as annotated/defaulted here), this call will error at runtime. Either declare extra_body with the same OmegaConf-compatible type used elsewhere in the codebase (e.g., DictConfig), or convert/merge it without calling OmegaConf.to_container when it’s already a dict.
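One way to make the conversion tolerant of both types is sketched below. The helper name `to_plain_dict` is hypothetical and not part of the PR; the sketch assumes only that `extra_body` is `None`, a plain `dict`, or an OmegaConf node, and it calls `OmegaConf.to_container` only in the last case.

```python
from typing import Any


def to_plain_dict(extra_body: Any) -> dict:
    """Return extra_body as a plain dict, whether it is a dict or an OmegaConf node.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if extra_body is None:
        return {}
    if isinstance(extra_body, dict):
        # Already a plain dict: copy it, no OmegaConf call needed.
        return dict(extra_body)

    # Imported lazily so the plain-dict path works without omegaconf installed.
    from omegaconf import OmegaConf

    if OmegaConf.is_config(extra_body):
        return OmegaConf.to_container(extra_body, resolve=True)
    raise TypeError(f"Unsupported extra_body type: {type(extra_body).__name__}")
```

With a helper like this, the call site would not need to know which of the two types it received.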


nemo_skills/inference/eval/swebench.py
Malformed CLI arguments

In _run_swe_agent, there’s no && after uv pip install -e . so the next fragment (uv pip install rich==...) will be appended as additional arguments to the python invocation instead of a separate shell command, causing the container command to fail.
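The failure mode can be reproduced in isolation. The sketch below assumes the command is built from adjacent Python string literals; `chain_commands` and the package name `example-package` are hypothetical and used only for illustration.

```python
def chain_commands(commands: list[str]) -> str:
    """Join shell commands so each runs only if the previous one succeeded."""
    return " && ".join(cmd.strip() for cmd in commands if cmd.strip())


# Python concatenates adjacent string literals, so a missing separator
# silently merges two commands into one.
broken = (
    "uv pip install -e . "
    "uv pip install example-package"  # hypothetical second command
)

# Building the command from a list makes the separator impossible to forget.
fixed = chain_commands(["uv pip install -e .", "uv pip install example-package"])

print(broken)  # uv pip install -e . uv pip install example-package
print(fixed)   # uv pip install -e . && uv pip install example-package
```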

                "uv pip install -e . && "

nemo_skills/inference/eval/swebench.py
Unescaped instance_id in shell

data_point['instance_id'] is interpolated into the container command for --output trajectories/{instance_id}.traj.json without shell quoting. If an instance id contains spaces/quotes/metacharacters, the command will break or execute unintended tokens. Please wrap it with shlex.quote(...) (similar to problem_statement) or otherwise ensure it’s safely escaped before embedding in the shell string.
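A minimal sketch of the suggested quoting, using the standard-library `shlex.quote`. The function name `build_output_arg` is hypothetical; only the `trajectories/{instance_id}.traj.json` path pattern comes from the comment above.

```python
import shlex


def build_output_arg(instance_id: str) -> str:
    """Return a shell-safe --output argument for one instance's trajectory file."""
    path = f"trajectories/{instance_id}.traj.json"
    # Quote the whole path so spaces, quotes and metacharacters stay one token.
    return f"--output {shlex.quote(path)}"
```

For an id made only of safe characters the argument is left unquoted; an id such as `a b; rm -rf x` is wrapped in single quotes and reaches the program as a single argument.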


nemo_skills/inference/eval/swebench.py
Temp config path traversal

tmp_config_filename = f"configs/config_{data_point['instance_id']}.yaml" is used to build host_tmp_path under self.output_dir. If instance_id contains / or .., this will create nested paths or allow writing/removing files outside output_dir when creating/cleaning up the temp config. Sanitize the filename (e.g., replace path separators and strip ..) or use a hash of instance_id for the on-disk name.
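The hashing option could look roughly like the sketch below. `safe_config_path` is a hypothetical helper, and truncating the SHA-256 digest to 16 hex characters is an arbitrary choice for readability; the containment check is a second line of defense in case the filename scheme changes later.

```python
import hashlib
from pathlib import Path


def safe_config_path(output_dir: str, instance_id: str) -> Path:
    """Build a temp-config path that cannot escape output_dir/configs."""
    # A hex digest contains no path separators or dots.
    digest = hashlib.sha256(instance_id.encode("utf-8")).hexdigest()[:16]
    configs_dir = (Path(output_dir) / "configs").resolve()
    candidate = (configs_dir / f"config_{digest}.yaml").resolve()
    # Defensive check: the resolved file must live under configs_dir.
    if configs_dir not in candidate.parents:
        raise ValueError(f"Refusing to write outside {configs_dir}")
    return candidate
```

The trade-off is that the on-disk name no longer shows the instance id, so a debugging aid such as logging the id alongside the path may be worth adding.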

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/evaluation/code.md (1)

280-281: ⚠️ Potential issue | 🟡 Minor

Add note clarifying that mini-SWE-agent is not supported for multilingual evaluation.

The swe-bench-multilingual section should explicitly document that mini-SWE-agent does not have multilingual support yet. While the current documentation correctly omits mini-SWE-agent from the multilingual example, this should be made explicit for clarity, especially since mini-SWE-agent is mentioned as a supported alternative in the regular SWE-bench section (lines 160-161).

Consider adding a note like: "Currently, only OpenHands and SWE-agent support multilingual evaluation. Mini-SWE-agent support for multilingual datasets is not yet available."

🤖 Fix all issues with AI agents
In `@nemo_skills/inference/eval/swebench.py`:
- Line 635: Replace the silent fallback to {} and fail-fast when expected key is
missing: change the access of trajectory_dict.get("info", {}) to direct
dictionary access trajectory_dict["info"] inside the function where
trajectory_info is assigned (variable name trajectory_info in swebench.py) so a
missing "info" raises a KeyError; if you need clearer context, wrap the access
in a try/except and re-raise a more specific error mentioning the trajectory id
or source (use the same trajectory_dict/trajectory_info variables) rather than
silently using an empty dict.

In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 40-45: The numbered rules in the prompt skip item 2 (they read 1,
3, 4, 5); update the ordered list in swebench.yaml so the sequence is
consecutive by changing "3. The system executes the command(s) in a subshell" to
"2. The system executes the command(s) in a subshell" and then renumber the
following lines accordingly ("4."→"3.", "5."→"4.") so the list reads 1–4 in
proper order.
- Line 18: The XML-like tag casing is inconsistent: the opening tag
`<IMPORTANT>` is uppercase but the closing tag `</important>` is lowercase;
update the closing tag to exactly match the opener (change `</important>` to
`</IMPORTANT>`) or make both tags consistently lowercase (e.g.,
`<important>`/`</important>`), ensuring the pair is identical and preserving the
surrounding sentence and punctuation in the `swebench.yaml` snippet.
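For the first item above (fail-fast access to `trajectory_dict["info"]`), a minimal sketch of the try/except variant follows. The function name `get_trajectory_info` and the wording of the error message are assumptions; only the `trajectory_dict` name and the `"info"` key come from the review.

```python
def get_trajectory_info(trajectory_dict: dict, instance_id: str) -> dict:
    """Return trajectory_dict["info"], raising a descriptive error if it is missing."""
    try:
        return trajectory_dict["info"]
    except KeyError as err:
        # Re-raise with context instead of silently falling back to {}.
        raise KeyError(
            f"Trajectory for instance {instance_id!r} has no 'info' key; "
            f"available keys: {sorted(trajectory_dict)}"
        ) from err
```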
🧹 Nitpick comments (2)
docs/evaluation/code.md (1)

160-161: Consider adding mini-SWE-agent-specific expected results or a note about expected performance.

The sample run section only shows expected results for OpenHands (issues_resolved: 48.4). While the instructions say to replace openhands with mini_swe_agent, users would benefit from knowing what resolve rate to expect with mini-swe-agent, even approximately. Based on learnings, documentation for benchmarks should include expected results for tested models.

nemo_skills/inference/eval/swebench.py (1)

558-652: Overall implementation looks solid; past review issues are resolved.

The three issues flagged in previous reviews are all addressed:

  1. Default config path now correctly points to "eval/swe-bench/mini-swe-agent/swebench" (line 574)
  2. Module invocation uses proper dotted path python -m minisweagent.run.mini (line 613)
  3. Search path {instance_id}.traj.json matches the --output argument (lines 617, 624)

The completion_kwargs construction logic (lines 563–572) is duplicated across _run_swe_agent, _run_mini_swe_agent, and _run_openhands. Consider extracting it into a shared helper to reduce duplication.
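The shape of such a helper might be as follows. This is only a sketch: the actual fields and key names used in `swebench.py` are not shown in this thread, so `InferenceParams`, its fields, and the key mapping are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class InferenceParams:
    # Hypothetical stand-in for the inference config; field names are assumptions.
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    tokens_to_generate: Optional[int] = None


def build_completion_kwargs(params: InferenceParams, key_map: dict[str, str]) -> dict[str, Any]:
    """Map set (non-None) inference fields to the key names one agent framework expects."""
    return {
        target: getattr(params, source)
        for source, target in key_map.items()
        if getattr(params, source) is not None
    }
```

Each `_run_*` method would then pass only its own key mapping, for example `{"tokens_to_generate": "max_tokens"}`, instead of repeating the filtering logic.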

Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml`:
- Around line 21-24: In swebench.yaml update the contradictory prompt
requirement on the response rules: replace the phrase "Provide exactly ONE bash
command to execute" with "Provide AT LEAST ONE bash command to execute" so it
matches the other occurrences and the format_error_template semantics; ensure
the change is made in the response instruction block near the top of the file so
all references (including format_error_template and lines that currently state
"AT LEAST ONE") are consistent.
🧹 Nitpick comments (2)
nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml (1)

175-176: Nit: trailing dot on cost_limit value.

cost_limit: 3. is valid YAML (parses as 3.0) but reads oddly. Consider 3.0 for clarity. This is consistent with the same pattern in the sibling swebench.yaml, so it's a minor cosmetic point across all three configs.

nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml (1)

113-114: Nit: trailing dot on cost_limit value.

Same as the sibling config — cost_limit: 3. is valid YAML but 3.0 would be more conventional.

@Kipok
Collaborator

Kipok commented Feb 10, 2026

no issues from my side, but will leave this to @ludwig-n for final approval. @wasiahmad please fix DCO

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Contributor

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 4 comments

@ludwig-n
Collaborator

Score for Qwen3-Coder-480B-A35B: ~52.5% (avg of 3 runs). The officially reported score is 55.4%. Good enough to merge in my opinion.

With this framework, patches don't always follow the git diff format, which causes a higher percentage of "patch can't apply" errors. However, this seems to be a "feature" of mini-swe-agent: it asks the LLM to create the patch file itself, rather than running git diff after the agent is done, as SWE-agent and OpenHands do. So it's not an issue with our implementation.

Contributor

@greptile-apps greptile-apps bot left a comment

2 files reviewed, 2 comments

@wasiahmad
Collaborator Author

Evaluated Minimax-M2.1 and Minimax-M2.5 on swe-bench-verified using mini-swe-agent and got the following scores.

  • Minimax-M2.1 => Pass@1: 70.6% (avg-of-3: 71.0, 71.0, 69.8)
  • Minimax-M2.5 => Pass@1: 75.9% (avg-of-3: 76.6, 75.0, 76.0)

There is no officially reported score for these models with mini-swe-agent.
Their official scores on swe-bench-verified are 74.0 and 80.2, respectively.
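As a quick check, the Pass@1 figures above are the means of the three listed runs, rounded to one decimal place. The snippet uses only the numbers quoted in this comment.

```python
from statistics import mean

# Per-run resolve rates (%) as listed in the comment above.
runs = {
    "Minimax-M2.1": [71.0, 71.0, 69.8],
    "Minimax-M2.5": [76.6, 75.0, 76.0],
}

averages = {model: round(mean(scores), 1) for model, scores in runs.items()}
print(averages)  # {'Minimax-M2.1': 70.6, 'Minimax-M2.5': 75.9}
```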

@wasiahmad wasiahmad merged commit 9517614 into main Feb 16, 2026
5 checks passed
@wasiahmad wasiahmad deleted the mini_swe_agent branch February 16, 2026 19:13
talorabr pushed a commit to talorabr/Nemo-Skills that referenced this pull request Feb 22, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ivan <imoshkov@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
Co-authored-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
Co-authored-by: Wei Du <wedu@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

    Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ivan <imoshkov@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
Co-authored-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
Co-authored-by: Wei Du <wedu@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Ivan <imoshkov@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
Co-authored-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
Co-authored-by: Wei Du <wedu@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>