Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
104 commits
Select commit Hold shift + click to select a range
537e8b3
some support
wasiahmad Nov 29, 2025
febd824
updating to support mini-swe-agent
wasiahmad Dec 21, 2025
783354b
updating to support mini-swe-agent
wasiahmad Dec 21, 2025
71167b0
updating to support mini-swe-agent
wasiahmad Dec 21, 2025
6422992
fixing a minor bug
wasiahmad Dec 21, 2025
583b9ab
fixing a minor bug
wasiahmad Dec 21, 2025
3d99c60
fixing a minor bug
wasiahmad Dec 21, 2025
3846964
fixing a minor bug
wasiahmad Dec 21, 2025
dac4bca
fixing a minor bug
wasiahmad Dec 21, 2025
9f3e234
fixing a minor bug
wasiahmad Dec 22, 2025
fa2e5b5
Add HMMT Nov 2025 dataset (#1061)
i-vainn Dec 1, 2025
4bd5f7d
Use docker build cache (#1056)
gwarmstrong Dec 2, 2025
aff6e4a
ci: Add CodeRabbit configuration file (#1063)
chtruong814 Dec 2, 2025
7c44a0d
FIX integration tests by escaping aalcr and adding judge args (#1062)
gwarmstrong Dec 2, 2025
e5bcd68
ENH add tool calling args (#1067)
gwarmstrong Dec 2, 2025
c74cd99
Fix sglang tool calling (#1070)
gwarmstrong Dec 4, 2025
e03f563
Network Blocking for Sandbox Code Execution (#1071)
gwarmstrong Dec 4, 2025
c376270
Fixes to support SWE-bench Multilingual (#1064)
ludwig-n Dec 4, 2025
1b1f66e
fix: IFBench error handling and build improvements (#1073)
gwarmstrong Dec 4, 2025
782b083
FIX math verify handle leading zeros and int literals cases (#1074)
gwarmstrong Dec 4, 2025
1545f73
build: move data preparation to beginning of gpu tests build (#1077)
gwarmstrong Dec 5, 2025
6594d4c
MAINT update langugage-data dependency (#1076)
gwarmstrong Dec 5, 2025
53f1056
MAINT: Add audio requirements to vllm image (#1081)
gwarmstrong Dec 5, 2025
7e35ddd
Add apex-shortlist dataset (#1080)
i-vainn Dec 8, 2025
0316807
Introduce regex for small differences of formatting from judge (#1082)
wprazuch Dec 9, 2025
0807259
Add LCB Prompts, fix regex bug in robust_eval, remove CR, make summar…
gnalbandyan Dec 9, 2025
b74c543
MAINT pin nemo-evaluator (#1095)
gwarmstrong Dec 10, 2025
5c15cf7
Update issue templates
gwarmstrong Dec 11, 2025
c4eb65f
Delete .github/ISSUE_TEMPLATE directory
gwarmstrong Dec 11, 2025
2d93252
enable blank issues (#1096)
gwarmstrong Dec 11, 2025
b40fff1
Fix input_file path handling when executor is "none" (#1089)
bzantium Dec 11, 2025
da79a43
TST for #1089 (#1097)
gwarmstrong Dec 11, 2025
e4aa660
Stepheng/prover cleanup (#1078)
stephencge Dec 11, 2025
7934476
add stem dependencies in main python sandbox (#1099)
jiacheng-xu Dec 11, 2025
eb5fe5a
Audiometrics unification (#1093)
Jorjeous Dec 11, 2025
56a3fa9
FEAT Add Tavily Search (#1085)
gwarmstrong Dec 11, 2025
f7e5479
updating code extraction logic (#1086)
wasiahmad Dec 11, 2025
56662d3
Sandbox add stem (#1101)
jiacheng-xu Dec 12, 2025
2007af2
Handle none output in wmtp24++ (#1091)
Froxyy-dev Dec 12, 2025
180f114
ENH enable sandbox env overrides in generate (#1107)
gwarmstrong Dec 12, 2025
637ce1f
Search Tool Parameter updates (#1112)
gwarmstrong Dec 15, 2025
3fb4e65
autoformalize cleanup (#1098)
stephencge Dec 15, 2025
c98b587
HF ASR Leaderboard Evaluation (#1104)
melllinia Dec 15, 2025
3ea7a17
Stepheng/nemotron math proofs docs (#1111)
stephencge Dec 16, 2025
e9ad754
Stepheng/prover gpt oss fix (#1114)
stephencge Dec 16, 2025
552af8c
add Nemotron-Math-V2.pdf (#1113)
wedu-nvidia Dec 16, 2025
dfc8e9a
SWE-bench: don't pass external environment variables into Apptainer c…
ludwig-n Dec 16, 2025
88ad93b
Adding clan PR with AudioBench and Librispeech PC. (#1103)
Jorjeous Dec 16, 2025
9b3c571
Schema overrides for tool-calling (#1118)
gwarmstrong Dec 16, 2025
cec7759
FIX tool call error handling and search tool errors (#1120)
gwarmstrong Dec 17, 2025
e7582a3
Use run.Script for generate pipeline (#1052)
gwarmstrong Dec 17, 2025
8e02df1
Port ICPC changes to IOI (#1046)
SeanNaren Dec 17, 2025
92a1bc9
replace raise error with LOG.warning in AA LCR dataset prepare (#1119)
anowaczynski-nvidia Dec 17, 2025
8e0c152
FIX tavily search results return type (#1123)
gwarmstrong Dec 17, 2025
9a042c1
Revert "Use run.Script for generate pipeline (#1052)" (#1125)
gwarmstrong Dec 18, 2025
667d56b
Fix: add serialized_output on bad request (#1127)
gwarmstrong Dec 18, 2025
21a4be4
update paper link (#1128)
wedu-nvidia Dec 18, 2025
25aae9e
update paper link, references to dataset, self-correction differences…
stephencge Dec 18, 2025
3754a9e
FIX ioi ignore (#1131)
gwarmstrong Dec 18, 2025
fb866ea
download AA-LCR_extracted-text.zip via hf_hub_download (#1126)
anowaczynski-nvidia Dec 18, 2025
c52c04e
Evaluation on Livecodebench-pro (#1115)
wasiahmad Dec 19, 2025
67d3493
Evaluation support for SWE-rebench (#1102)
wasiahmad Dec 24, 2025
46ecb38
Trust remote code in tokenizer (#1146)
Kipok Dec 27, 2025
b5fe5e0
Merge branch 'main' into mini_swe_agent
wasiahmad Feb 4, 2026
0a49ab3
adding mini-swe-agent in generation-task
wasiahmad Feb 4, 2026
d22eb1d
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
0a56152
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
1a815de
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
fbdaf2f
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
6ed5573
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
9338bde
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
961262e
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
4790ea9
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
08db911
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
a85041c
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
0288d9a
updating mini-swe-agent cmd
wasiahmad Feb 4, 2026
7bae25b
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
26d0e54
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
9bff948
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
241393d
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
c1fb80c
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
61c666d
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
aa08f7d
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
f3dc6df
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
9cde4e6
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
5a64c00
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
d911f9f
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
6dcba7b
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
894ca31
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
0ce1349
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
b685c0b
updating mini-swe-agent cmd
wasiahmad Feb 5, 2026
f27c50b
Fix getting patch
ludwig-n Feb 6, 2026
d567697
Save configs in separate folder
ludwig-n Feb 6, 2026
4c3e0db
Update docs
ludwig-n Feb 6, 2026
20f8580
Remove drop_params from configs
ludwig-n Feb 6, 2026
74922d7
supporting agent_max_turns
wasiahmad Feb 6, 2026
6615af3
downgrading rich to avoid issues with some instances
wasiahmad Feb 6, 2026
72f0d46
missing && added
wasiahmad Feb 7, 2026
8e0848a
Merge branch 'main' into mini_swe_agent
wasiahmad Feb 7, 2026
6251767
Merge branch 'main' into mini_swe_agent
wasiahmad Feb 7, 2026
8a095f7
adding reference
wasiahmad Feb 7, 2026
0acf4a4
Remove step_limit and set cost_limit=0 in all configs
ludwig-n Feb 10, 2026
4b05c9d
Merge branch 'main' into mini_swe_agent
wasiahmad Feb 10, 2026
276556c
Merge branch 'main' into mini_swe_agent
wasiahmad Feb 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 14 additions & 13 deletions docs/evaluation/code.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ More details are coming soon!
- Benchmark is defined in [`nemo_skills/dataset/swe-bench/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py)
- Original benchmark source is [here](https://github.com/SWE-bench/SWE-bench).

Nemo-Skills can run inference (rollout) on SWE-bench-style datasets using 2 agentic frameworks: [SWE-agent](https://swe-agent.com/latest/) and [OpenHands](https://www.all-hands.dev/). It can then evaluate the generated patches on SWE-bench Verified/Lite/Full using the [official SWE-bench harness](https://www.swebench.com/SWE-bench/guides/evaluation/).
Nemo-Skills can run inference (rollout) on SWE-bench-style datasets using 3 agent frameworks: [SWE-agent](https://swe-agent.com/latest/), [mini-SWE-agent](https://mini-swe-agent.com/latest/) and [OpenHands](https://www.all-hands.dev/). It can then evaluate the generated patches on SWE-bench Verified/Lite/Full using the [official SWE-bench harness](https://www.swebench.com/SWE-bench/guides/evaluation/).

#### Data preparation

Expand Down Expand Up @@ -66,19 +66,19 @@ When this path is accessed during evaluation, `{instance_id}` will be replaced b

There are a few parameters specific to SWE-bench. They have to be specified with the `++` prefix. All of them are optional, except for `++agent_framework`.

- **++agent_framework:** which agentic framework to use. Must be either `swe_agent` or `openhands`. No default, must be specified explicitly.
- **++agent_framework:** which agent framework to use. Must be one of `swe_agent`, `mini_swe_agent` or `openhands`. No default, must be specified explicitly.

- **++agent_framework_repo:** URL of the repository to use for SWE-agent/OpenHands. Allows you to pass in a custom fork of these repositories. If you do this, you may find it helpful to check [nemo_skills/inference/eval/swebench.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/eval/swebench.py) to understand how the frameworks are used internally. This is passed directly as an argument to `git clone`. Defaults to the official repositories: [`https://github.com/SWE-agent/SWE-agent.git`](https://github.com/SWE-agent/SWE-agent) for SWE-agent, [`https://github.com/All-Hands-AI/OpenHands.git`](https://github.com/All-Hands-AI/OpenHands) for OpenHands.
- **++agent_framework_repo:** URL of the repository to use for SWE-agent/mini-SWE-agent/OpenHands. Allows you to pass in a custom fork of these repositories. If you do this, you may find it helpful to check [nemo_skills/inference/eval/swebench.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/eval/swebench.py) to understand how the frameworks are used internally. This is passed directly as an argument to `git clone`. Defaults to the official repositories: [`https://github.com/SWE-agent/SWE-agent.git`](https://github.com/SWE-agent/SWE-agent) for SWE-agent, [`https://github.com/SWE-agent/mini-swe-agent.git`](https://github.com/SWE-agent/mini-swe-agent) for mini-SWE-agent, [`https://github.com/All-Hands-AI/OpenHands.git`](https://github.com/All-Hands-AI/OpenHands) for OpenHands.

- **++agent_framework_commit:** The commit hash, branch or tag to checkout after cloning agent_framework_repo. Allows you to pin SWE-agent/OpenHands to a specific version. Defaults to `HEAD`, i.e. the latest commit.
- **++agent_framework_commit:** The commit hash, branch or tag to checkout after cloning agent_framework_repo. Allows you to pin SWE-agent/mini-SWE-agent/OpenHands to a specific version. Defaults to `HEAD` for SWE-agent & OpenHands and `v2.0` for mini-SWE-agent.

- **++agent_config:** The config file to use for SWE-agent/OpenHands.
- For SWE-agent, this is a YAML file. See the [SWE-agent docs](https://swe-agent.com/latest/config/config/).
- **++agent_config:** The config file to use for the agent framework.
- For SWE-agent and mini-SWE-agent, this is a YAML file. See the docs: [SWE-agent](https://swe-agent.com/latest/config/config/), [mini-SWE-agent](https://mini-swe-agent.com/latest/advanced/yaml_configuration/).
- For OpenHands, this is a TOML file. Nemo-Skills runs OpenHands via their SWE-bench evaluation script, so the only settings you can set are the LLM settings under the `[llm.model]` section. For more details, see the [OpenHands evaluation README](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/README.md). Note that Nemo-Skills always uses the `[llm.model]` config section and therefore does not support multiple LLM configurations in one TOML file.
- Nemo-Skills overrides certain parameters, even if they are specified in the config file. These parameters are listed in a comment in the default config files below.
- Defaults to [eval/swe-bench/swe-agent/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/swe-agent/default.yaml) for SWE-agent, [eval/swe-bench/openhands/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/openhands/default.toml) for OpenHands. Note that if you store your configs in your local Nemo-Skills repo, then the path can be relative to the `nemo_skills/prompt` folder and the file extension is added automatically (same as how it works with regular [prompt configs](../basics/prompt-format.md)).
- Defaults to [eval/swe-bench/swe-agent/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/swe-agent/default.yaml) for SWE-agent, [eval/swe-bench/mini-swe-agent/swebench](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml) for mini-SWE-agent, [eval/swe-bench/openhands/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/openhands/default.toml) for OpenHands. Note that if you store your configs in your local Nemo-Skills repo, then the path can be relative to the `nemo_skills/prompt` folder and the file extension is added automatically (same as how it works with regular [prompt configs](../basics/prompt-format.md)).

- **++agent_max_turns:** The maximum number of turns the agent is allowed to take before the trajectory is forcibly terminated. Defaults to 100 for both SWE-agent and OpenHands.
- **++agent_max_turns:** The maximum number of turns the agent is allowed to take before the trajectory is forcibly terminated. Defaults to 100 for all agent frameworks.

- **++eval_harness_repo:** URL of the repository to use for the evaluation harness. This is passed directly as an argument to `git clone`. Defaults to [`https://github.com/Kipok/SWE-bench.git`](https://github.com/Kipok/SWE-bench), our fork of SWE-bench that supports local evaluation.

Expand All @@ -94,24 +94,25 @@ There are a few parameters specific to SWE-bench. They have to be specified with

#### Inference parameters

For this benchmark, inference parameters work a bit differently. This is because it does not use the Nemo-Skills LLM client, instead the interaction with the LLM server is handled by SWE-agent/OpenHands.
For this benchmark, inference parameters work a bit differently. This is because it does not use the Nemo-Skills LLM client, instead the interaction with the LLM server is handled by the agent framework.

Most inference parameters are not passed to the LLM by default if you don't explicitly specify them, with the exception of temperature (defaults to 0) and top_p (defaults to 0.95). Any parameters you set explicitly will be passed. Custom parameters can be set via extra_body like this: `++inference.extra_body.chat_template_kwargs.enable_thinking=False`. However, keep in mind certain parameters may not be supported by your LLM server.

It's worth noting that when using VLLM with a HuggingFace model, any parameters that are not passed to the server will be taken from the model's config on HuggingFace by default. This may or may not be what you want. To disable this, you can add `--generation-config vllm` to the `--server_args` parameter. See [VLLM docs](https://docs.vllm.ai/en/latest/configuration/engine_args.html#-generation-config).

#### Tool calling

SWE-bench requires models to call custom tools. By default SWE-agent & OpenHands expect that the LLM server supports *native tool calling*, which means the server can parse the model's tool calls and return them in a structured format separately from the rest of the model's output. This is convenient because the agentic framework doesn't have to know what the model's preferred tool call format is. In order to set this up, you need to add these arguments to `--server_args`:
SWE-bench requires models to call custom tools. By default agent frameworks expect that the LLM server supports *native tool calling*, which means the server can parse the model's tool calls and return them in a structured format separately from the rest of the model's output. This is convenient because the agent framework doesn't have to know what the model's preferred tool call format is. In order to set this up, you need to add these arguments to `--server_args`:

- for VLLM: `--enable-auto-tool-choice --tool-call-parser <PARSER_NAME>`
- for SGLang: `--tool-call-parser <PARSER_NAME>`

For more details and the list of supported parsers, see the docs: [VLLM](https://docs.vllm.ai/en/stable/features/tool_calling.html#automatic-function-calling), [SGLang](https://docs.sglang.ai/advanced_features/function_calling.html).

In addition, both SWE-agent and OpenHands can run without native tool calling. This means the tool calls will be parsed by the agentic framework itself. To try this out, you can use the following configs with the `++agent_config` parameter:
In addition, all supported agent frameworks can run without native tool calling. This means the tool calls will be parsed by the agent framework itself. To try this out, you can use the following configs with the `++agent_config` parameter:

- for SWE-agent: [eval/swe-bench/swe-agent/swe-agent-lm-32b](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/swe-agent/swe-agent-lm-32b.yaml). This was the config used for [SWE-agent-LM-32B](https://huggingface.co/SWE-bench/SWE-agent-LM-32B). Note that there are significant differences with the default config.
- for mini-SWE-agent: [eval/swe-bench/mini-swe-agent/swebench_xml](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_xml.yaml) or [eval/swe-bench/mini-swe-agent/swebench_backticks](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml).
- for OpenHands: [eval/swe-bench/openhands/no-native-tool-calling](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/openhands/no-native-tool-calling.toml). This simply sets `native_tool_calling` to `false`.

Keep in mind that by default the tool call format expected by these frameworks will likely be different from the one that the model was trained on.
Expand Down Expand Up @@ -156,8 +157,8 @@ After all jobs are complete, you can check the results in `<OUTPUT_DIR>/eval-res
```
Keep in mind there is some variance between runs, so we recommend running evaluation multiple times and averaging out the resolve rate. To do that automatically, you can set `--benchmarks=swe-bench:N`, where N is your desired number of repeats.

To evaluate the same model with SWE-agent,
all you need to do is replace `openhands` with `swe_agent` in the command above.
To evaluate the same model with SWE-agent or mini-SWE-agent,
all you need to do is replace `openhands` with `swe_agent` or `mini_swe_agent` in the command above.

!!! note
There are some instances where the gold (ground truth) patches do not pass the evaluation tests. Therefore, it's likely that on those instances even patches that resolve the issue will be incorrectly evaluated as "unresolved". We have observed 11 such instances in SWE-bench Verified: `astropy__astropy-7606`, `astropy__astropy-8707`, `astropy__astropy-8872`, `django__django-10097`, `psf__requests-1724`, `psf__requests-1766`, `psf__requests-1921`, `psf__requests-2317`, `pylint-dev__pylint-6528`, `pylint-dev__pylint-7080`, `pylint-dev__pylint-7277`. Depending on your setup, this set of instances may be different.
Expand Down
121 changes: 121 additions & 0 deletions nemo_skills/inference/eval/swebench.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@

import hydra
import tomlkit
import yaml
from omegaconf import OmegaConf

from nemo_skills.inference.generate import GenerationTask
Expand All @@ -44,6 +45,7 @@
class SupportedAgentFrameworks(str, Enum):
    """Agent frameworks Nemo-Skills can use to roll out SWE-bench-style instances.

    Inherits from ``str`` so members compare equal to their plain-string values
    (convenient when the framework name arrives as a string from CLI/Hydra
    overrides such as ``++agent_framework=mini_swe_agent``).
    """

    swe_agent = "swe_agent"
    openhands = "openhands"
    mini_swe_agent = "mini_swe_agent"


# Like nemo_skills.inference.generate.InferenceConfig, except most parameters are not passed by default
Expand Down Expand Up @@ -254,6 +256,27 @@ def __init__(self, cfg: SweBenchGenerationConfig):
"uv pip install rich==14.2.0"
)

elif self.cfg.agent_framework == SupportedAgentFrameworks.mini_swe_agent:
if self.cfg.agent_framework_repo is None:
self.cfg.agent_framework_repo = "https://github.com/SWE-agent/mini-swe-agent.git"
if self.cfg.agent_framework_commit is None:
self.cfg.agent_framework_commit = "v2.0"
setup_commands.append(
# clone the mini-swe-agent repo
"rm -rf /root/mini-swe-agent && "
f"git clone {self.cfg.agent_framework_repo} /root/mini-swe-agent && "
"cd /root/mini-swe-agent && "
# Bypass the interactive setup wizard by pointing to the default config
"export MSWEA_MINI_CONFIG_PATH=/root/mini-swe-agent/src/minisweagent/config/benchmarks/swebench.yaml && "
f"git checkout {self.cfg.agent_framework_commit} && "
# make venv & install mini-swe-agent dependencies
"uv venv --python 3.12 --managed-python venv && "
"source venv/bin/activate && "
"uv pip install -e . && "
# force downgrade rich - newer versions cause the swe-agent logger to hang in some instances
"uv pip install rich==14.2.0"
)

elif self.cfg.agent_framework == SupportedAgentFrameworks.openhands:
if self.cfg.multilingual:
if self.cfg.agent_framework_repo is None:
Expand Down Expand Up @@ -532,6 +555,102 @@ async def _run_swe_agent(self, data_point, api_base):

return pred_jsonl_file

    async def _run_mini_swe_agent(self, data_point, api_base):
        """
        Runs mini-swe-agent on one instance.

        Builds a per-instance YAML config (base config + inference overrides), runs
        mini-swe-agent inside the instance container, then converts the resulting
        ``.traj.json`` trajectory into a one-line ``.jsonl`` prediction file.

        Args:
            data_point: A SWE-bench instance dict; ``instance_id`` and
                ``problem_statement`` keys are read here.
            api_base: Base URL of the LLM server, forwarded to litellm via
                ``model_kwargs``.

        Returns the absolute (not mounted) path to a .jsonl file in the SWE-bench evaluation format.
        """
        # Collect only the inference params the user explicitly set (None means
        # "not specified" and is deliberately not forwarded to the server),
        # mapped to their OpenAI-style parameter names.
        completion_kwargs = {
            openai_param: getattr(self.cfg.inference, ns_param)
            for ns_param, openai_param in NS_TO_OPENAI_PARAM.items()
            if getattr(self.cfg.inference, ns_param) is not None
        }
        # Any extra_body entries are passed through verbatim (and may override
        # the mapped params above).
        completion_kwargs.update(OmegaConf.to_container(self.cfg.inference.extra_body, resolve=True))
        if "top_logprobs" in completion_kwargs:
            # top_logprobs only takes effect when logprobs is enabled.
            completion_kwargs["logprobs"] = True
        if "reasoning_effort" in completion_kwargs:
            # NOTE(review): presumably needed so litellm forwards this
            # non-standard OpenAI param instead of dropping it — confirm.
            completion_kwargs["allowed_openai_params"] = ["reasoning_effort"]

        # Start from the user-provided agent config, or the bundled default.
        base_config_path = get_config_path(self.cfg.agent_config or "eval/swe-bench/mini-swe-agent/swebench")
        with open(base_config_path, "r") as f:
            full_config = yaml.safe_load(f)

        # Nemo-Skills always overrides the turn limit from ++agent_max_turns,
        # even if the config file sets its own value.
        if "agent" not in full_config:
            full_config["agent"] = {}
        full_config["agent"]["step_limit"] = self.cfg.agent_max_turns

        if "model" not in full_config:
            full_config["model"] = {}
        if "model_kwargs" not in full_config["model"]:
            full_config["model"]["model_kwargs"] = {}

        # Inject server endpoint and sampling params into the model section.
        # temperature/top_p are set unconditionally (they have Nemo-Skills
        # defaults), so they take precedence over the config file.
        full_config["model"]["model_kwargs"].update(
            {
                **completion_kwargs,
                "api_base": api_base,
                "temperature": self.cfg.inference.temperature,
                "top_p": self.cfg.inference.top_p,
            }
        )

        # Write the merged config under a per-instance filename so concurrent
        # instances don't clobber each other's configs.
        (self.output_dir / "configs").mkdir(parents=True, exist_ok=True)
        tmp_config_filename = f"configs/config_{data_point['instance_id']}.yaml"
        host_tmp_path = os.path.join(self.output_dir, tmp_config_filename)

        # Inside the container, this path maps to /trajectories_mount/
        container_tmp_path = os.path.join("/trajectories_mount", tmp_config_filename)

        with open(host_tmp_path, "w") as f:
            yaml.dump(full_config, f)

        try:
            mini_swe_agent_cmd = (
                # /root_mount holds the pre-built mini-swe-agent checkout and uv
                # install from setup_commands; copy them into the container's /root.
                "cp -r /root_mount/mini-swe-agent /root && "
                "cp -r /root_mount/uv /root && "
                "cd /root/mini-swe-agent && "
                # NOTE(review): MSWEA_CONFIGURED presumably skips mini-swe-agent's
                # interactive first-run setup wizard — confirm against upstream docs.
                "export MSWEA_CONFIGURED=true && "
                f"export MSWEA_MINI_CONFIG_PATH={container_tmp_path} && "
                # run the agent via the venv created in setup_commands
                f"/root/mini-swe-agent/venv/bin/python -m minisweagent.run.mini "
                f"--config {container_tmp_path} "
                f"--model hosted_vllm/{self.cfg.server.model} "
                # problem statement is shell-quoted since it is arbitrary text
                f"--task {shlex.quote(data_point['problem_statement'])} "
                f"--output trajectories/{data_point['instance_id']}.traj.json "
                f"--yolo "
                f"--exit-immediately && "
                # copy trajectories back out through the mounted output dir
                "mkdir -p /trajectories_mount/trajectories && cp -r trajectories/* /trajectories_mount/trajectories/"
            )

            # Execute mini-swe-agent command; search_path is the host-side file
            # the command is expected to produce.
            search_path = os.path.join(self.output_dir, "trajectories", f"{data_point['instance_id']}.traj.json")

            pred_file = await self._execute_container_command(
                data_point, mini_swe_agent_cmd, search_path, mode="agent"
            )

            with open(pred_file, "r") as f:
                trajectory_dict = json.loads(f.read().strip())

            # Convert the trajectory's "info" section into the single-line
            # prediction format expected by the SWE-bench evaluation harness.
            pred_jsonl_file = pred_file.replace(".traj.json", ".jsonl")
            with open(pred_jsonl_file, "w") as f:
                trajectory_info = trajectory_dict.get("info", {})
                trajectory_info["model_name_or_path"] = self.cfg.server.model
                trajectory_info["instance_id"] = data_point["instance_id"]

                # Rename "submission" -> "model_patch"; normalize an empty patch
                # to None and ensure a trailing newline so `git apply` accepts it.
                patch = trajectory_info.pop("submission", None)
                if not patch:
                    patch = None
                elif not patch.endswith("\n"):
                    patch += "\n"
                trajectory_info["model_patch"] = patch

                f.write(json.dumps(trajectory_info))

            return pred_jsonl_file

        finally:
            # Always remove the per-instance config, even if the agent run failed.
            if os.path.exists(host_tmp_path):
                os.remove(host_tmp_path)

async def _run_openhands(self, data_point, api_base):
"""
Runs OpenHands on one instance.
Expand Down Expand Up @@ -688,6 +807,8 @@ async def _process_single_datapoint_impl(self, data_point, data):

if self.cfg.agent_framework == SupportedAgentFrameworks.swe_agent:
pred_file = await self._run_swe_agent(data_point, api_base)
elif self.cfg.agent_framework == SupportedAgentFrameworks.mini_swe_agent:
pred_file = await self._run_mini_swe_agent(data_point, api_base)
elif self.cfg.agent_framework == SupportedAgentFrameworks.openhands:
pred_file = await self._run_openhands(data_point, api_base)
else:
Expand Down
Loading