diff --git a/docs/evaluation/code.md b/docs/evaluation/code.md index 596b15b850..26943e5f97 100644 --- a/docs/evaluation/code.md +++ b/docs/evaluation/code.md @@ -14,7 +14,7 @@ More details are coming soon! - Benchmark is defined in [`nemo_skills/dataset/swe-bench/__init__.py`](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) - Original benchmark source is [here](https://github.com/SWE-bench/SWE-bench). -Nemo-Skills can run inference (rollout) on SWE-bench-style datasets using 2 agentic frameworks: [SWE-agent](https://swe-agent.com/latest/) and [OpenHands](https://www.all-hands.dev/). It can then evaluate the generated patches on SWE-bench Verified/Lite/Full using the [official SWE-bench harness](https://www.swebench.com/SWE-bench/guides/evaluation/). +Nemo-Skills can run inference (rollout) on SWE-bench-style datasets using 3 agent frameworks: [SWE-agent](https://swe-agent.com/latest/), [mini-SWE-agent](https://mini-swe-agent.com/latest/) and [OpenHands](https://www.all-hands.dev/). It can then evaluate the generated patches on SWE-bench Verified/Lite/Full using the [official SWE-bench harness](https://www.swebench.com/SWE-bench/guides/evaluation/). #### Data preparation @@ -66,19 +66,19 @@ When this path is accessed during evaluation, `{instance_id}` will be replaced b There are a few parameters specific to SWE-bench. They have to be specified with the `++` prefix. All of them are optional, except for ++agent_framework. -- **++agent_framework:** which agentic framework to use. Must be either `swe_agent` or `openhands`. No default, must be specified explicitly. +- **++agent_framework:** which agent framework to use. Must be one of `swe_agent`, `mini_swe_agent` or `openhands`. No default, must be specified explicitly. -- **++agent_framework_repo:** URL of the repository to use for SWE-agent/OpenHands. Allows you to pass in a custom fork of these repositories. 
If you do this, you may find it helpful to check [nemo_skills/inference/eval/swebench.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/eval/swebench.py) to understand how the frameworks are used internally. This is passed directly as an argument to `git clone`. Defaults to the official repositories: [`https://github.com/SWE-agent/SWE-agent.git`](https://github.com/SWE-agent/SWE-agent) for SWE-agent, [`https://github.com/All-Hands-AI/OpenHands.git`](https://github.com/All-Hands-AI/OpenHands) for OpenHands. +- **++agent_framework_repo:** URL of the repository to use for SWE-agent/mini-SWE-agent/OpenHands. Allows you to pass in a custom fork of these repositories. If you do this, you may find it helpful to check [nemo_skills/inference/eval/swebench.py](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/inference/eval/swebench.py) to understand how the frameworks are used internally. This is passed directly as an argument to `git clone`. Defaults to the official repositories: [`https://github.com/SWE-agent/SWE-agent.git`](https://github.com/SWE-agent/SWE-agent) for SWE-agent, [`https://github.com/SWE-agent/mini-swe-agent.git`](https://github.com/SWE-agent/mini-swe-agent) for mini-SWE-agent, [`https://github.com/All-Hands-AI/OpenHands.git`](https://github.com/All-Hands-AI/OpenHands) for OpenHands. -- **++agent_framework_commit:** The commit hash, branch or tag to checkout after cloning agent_framework_repo. Allows you to pin SWE-agent/OpenHands to a specific version. Defaults to `HEAD`, i.e. the latest commit. +- **++agent_framework_commit:** The commit hash, branch, or tag to check out after cloning agent_framework_repo. Allows you to pin SWE-agent/mini-SWE-agent/OpenHands to a specific version. Defaults to `HEAD` for SWE-agent and OpenHands, and `v2.0` for mini-SWE-agent. -- **++agent_config:** The config file to use for SWE-agent/OpenHands. - - For SWE-agent, this is a YAML file. 
See the [SWE-agent docs](https://swe-agent.com/latest/config/config/). +- **++agent_config:** The config file to use for the agent framework. + - For SWE-agent and mini-SWE-agent, this is a YAML file. See the docs: [SWE-agent](https://swe-agent.com/latest/config/config/), [mini-SWE-agent](https://mini-swe-agent.com/latest/advanced/yaml_configuration/). - For OpenHands, this is a TOML file. Nemo-Skills runs OpenHands via their SWE-bench evaluation script, so the only settings you can set are the LLM settings under the `[llm.model]` section. For more details, see the [OpenHands evaluation README](https://github.com/All-Hands-AI/OpenHands/blob/main/evaluation/README.md). Note that Nemo-Skills always uses the `[llm.model]` config section and therefore does not support multiple LLM configurations in one TOML file. - Nemo-Skills overrides certain parameters, even if they are specified in the config file. These parameters are listed in a comment in the default config files below. - - Defaults to [eval/swe-bench/swe-agent/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/swe-agent/default.yaml) for SWE-agent, [eval/swe-bench/openhands/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/openhands/default.toml) for OpenHands. Note that if you store your configs in your local Nemo-Skills repo, then the path can be relative to the `nemo_skills/prompt` folder and the file extension is added automatically (same as how it works with regular [prompt configs](../basics/prompt-format.md)). 
+ - Defaults to [eval/swe-bench/swe-agent/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/swe-agent/default.yaml) for SWE-agent, [eval/swe-bench/mini-swe-agent/swebench](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml) for mini-SWE-agent, [eval/swe-bench/openhands/default](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/openhands/default.toml) for OpenHands. Note that if you store your configs in your local Nemo-Skills repo, then the path can be relative to the `nemo_skills/prompt` folder and the file extension is added automatically (same as how it works with regular [prompt configs](../basics/prompt-format.md)). -- **++agent_max_turns:** The maximum number of turns the agent is allowed to take before the trajectory is forcibly terminated. Defaults to 100 for both SWE-agent and OpenHands. +- **++agent_max_turns:** The maximum number of turns the agent is allowed to take before the trajectory is forcibly terminated. Defaults to 100 for all agent frameworks. - **++eval_harness_repo:** URL of the repository to use for the evaluation harness. This is passed directly as an argument to `git clone`. Defaults to [`https://github.com/Kipok/SWE-bench.git`](https://github.com/Kipok/SWE-bench), our fork of SWE-bench that supports local evaluation. @@ -94,7 +94,7 @@ There are a few parameters specific to SWE-bench. They have to be specified with #### Inference parameters -For this benchmark, inference parameters work a bit differently. This is because it does not use the Nemo-Skills LLM client, instead the interaction with the LLM server is handled by SWE-agent/OpenHands. +For this benchmark, inference parameters work a bit differently. This is because it does not use the Nemo-Skills LLM client; instead, the interaction with the LLM server is handled by the agent framework. 
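Since the agent framework owns the LLM calls, Nemo-Skills has to translate its `++inference.*` settings into an OpenAI-style request body. The merging behavior described in the next paragraph can be sketched roughly as follows (the function and variable names here are illustrative, not the actual Nemo-Skills internals):

```python
# Illustrative sketch of how inference parameters reach the LLM server:
# only explicitly-set parameters are forwarded, except temperature and
# top_p, which always have defaults; extra_body entries are merged on top.
DEFAULTS = {"temperature": 0, "top_p": 0.95}

def build_completion_kwargs(explicit_params, extra_body):
    kwargs = dict(DEFAULTS)
    # drop parameters the user never set (left as None)
    kwargs.update({k: v for k, v in explicit_params.items() if v is not None})
    # custom passthrough fields, e.g. chat_template_kwargs
    kwargs.update(extra_body)
    return kwargs
```

For example, `build_completion_kwargs({"max_tokens": 512, "top_k": None}, {"chat_template_kwargs": {"enable_thinking": False}})` forwards `max_tokens` and the `chat_template_kwargs` override together with the default `temperature` and `top_p`, while `top_k` is dropped because it was never set.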
Most inference parameters are not passed to the LLM by default if you don't explicitly specify them, with the exception of temperature (defaults to 0) and top_p (defaults to 0.95). Any parameters you set explicitly will be passed. Custom parameters can be set via extra_body like this: `++inference.extra_body.chat_template_kwargs.enable_thinking=False`. However, keep in mind certain parameters may not be supported by your LLM server. @@ -102,16 +102,17 @@ It's worth noting that when using VLLM with a HuggingFace model, any parameters #### Tool calling -SWE-bench requires models to call custom tools. By default SWE-agent & OpenHands expect that the LLM server supports *native tool calling*, which means the server can parse the model's tool calls and return them in a structured format separately from the rest of the model's output. This is convenient because the agentic framework doesn't have to know what the model's preferred tool call format is. In order to set this up, you need to add these arguments to `--server_args`: +SWE-bench requires models to call custom tools. By default agent frameworks expect that the LLM server supports *native tool calling*, which means the server can parse the model's tool calls and return them in a structured format separately from the rest of the model's output. This is convenient because the agent framework doesn't have to know what the model's preferred tool call format is. In order to set this up, you need to add these arguments to `--server_args`: - for VLLM: `--enable-auto-tool-choice --tool-call-parser ` - for SGLang: `--tool-call-parser ` For more details and the list of supported parsers, see the docs: [VLLM](https://docs.vllm.ai/en/stable/features/tool_calling.html#automatic-function-calling), [SGLang](https://docs.sglang.ai/advanced_features/function_calling.html). -In addition, both SWE-agent and OpenHands can run without native tool calling. This means the tool calls will be parsed by the agentic framework itself. 
To try this out, you can use the following configs with the `++agent_config` parameter: +In addition, all supported agent frameworks can run without native tool calling. This means the tool calls will be parsed by the agent framework itself. To try this out, you can use the following configs with the `++agent_config` parameter: - for SWE-agent: [eval/swe-bench/swe-agent/swe-agent-lm-32b](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/swe-agent/swe-agent-lm-32b.yaml). This was the config used for [SWE-agent-LM-32B](https://huggingface.co/SWE-bench/SWE-agent-LM-32B). Note that there are significant differences with the default config. +- for mini-SWE-agent: [eval/swe-bench/mini-swe-agent/swebench_xml](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_xml.yaml) or [eval/swe-bench/mini-swe-agent/swebench_backticks](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml). - for OpenHands: [eval/swe-bench/openhands/no-native-tool-calling](https://github.com/NVIDIA-NeMo/Skills/blob/main/nemo_skills/prompt/config/eval/swe-bench/openhands/no-native-tool-calling.toml). This simply sets `native_tool_calling` to `false`. Keep in mind that by default the tool call format expected by these frameworks will likely be different from the one that the model was trained on. @@ -156,8 +157,8 @@ After all jobs are complete, you can check the results in `/eval-res ``` Keep in mind there is some variance between runs, so we recommend running evaluation multiple times and averaging out the resolve rate. To do that automatically, you can set `--benchmarks=swe-bench:N`, where N is your desired number of repeats. -To evaluate the same model with SWE-agent, -all you need to do is replace `openhands` with `swe_agent` in the command above. 
+To evaluate the same model with SWE-agent or mini-SWE-agent, +all you need to do is replace `openhands` with `swe_agent` or `mini_swe_agent` in the command above. !!! note There are some instances where the gold (ground truth) patches do not pass the evaluation tests. Therefore, it's likely that on those instances even patches that resolve the issue will be incorrectly evaluated as "unresolved". We have observed 11 such instances in SWE-bench Verified: `astropy__astropy-7606`, `astropy__astropy-8707`, `astropy__astropy-8872`, `django__django-10097`, `psf__requests-1724`, `psf__requests-1766`, `psf__requests-1921`, `psf__requests-2317`, `pylint-dev__pylint-6528`, `pylint-dev__pylint-7080`, `pylint-dev__pylint-7277`. Depending on your setup, this set of instances may be different. diff --git a/nemo_skills/inference/eval/swebench.py b/nemo_skills/inference/eval/swebench.py index 4a75bbf7d4..9cbd82b121 100644 --- a/nemo_skills/inference/eval/swebench.py +++ b/nemo_skills/inference/eval/swebench.py @@ -26,6 +26,7 @@ import hydra import tomlkit +import yaml from omegaconf import OmegaConf from nemo_skills.inference.generate import GenerationTask @@ -44,6 +45,7 @@ class SupportedAgentFrameworks(str, Enum): swe_agent = "swe_agent" openhands = "openhands" + mini_swe_agent = "mini_swe_agent" # Like nemo_skills.inference.generate.InferenceConfig, except most parameters are not passed by default @@ -254,6 +256,27 @@ def __init__(self, cfg: SweBenchGenerationConfig): "uv pip install rich==14.2.0" ) + elif self.cfg.agent_framework == SupportedAgentFrameworks.mini_swe_agent: + if self.cfg.agent_framework_repo is None: + self.cfg.agent_framework_repo = "https://github.com/SWE-agent/mini-swe-agent.git" + if self.cfg.agent_framework_commit is None: + self.cfg.agent_framework_commit = "v2.0" + setup_commands.append( + # clone the mini-swe-agent repo + "rm -rf /root/mini-swe-agent && " + f"git clone {self.cfg.agent_framework_repo} /root/mini-swe-agent && " + "cd /root/mini-swe-agent && " 
+ # Bypass the interactive setup wizard by pointing to the default config + "export MSWEA_MINI_CONFIG_PATH=/root/mini-swe-agent/src/minisweagent/config/benchmarks/swebench.yaml && " + f"git checkout {self.cfg.agent_framework_commit} && " + # make venv & install mini-swe-agent dependencies + "uv venv --python 3.12 --managed-python venv && " + "source venv/bin/activate && " + "uv pip install -e . && " + # force downgrade rich - newer versions cause the swe-agent logger to hang in some instances + "uv pip install rich==14.2.0" + ) + elif self.cfg.agent_framework == SupportedAgentFrameworks.openhands: if self.cfg.multilingual: if self.cfg.agent_framework_repo is None: @@ -532,6 +555,102 @@ async def _run_swe_agent(self, data_point, api_base): return pred_jsonl_file + async def _run_mini_swe_agent(self, data_point, api_base): + """ + Runs mini-swe-agent on one instance. + Returns the absolute (not mounted) path to a .jsonl file in the SWE-bench evaluation format. + """ + completion_kwargs = { + openai_param: getattr(self.cfg.inference, ns_param) + for ns_param, openai_param in NS_TO_OPENAI_PARAM.items() + if getattr(self.cfg.inference, ns_param) is not None + } + completion_kwargs.update(OmegaConf.to_container(self.cfg.inference.extra_body, resolve=True)) + if "top_logprobs" in completion_kwargs: + completion_kwargs["logprobs"] = True + if "reasoning_effort" in completion_kwargs: + completion_kwargs["allowed_openai_params"] = ["reasoning_effort"] + + base_config_path = get_config_path(self.cfg.agent_config or "eval/swe-bench/mini-swe-agent/swebench") + with open(base_config_path, "r") as f: + full_config = yaml.safe_load(f) + + if "agent" not in full_config: + full_config["agent"] = {} + full_config["agent"]["step_limit"] = self.cfg.agent_max_turns + + if "model" not in full_config: + full_config["model"] = {} + if "model_kwargs" not in full_config["model"]: + full_config["model"]["model_kwargs"] = {} + + full_config["model"]["model_kwargs"].update( + { + 
**completion_kwargs, + "api_base": api_base, + "temperature": self.cfg.inference.temperature, + "top_p": self.cfg.inference.top_p, + } + ) + + (self.output_dir / "configs").mkdir(parents=True, exist_ok=True) + tmp_config_filename = f"configs/config_{data_point['instance_id']}.yaml" + host_tmp_path = os.path.join(self.output_dir, tmp_config_filename) + + # Inside the container, this path maps to /trajectories_mount/ + container_tmp_path = os.path.join("/trajectories_mount", tmp_config_filename) + + with open(host_tmp_path, "w") as f: + yaml.dump(full_config, f) + + try: + mini_swe_agent_cmd = ( + "cp -r /root_mount/mini-swe-agent /root && " + "cp -r /root_mount/uv /root && " + "cd /root/mini-swe-agent && " + "export MSWEA_CONFIGURED=true && " + f"export MSWEA_MINI_CONFIG_PATH={container_tmp_path} && " + f"/root/mini-swe-agent/venv/bin/python -m minisweagent.run.mini " + f"--config {container_tmp_path} " + f"--model hosted_vllm/{self.cfg.server.model} " + f"--task {shlex.quote(data_point['problem_statement'])} " + f"--output trajectories/{data_point['instance_id']}.traj.json " + f"--yolo " + f"--exit-immediately && " + "mkdir -p /trajectories_mount/trajectories && cp -r trajectories/* /trajectories_mount/trajectories/" + ) + + # Execute mini-swe-agent command + search_path = os.path.join(self.output_dir, "trajectories", f"{data_point['instance_id']}.traj.json") + + pred_file = await self._execute_container_command( + data_point, mini_swe_agent_cmd, search_path, mode="agent" + ) + + with open(pred_file, "r") as f: + trajectory_dict = json.loads(f.read().strip()) + + pred_jsonl_file = pred_file.replace(".traj.json", ".jsonl") + with open(pred_jsonl_file, "w") as f: + trajectory_info = trajectory_dict.get("info", {}) + trajectory_info["model_name_or_path"] = self.cfg.server.model + trajectory_info["instance_id"] = data_point["instance_id"] + + patch = trajectory_info.pop("submission", None) + if not patch: + patch = None + elif not patch.endswith("\n"): + patch += "\n" 
+ trajectory_info["model_patch"] = patch + + f.write(json.dumps(trajectory_info)) + + return pred_jsonl_file + + finally: + if os.path.exists(host_tmp_path): + os.remove(host_tmp_path) + async def _run_openhands(self, data_point, api_base): """ Runs OpenHands on one instance. @@ -688,6 +807,8 @@ async def _process_single_datapoint_impl(self, data_point, data): if self.cfg.agent_framework == SupportedAgentFrameworks.swe_agent: pred_file = await self._run_swe_agent(data_point, api_base) + elif self.cfg.agent_framework == SupportedAgentFrameworks.mini_swe_agent: + pred_file = await self._run_mini_swe_agent(data_point, api_base) elif self.cfg.agent_framework == SupportedAgentFrameworks.openhands: pred_file = await self._run_openhands(data_point, api_base) else: diff --git a/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml b/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml new file mode 100644 index 0000000000..8ccd2cba8e --- /dev/null +++ b/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench.yaml @@ -0,0 +1,153 @@ +# source: https://github.com/SWE-agent/mini-swe-agent/blob/v2.0/src/minisweagent/config/benchmarks/swebench.yaml +agent: + system_template: | + You are a helpful assistant that can interact with a computer shell to solve programming tasks. + instance_template: | + + Consider the following PR description: + {{task}} + + + + # Task Instructions + + ## Overview + + You're a software engineer interacting continuously with a computer by submitting commands. + You'll be helping implement necessary changes to meet requirements in the PR description. + Your task is specifically to make changes to non-test files in the current directory in order to fix the issue described in the PR description in a way that is general and consistent with the codebase. + This is an interactive process where you will think and issue AT LEAST ONE command, see the result, then think and issue your next command(s). 
+ + For each response: + + 1. Include a THOUGHT section explaining your reasoning and what you're trying to accomplish + 2. Provide exactly ONE bash command to execute + + ## Important Boundaries + + - MODIFY: Regular source code files in /testbed (this is the working directory for all your subsequent commands) + - DO NOT MODIFY: Tests, configuration files (pyproject.toml, setup.cfg, etc.) + + ## Recommended Workflow + + 1. Analyze the codebase by finding and reading relevant files + 2. Create a script to reproduce the issue + 3. Edit the source code to resolve the issue + 4. Verify your fix works by running your script again + 5. Test edge cases to ensure your fix is robust + + ## Command Execution Rules + + You are operating in an environment where + + 1. You issue at least one command + 2. The system executes the command(s) in a subshell + 3. You see the result(s) + 4. You write your next command(s) + + Each response should include: + + 1. **Reasoning text** where you explain your analysis and plan + 2. At least one tool call with your command + + **CRITICAL REQUIREMENTS:** + + - Your response SHOULD include reasoning text explaining what you're doing + - Your response MUST include AT LEAST ONE bash tool call + - Directory or environment variable changes are not persistent. Every action is executed in a new subshell. + - However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files + + Example of a CORRECT response: + + I need to understand the structure of the repository first. Let me check what files are in the current directory to get a better understanding of the codebase. 
+ + [Makes bash tool call with {"command": "ls -la"} as arguments] + + + ## Environment Details + + - You have a full Linux shell environment + - Always use non-interactive flags (-y, -f) for commands + - Avoid interactive tools like vi, nano, or any that require user input + - You can use bash commands or invoke any tool that is available in the environment + - You can also create new tools or scripts to help you with the task + - If a tool isn't available, you can also install it + + ## Submission + + When you've completed your work, you MUST submit your changes as a git patch. + Follow these steps IN ORDER, with SEPARATE commands: + + Step 1: Create the patch file + Run `git diff -- path/to/file1 path/to/file2 > patch.txt` listing only the source files you modified. + Do NOT commit your changes. + + + The patch must only contain changes to the specific source files you modified to fix the issue. + Do not submit file creations or changes to any of the following files: + + - test and reproduction files + - helper scripts, tests, or tools that you created + - installation, build, packaging, configuration, or setup scripts unless they are directly part of the issue you were fixing (you can assume that the environment is already set up for your client) + - binary or compiled files + + + Step 2: Verify your patch + Inspect patch.txt to confirm it only contains your intended changes and headers show `--- a/` and `+++ b/` paths. + + Step 3: Submit (EXACT command required) + You MUST use this EXACT command to submit: + + ```bash + echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt + ``` + + If the command fails (nonzero exit status), it will not submit. + + + - Creating/viewing the patch and submitting it MUST be separate commands (not combined with &&). + - If you modify patch.txt after verifying, you SHOULD verify again before submitting. + - You CANNOT continue working (reading, editing, testing) in any way on this task after submitting. + + + cost_limit: 0. 
# no limit + +environment: + cwd: "/testbed" + timeout: 60 + env: + PAGER: cat + MANPAGER: cat + LESS: -R + PIP_PROGRESS_BAR: 'off' + TQDM_DISABLE: '1' + environment_class: docker + +model: + observation_template: | + {%- if output.output | length < 10000 -%} + { + "returncode": {{ output.returncode }}, + "output": {{ output.output | tojson }} + {%- if output.exception_info %}, "exception_info": {{ output.exception_info | tojson }}{% endif %} + } + {%- else -%} + { + "returncode": {{ output.returncode }}, + "output_head": {{ output.output[:5000] | tojson }}, + "output_tail": {{ output.output[-5000:] | tojson }}, + "elided_chars": {{ output.output | length - 10000 }}, + "warning": "Output too long." + {%- if output.exception_info %}, "exception_info": {{ output.exception_info | tojson }}{% endif %} + } + {%- endif -%} + format_error_template: | + Tool call error. Every response needs to use the 'bash' tool at least once to execute commands. + + Call the bash tool with your command as the argument: + - Tool: bash + - Arguments: {"command": "your_command_here"} + + If you have completed your assignment, please consult the first message about how to + submit your solution (you will not be able to continue working on this task after that). + cost_tracking: "ignore_errors" diff --git a/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml b/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml new file mode 100644 index 0000000000..dea9d41255 --- /dev/null +++ b/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_backticks.yaml @@ -0,0 +1,232 @@ +# source: https://github.com/SWE-agent/mini-swe-agent/blob/v2.0/src/minisweagent/config/benchmarks/swebench_backticks.yaml +agent: + system_template: | + You are a helpful assistant that can interact multiple times with a computer shell to solve programming tasks. 
+ Your response must contain exactly ONE bash code block with ONE command (or commands connected with && or ||). + + Include a THOUGHT section before your command where you explain your reasoning process. + Format your response as shown in . + + + THOUGHT: Your reasoning and analysis here + + ```mswea_bash_command + your_command_here + ``` + + + Failure to follow these rules will cause your response to be rejected. + instance_template: | + + Consider the following PR description: + {{task}} + + + + # Task Instructions + + ## Overview + + You're a software engineer interacting continuously with a computer by submitting commands. + You'll be helping implement necessary changes to meet requirements in the PR description. + Your task is specifically to make changes to non-test files in the current directory in order to fix the issue described in the PR description in a way that is general and consistent with the codebase. + + This is an interactive process where you will think and issue ONE command, see its result, then think and issue your next command. + + For each response: + + 1. Include a THOUGHT section explaining your reasoning and what you're trying to accomplish + 2. Provide exactly ONE bash command to execute + + ## Important Boundaries + + - MODIFY: Regular source code files in /testbed (this is the working directory for all your subsequent commands) + - DO NOT MODIFY: Tests, configuration files (pyproject.toml, setup.cfg, etc.) + + ## Recommended Workflow + + 1. Analyze the codebase by finding and reading relevant files + 2. Create a script to reproduce the issue + 3. Edit the source code to resolve the issue + 4. Verify your fix works by running your script again + 5. Test edge cases to ensure your fix is robust + + ## Command Execution Rules + + You are operating in an environment where + + 1. You write a single command + 2. The system executes that command in a subshell + 3. You see the result + 4. 
You write your next command + + Each response should include: + + 1. A **THOUGHT** section where you explain your reasoning and plan + 2. A single bash code block with your command + + Format your responses like demonstrated within the block: + + + THOUGHT: Here I explain my reasoning process, analysis of the current situation, + and what I'm trying to accomplish with the command below. + + ```mswea_bash_command + your_command_here + ``` + + + Commands must be specified in a single bash code block: + + ```mswea_bash_command + your_command_here + ``` + + **CRITICAL REQUIREMENTS:** + + - Your response SHOULD include a THOUGHT section explaining your reasoning + - Your response MUST include EXACTLY ONE bash code block + - This bash block MUST contain EXACTLY ONE command (or a set of commands connected with && or ||) + - If you include zero or multiple bash blocks, or no command at all, YOUR RESPONSE WILL FAIL + - Do NOT try to run multiple independent commands in separate blocks in one response + - Directory or environment variable changes are not persistent. Every action is executed in a new subshell. + - However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files + + Example of a CORRECT response: + + THOUGHT: I need to understand the structure of the repository first. Let me check what files are in the current directory to get a better understanding of the codebase. + + ```mswea_bash_command + ls -la + ``` + + + Example of an INCORRECT response: + + + THOUGHT: I need to examine the codebase and then look at a specific file. I'll run multiple commands to do this. + + ```mswea_bash_command + ls -la + ``` + + Now I'll read the file: + + ```mswea_bash_command + cat file.txt + ``` + + + If you need to run multiple commands, either: + + 1. Combine them in one block using && or || + ```mswea_bash_command + command1 && command2 || echo "Error occurred" + ``` + + 2. 
Wait for the first command to complete, see its output, then issue the next command in your following response. + + ## Environment Details + + - You have a full Linux shell environment + - Always use non-interactive flags (-y, -f) for commands + - Avoid interactive tools like vi, nano, or any that require user input + - You can use bash commands or invoke any tool that is available in the environment + - You can also create new tools or scripts to help you with the task + - If a tool isn't available, you can also install it + + ## Submission + + When you've completed your work, you MUST submit your changes as a git patch. + Follow these steps IN ORDER, with SEPARATE commands: + + Step 1: Create the patch file + Run `git diff -- path/to/file1 path/to/file2 > patch.txt` listing only the source files you modified. + Do NOT commit your changes. + + + The patch must only contain changes to the specific source files you modified to fix the issue. + Do not submit file creations or changes to any of the following files: + + - test and reproduction files + - helper scripts, tests, or tools that you created + - installation, build, packaging, configuration, or setup scripts unless they are directly part of the issue you were fixing (you can assume that the environment is already set up for your client) + - binary or compiled files + + + Step 2: Verify your patch + Inspect patch.txt to confirm it only contains your intended changes and headers show `--- a/` and `+++ b/` paths. + + Step 3: Submit (EXACT command required) + You MUST use this EXACT command to submit: + + ```mswea_bash_command + echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt + ``` + + If the command fails (nonzero exit status), it will not submit. + + + - Creating/viewing the patch and submitting it MUST be separate commands (not combined with &&). + - If you modify patch.txt after verifying, you SHOULD verify again before submitting. 
+ - You CANNOT continue working (reading, editing, testing) in any way on this task after submitting. + + + cost_limit: 0. # no limit + +environment: + cwd: "/testbed" + timeout: 60 + env: + PAGER: cat + MANPAGER: cat + LESS: -R + PIP_PROGRESS_BAR: 'off' + TQDM_DISABLE: '1' + environment_class: docker + +model: + observation_template: | + {% if output.exception_info -%} + {{output.exception_info}} + {% endif -%} + {{output.returncode}} + {% if output.output | length < 10000 -%} + + {{ output.output -}} + + {%- else -%} + + The output of your last command was too long. + Please try a different command that produces less output. + If you're looking at a file you can try use head, tail or sed to view a smaller number of lines selectively. + If you're using grep or find and it produced too much output, you can use a more selective search pattern. + If you really need to see something from the full command's output, you can redirect output to a file and then search in that file. + + {%- set elided_chars = output.output | length - 10000 -%} + + {{ output.output[:5000] }} + + + {{ elided_chars }} characters elided + + + {{ output.output[-5000:] }} + + {%- endif -%} + format_error_template: | + Please always provide EXACTLY ONE action in triple backticks, found {{actions|length}} actions. + + Please format your action in triple backticks as shown in . + + + Here are some thoughts about why you want to perform the action. + + ```mswea_bash_command + + ``` + + + If you have completed your assignment, please consult the first message about how to + submit your solution (you will not be able to continue working on this task after that). 
+  cost_tracking: "ignore_errors"
diff --git a/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_xml.yaml b/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_xml.yaml
new file mode 100644
index 0000000000..803ed8238f
--- /dev/null
+++ b/nemo_skills/prompt/config/eval/swe-bench/mini-swe-agent/swebench_xml.yaml
@@ -0,0 +1,215 @@
+# source: https://github.com/SWE-agent/mini-swe-agent/blob/v2.0/src/minisweagent/config/benchmarks/swebench_xml.yaml
+agent:
+  system_template: |
+    You are a helpful assistant that can interact multiple times with a computer shell to solve programming tasks.
+    Your response must contain exactly ONE bash code block with ONE command (or commands connected with && or ||).
+
+    Include a THOUGHT section before your command where you explain your reasoning process.
+    Format your response as shown in <format_example>.
+
+    <format_example>
+    THOUGHT: Your reasoning and analysis here
+
+    <mswea_bash_command>
+    your_command_here
+    </mswea_bash_command>
+    </format_example>
+
+    Failure to follow these rules will cause your response to be rejected.
+  instance_template: |
+    <pr_description>
+    Consider the following PR description:
+    {{task}}
+    </pr_description>
+
+    # Task Instructions
+
+    ## Overview
+
+    You're a software engineer interacting continuously with a computer by submitting commands.
+    You'll be helping implement necessary changes to meet requirements in the PR description.
+    Your task is specifically to make changes to non-test files in the current directory in order to fix the issue described in the PR description in a way that is general and consistent with the codebase.
+
+    This is an interactive process where you will think and issue ONE command, see its result, then think and issue your next command.
+
+    For each response:
+
+    1. Include a THOUGHT section explaining your reasoning and what you're trying to accomplish
+    2. Provide exactly ONE bash command to execute
+
+    ## Important Boundaries
+
+    - MODIFY: Regular source code files in /testbed (this is the working directory for all your subsequent commands)
+    - DO NOT MODIFY: Tests, configuration files (pyproject.toml, setup.cfg, etc.)
+
+    ## Recommended Workflow
+
+    1. Analyze the codebase by finding and reading relevant files
+    2. Create a script to reproduce the issue
+    3. Edit the source code to resolve the issue
+    4. Verify your fix works by running your script again
+    5. Test edge cases to ensure your fix is robust
+
+    ## Command Execution Rules
+
+    You are operating in an environment where
+
+    1. You write a single command
+    2. The system executes that command in a subshell
+    3. You see the result
+    4. You write your next command
+
+    Each response should include:
+
+    1. A **THOUGHT** section where you explain your reasoning and plan
+    2. A single bash code block with your command
+
+    Format your responses like demonstrated within the <format_example> block:
+
+    <format_example>
+    THOUGHT: Here I explain my reasoning process, analysis of the current situation,
+    and what I'm trying to accomplish with the command below.
+
+    <mswea_bash_command>
+    your_command_here
+    </mswea_bash_command>
+    </format_example>
+
+    Commands must be specified in a single bash XML tag:
+
+    <mswea_bash_command>
+    your_command_here
+    </mswea_bash_command>
+
+    **CRITICAL REQUIREMENTS:**
+
+    - Your response SHOULD include a THOUGHT section explaining your reasoning
+    - Your response MUST include EXACTLY ONE mswea_bash_command tag
+    - This bash mswea_bash_command MUST contain EXACTLY ONE command (or a set of commands connected with && or ||)
+    - If you include zero or multiple tags, or no command at all, YOUR RESPONSE WILL FAIL
+    - Do NOT try to run multiple independent commands in separate blocks in one response
+    - Directory or environment variable changes are not persistent. Every action is executed in a new subshell.
+    - However, you can prefix any action with `MY_ENV_VAR=MY_VALUE cd /path/to/working/dir && ...` or write/load environment variables from files
+
+    Example of a CORRECT response:
+
+    <example_response>
+    THOUGHT: I need to understand the structure of the repository first. Let me check what files are in the current directory to get a better understanding of the codebase.
+
+    <mswea_bash_command>
+    ls -la
+    </mswea_bash_command>
+    </example_response>
+
+    Example of an INCORRECT response:
+
+    <example_response>
+    THOUGHT: I need to examine the codebase and then look at a specific file. I'll run multiple commands to do this.
+
+    <mswea_bash_command>
+    ls -la
+    </mswea_bash_command>
+
+    Now I'll read the file:
+
+    <mswea_bash_command>
+    cat file.txt
+    </mswea_bash_command>
+    </example_response>
+
+    If you need to run multiple commands, either:
+
+    1. Combine them in one block using && or ||
+
+    <mswea_bash_command>
+    command1 && command2 || echo "Error occurred"
+    </mswea_bash_command>
+
+    2. Wait for the first command to complete, see its output, then issue the next command in your following response.
+
+    ## Environment Details
+
+    - You have a full Linux shell environment
+    - Always use non-interactive flags (-y, -f) for commands
+    - Avoid interactive tools like vi, nano, or any that require user input
+    - You can use bash commands or invoke any tool that is available in the environment
+    - You can also create new tools or scripts to help you with the task
+    - If a tool isn't available, you can also install it
+
+    ## Submission
+
+    When you've completed your work, you MUST submit your changes as a git patch.
+    Follow these steps IN ORDER, with SEPARATE commands:
+
+    Step 1: Create the patch file
+    Run `git diff -- path/to/file1 path/to/file2 > patch.txt` listing only the source files you modified.
+    Do NOT commit your changes.
+
+    The patch must only contain changes to the specific source files you modified to fix the issue.
+    Do not submit file creations or changes to any of the following files:
+
+    - test and reproduction files
+    - helper scripts, tests, or tools that you created
+    - installation, build, packaging, configuration, or setup scripts unless they are directly part of the issue you were fixing (you can assume that the environment is already set up for your client)
+    - binary or compiled files
+
+    Step 2: Verify your patch
+    Inspect patch.txt to confirm it only contains your intended changes and headers show `--- a/` and `+++ b/` paths.
+
+    Step 3: Submit (EXACT command required)
+    You MUST use this EXACT command to submit:
+
+    <mswea_bash_command>
+    echo COMPLETE_TASK_AND_SUBMIT_FINAL_OUTPUT && cat patch.txt
+    </mswea_bash_command>
+
+    If the command fails (nonzero exit status), it will not submit.
+
+    - Creating/viewing the patch and submitting it MUST be separate commands (not combined with &&).
+    - If you modify patch.txt after verifying, you SHOULD verify again before submitting.
+    - You CANNOT continue working (reading, editing, testing) in any way on this task after submitting.
+  cost_limit: 0. # no limit
+
+environment:
+  cwd: "/testbed"
+  timeout: 60
+  env:
+    PAGER: cat
+    MANPAGER: cat
+    LESS: -R
+    PIP_PROGRESS_BAR: 'off'
+    TQDM_DISABLE: '1'
+  environment_class: docker
+
+model:
+  observation_template: |
+    {% if output.exception_info -%}
+    <warning>
+    {{output.exception_info}}
+    </warning>
+    {% endif -%}
+    <returncode>{{output.returncode}}</returncode>
+    {% if output.output | length < 10000 -%}
+    <output>
+    {{ output.output -}}
+    </output>
+    {%- else -%}
+    <warning>
+    The output of your last command was too long.
+    Please try a different command that produces less output.
+    If you're looking at a file you can try use head, tail or sed to view a smaller number of lines selectively.
+    If you're using grep or find and it produced too much output, you can use a more selective search pattern.
+    If you really need to see something from the full command's output, you can redirect output to a file and then search in that file.
+    </warning>
+    {%- set elided_chars = output.output | length - 10000 -%}
+    <output_head>
+    {{ output.output[:5000] }}
+    </output_head>
+    <elided_chars>
+    {{ elided_chars }} characters elided
+    </elided_chars>
+    <output_tail>
+    {{ output.output[-5000:] }}
+    </output_tail>
+    {%- endif -%}
+  action_regex: <mswea_bash_command>(.*?)</mswea_bash_command>
+  format_error_template: |
+    Please always provide EXACTLY ONE action in the `<mswea_bash_command>` block, found {{actions|length}} actions.
+
+    Please format your action in a `<mswea_bash_command>` block as shown in <format_example>.
+
+    <format_example>
+    Here are some thoughts about why you want to perform the action.
+
+    <mswea_bash_command>
+    ls -la
+    </mswea_bash_command>
+    </format_example>
+
+    If you have completed your assignment, please consult the first message about how to
+    submit your solution (you will not be able to continue working on this task after that).
+  cost_tracking: "ignore_errors"