Add SWE-bench inference & evaluation #671

Merged
Kipok merged 72 commits into main from ludwig-n/openhands
Aug 15, 2025
@ludwig-n ludwig-n commented Aug 14, 2025

This PR adds SWE-bench inference (featuring 2 agentic frameworks: SWE-agent and OpenHands) and evaluation using the official SWE-bench harness.

Sample evaluation

Here's how to evaluate Qwen3-Coder-30B-A3B-Instruct with OpenHands on a Slurm cluster.
First, prepare the data (SWE-bench Verified by default) by running

ns prepare_data swe-bench

Inference and evaluation run inside prebuilt container images from the SWE-bench team. By default, the prepare_data command configures them to be downloaded from Docker Hub every time you run ns eval.
If the SWE-bench images are already downloaded somewhere on the cluster, add that folder to the mounts in your cluster config and pass the --container_formatter option with the mounted path to the images, e.g.

ns prepare_data swe-bench \
    --container_formatter "/swe-bench-images/swebench_sweb.eval.x86_64.{instance_id}.sif"
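
To illustrate how the {instance_id} placeholder in the formatter above might resolve to one image path per instance, here is a minimal sketch. It assumes the formatter behaves like a Python str.format template; the helper name container_path is hypothetical, the instance ID is only an example, and the real implementation may normalize names differently.

```python
# Hypothetical illustration, not the actual NeMo-Skills code.
# Assumption: the formatter is a str.format-style template with one
# {instance_id} field that is filled in for every SWE-bench instance.
CONTAINER_FORMATTER = "/swe-bench-images/swebench_sweb.eval.x86_64.{instance_id}.sif"


def container_path(instance_id: str, formatter: str = CONTAINER_FORMATTER) -> str:
    """Return the image path for one instance by filling in the template."""
    return formatter.format(instance_id=instance_id)


# Example instance ID, used here purely for illustration.
example_path = container_path("astropy__astropy-12907")
print(example_path)
```

Under this assumption, each instance needs its own image file at the resulting path inside the mounted folder.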

Then, to run the agent on all instances and evaluate the generated patches, run

ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --server_type=vllm \
    --server_args="--enable-auto-tool-choice --tool-call-parser qwen3_coder" \
    --server_nodes=1 \
    --server_gpus=8 \
    --benchmarks=swe-bench \
    --expname=<EXPNAME> \
    --output_dir=<OUTPUT_DIR> \
    --num_chunks=2 \
    ++agent_framework=openhands \
    ++inference.temperature=0.7 \
    ++inference.top_p=0.8 \
    ++inference.top_k=20

replacing each <...> placeholder with your own values. If the model is already downloaded on the cluster, you can set --model to the path to the weights instead.

I ran 5 identical evaluations with this command and got the following scores on SWE-bench Verified: 50.0, 49.8, 49.6, 51.8, 51.0, with an average of 50.4. The official reported score for this model is 51.6.
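
As a quick check of the numbers above, the following sketch recomputes the average and also reports the run-to-run spread, which helps when judging whether the gap to the reported score is within noise. The variable names are illustrative, and the sample standard deviation is an added statistic that is not part of the original report.

```python
from statistics import mean, stdev

# Scores from the five identical runs listed above (SWE-bench Verified, %).
scores = [50.0, 49.8, 49.6, 51.8, 51.0]
reported = 51.6  # official reported score quoted above

avg = mean(scores)       # arithmetic mean of the five runs
spread = stdev(scores)   # sample standard deviation (n - 1 denominator)
gap = reported - avg     # how far the average is below the reported score

print(f"mean={avg:.1f} stdev={spread:.2f} gap_to_reported={gap:.1f}")
```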

Kipok added 30 commits July 25, 2025 10:58
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Kipok and others added 11 commits July 29, 2025 19:12
@ludwig-n ludwig-n requested a review from Kipok August 14, 2025 18:19
Kipok and others added 11 commits August 14, 2025 11:35

@Kipok Kipok left a comment

Thanks!

Kipok commented Aug 15, 2025

@ludwig-n when you get a chance, can you please update the description of this PR with an example eval command and the expected score? We will eventually port it to the evaluation docs we plan to add in the near future.

@Kipok Kipok merged commit c3750ce into main Aug 15, 2025
3 of 4 checks passed
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Aug 18, 2025
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
wedu-nvidia pushed a commit that referenced this pull request Aug 22, 2025
@ludwig-n ludwig-n deleted the ludwig-n/openhands branch September 4, 2025 11:52
wasiahmad pushed a commit that referenced this pull request Oct 1, 2025