Add SWE-bench inference & evaluation by ludwig-n · Pull Request #671 · NVIDIA-NeMo/Skills

ludwig-n · 2025-08-14T18:19:09Z

This PR adds SWE-bench inference (featuring 2 agentic frameworks: SWE-agent and OpenHands) and evaluation using the official SWE-bench harness.

Sample evaluation

Here's how to evaluate Qwen3-Coder-30B-A3B-Instruct with OpenHands on a Slurm cluster.
First, prepare the data (SWE-bench Verified by default) by running

ns prepare_data swe-bench

Inference and evaluation run inside prebuilt container images from the SWE-bench team. By default, the command above configures them to be downloaded from Docker Hub every time you run ns eval.
If you already have the SWE-bench images downloaded somewhere on the cluster, add that folder to the mounts in your cluster config and use the --container_formatter option to specify the mounted path to the images, for example:

ns prepare_data swe-bench \
    --container_formatter "/swe-bench-images/swebench_sweb.eval.x86_64.{instance_id}.sif"
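
As a rough illustration of what the --container_formatter value implies, the sketch below expands the {instance_id} placeholder for a single instance. It assumes plain textual substitution; the instance ID is made up, and the actual tool may normalize IDs before substituting them, so treat the resulting file name as hypothetical.

```shell
# Template copied from the command above; {instance_id} is the placeholder.
template="/swe-bench-images/swebench_sweb.eval.x86_64.{instance_id}.sif"

# Hypothetical instance ID, for illustration only.
instance_id="example-repo-123"

# Plain textual substitution of the placeholder (an assumption about the tool).
image_path=$(printf '%s\n' "$template" | sed "s/{instance_id}/$instance_id/")
echo "$image_path"
```

Checking that a few expanded paths exist inside the mount (for example with ls) before launching ns eval can catch a wrong mount or file-name pattern early.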

Then, to run the agent on all instances and evaluate the generated patches, run

ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=Qwen/Qwen3-Coder-30B-A3B-Instruct \
    --server_type=vllm \
    --server_args="--enable-auto-tool-choice --tool-call-parser qwen3_coder" \
    --server_nodes=1 \
    --server_gpus=8 \
    --benchmarks=swe-bench \
    --expname=<EXPNAME> \
    --output_dir=<OUTPUT_DIR> \
    --num_chunks=2 \
    ++agent_framework=openhands \
    ++inference.temperature=0.7 \
    ++inference.top_p=0.8 \
    ++inference.top_k=20

replacing each <...> placeholder with your own values. If the model is already downloaded on the cluster, you can set --model to the path to the weights instead.

I ran five identical evaluations with this command and got the following scores on SWE-bench Verified: 50.0, 49.8, 49.6, 51.8, 51.0, for an average of 50.4. The officially reported score for this model is 51.6.
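
For reference, the average and spread of the five scores can be recomputed with a few lines of shell. This is only a sanity check on the numbers quoted above; the variable names are illustrative and not part of the PR.

```shell
# Scores from the five runs reported above (SWE-bench Verified, % resolved).
scores="50.0 49.8 49.6 51.8 51.0"

# Compute mean, min, and max with awk; printf emits one score per line.
summary=$(printf '%s\n' $scores | awk '
    NR == 1 { min = $1; max = $1 }
    { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END { printf "mean=%.2f min=%.1f max=%.1f", sum / NR, min, max }
')
echo "$summary"
```

The mean of 50.44 rounds to the 50.4 quoted above, about 1.2 points below the reported 51.6, with individual runs ranging from 49.6 to 51.8.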

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok

Thanks!

Kipok · 2025-08-15T23:12:37Z

@ludwig-n when you get a chance, can you please update the description of this PR with an example eval command and the expected score? We will eventually port it to the evaluation docs that we plan to add in the near future.

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

Kipok added 30 commits July 25, 2025 10:58

Add swe-bench dataset

9020605

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Support for multiple sandbox containers

cdad007

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Initial implementation for swe-agent

6a3f14a

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Switching to apptainer

c40a333

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Switch to mounted trajectories dir

9c3a432

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Roll-back sandbox changes

7e268f7

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Remove output

8e84856

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Descriptive error

027fedf

Signed-off-by: Igor Gitman <igitman@nvidia.com>

More logs

4cba9e0

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Hardcode model name

0f5e795

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Change to sif

abf3663

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix

59a6efa

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix and retry

d2d0e81

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Tmp code for evals

f5e1b96

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into igitman/swe-bench-v2

c9ea228

Add eval

ffe5590

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Remove trajectories dir

4266903

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Evaluation type

71fc8ef

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Metrics

0eebb44

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix metrics

4e22208

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Correct to .json

1b0890f

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Correct to .json

4e706c3

Signed-off-by: Igor Gitman <igitman@nvidia.com>

More fixes

1e5062b

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Clean up logs

d5776de

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fixes

8832795

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix

f406307

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fixes

e067e5f

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Cleaning up

ffb4e14

Signed-off-by: Igor Gitman <igitman@nvidia.com>

More cleanups

07d249d

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Move PROMPT_CONFIG to generation args

5294019

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Kipok and others added 11 commits July 29, 2025 19:12

Rollback

a47d398

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add default config

702cdc6

Support more sampling parameters

b9b84c7

Update port/host

1ee04d5

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into igitman/swe-bench-v3

565ebd9

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Make async

0f97170

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix prepare.py

fb6138e

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update with proper async subprocess

54ba57c

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Fix evaluation issues caused by localhost not resolving to 127.0.0.1

1b045af

Support OpenHands for SWE-bench

0b02f25

Add max turns option

9497e5b

ludwig-n requested a review from Kipok August 14, 2025 18:19

Kipok and others added 11 commits August 14, 2025 11:35

Merge branch 'main' into ludwig-n/openhands

7d082c3

Fix prompt config

2069d93

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Merge branch 'main' into ludwig-n/openhands

e60c143

Cleanup

8f09956

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Set privileged through env var

2f74137

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Add log file path

ceeaac3

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Update docs

4c68c7c

Signed-off-by: Igor Gitman <igitman@nvidia.com>

Install python from conda-forge

e36bf03

Install everything from conda-forge

f1e109b

Rename swe-agent to swe_agent to fix hydra error

a83386a

Merge branch 'main' into ludwig-n/openhands

d1c6c7c

Kipok approved these changes Aug 15, 2025

Kipok merged commit c3750ce into main Aug 15, 2025
3 of 4 checks passed

wedu-nvidia pushed a commit that referenced this pull request Aug 22, 2025

Add SWE-bench inference & evaluation (#671)

77f160b

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

ludwig-n deleted the ludwig-n/openhands branch September 4, 2025 11:52

wasiahmad pushed a commit that referenced this pull request Oct 1, 2025

Add SWE-bench inference & evaluation (#671)

3022a9b

Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>

Kipok merged 72 commits into main from ludwig-n/openhands
