
SWE-bench: add Slurm tests & other improvements#920

Merged
Kipok merged 15 commits into main from ludwig-n/swe-bench-slurm-tests
Oct 9, 2025

Conversation

@ludwig-n (Collaborator) commented Oct 9, 2025

This PR adds Slurm tests for SWE-bench. Specifically, the Qwen3-Coder 30B model is evaluated with OpenHands and SWE-agent.

It also adds the following features & fixes:

  1. When retrying an Apptainer command, there is now an interval of 1-3 minutes between retries. This is because it's common for instances to fail due to network issues (e.g. failing to download Miniforge), and when the retry happens immediately, it usually fails with the same problem. When the job waits before retrying, it is much more likely to succeed.
    • The retry interval is selected randomly between the new options min_retry_interval and max_retry_interval. The randomness is an attempt to spread out the retries to ease the load on the network. You can also configure the number of retries.
  2. SWE-bench evaluation now uses the uploaded dataset instead of redownloading it from HF. This is more efficient and also prevents HF rate limiting issues.
  3. You can now configure the repo URL and commit for the eval harness the same way you can already do it for OpenHands/SWE-agent. The default behavior does not change.
  4. Output search paths no longer use the globstar (**). When running on large datasets, it can be very slow, because the output folders have a lot of files and the glob has to traverse every subfolder recursively.
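The randomized retry described in point 1 can be sketched as follows. This is a minimal illustration, not the actual implementation in `nemo_skills/inference/eval/swebench.py`; the function name `run_with_retries` and the injectable `sleep` parameter are hypothetical, and in the real code the retry count and interval bounds come from the config (`max_retries`, `min_retry_interval`, `max_retry_interval`).

```python
import random
import subprocess
import time


def run_with_retries(command, max_retries=3, min_retry_interval=60,
                     max_retry_interval=180, sleep=time.sleep):
    """Run a shell command, waiting a random interval between failed attempts."""
    for attempt in range(max_retries + 1):
        result = subprocess.run(command, shell=True)
        if result.returncode == 0:
            return result
        if attempt < max_retries:
            # Randomize the wait to spread retries out across jobs and ease
            # the load on the network after a shared outage (e.g. a failed
            # Miniforge download hitting many instances at once).
            sleep(random.uniform(min_retry_interval, max_retry_interval))
    raise RuntimeError(f"command failed after {max_retries + 1} attempts: {command}")
```

Injecting `sleep` keeps the sketch testable; the point is that an immediate retry tends to hit the same transient network failure, while a randomized 1-3 minute wait lets the issue clear and staggers the load.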
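The globstar issue in point 4 is easy to demonstrate. The directory layout below is hypothetical, chosen only to show the difference between a recursive `**` pattern (which walks every subfolder) and a fixed-depth pattern (which lists only one directory level):

```python
import os
import tempfile
from glob import glob

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "instance_1", "logs"))
open(os.path.join(root, "instance_1", "report.json"), "w").close()
open(os.path.join(root, "instance_1", "logs", "extra.txt"), "w").close()

# Globstar: recursively traverses every subfolder under root, which is
# slow when output folders contain many files.
deep = glob(os.path.join(root, "**", "report.json"), recursive=True)

# Fixed-depth pattern: only one directory level is listed.
flat = glob(os.path.join(root, "*", "report.json"))

assert deep == flat  # same matches here, but flat avoids the recursive walk
```

When the sought files always sit at a known depth, the fixed-depth pattern returns the same matches without paying for the recursive traversal.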

Summary by CodeRabbit

  • New Features

    • Configurable retry behavior with min/max intervals and updated defaults.
    • Ability to specify evaluation harness repository and commit.
    • More resilient evaluation runs with randomized backoff and clearer logs.
  • Documentation

    • Refined evaluation parameter descriptions and checkout semantics.
    • Added guidance for SWE-bench tests, image mounts, and container options.
    • Updated sample result metrics.
  • Tests

    • New Slurm workflow for Qwen3-Coder 30B with automated result validation.
    • Test runner extended to include the new workflow.

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
coderabbitai bot commented Oct 9, 2025

Walkthrough

Adds configurable retry logic and harness repo/commit options to SWE-bench evaluation code, updates documentation accordingly, and introduces Slurm test orchestration for Qwen3-Coder 30B with metric checks. Also updates Slurm README and run_all.sh to include the new test.

Changes

Cohort / File(s) — Summary
  • SWE-bench evaluation core (nemo_skills/inference/eval/swebench.py): Adds eval_harness_repo/commit and retry settings to config; refactors command execution to use cfg-driven retries with randomized backoff; updates clone/checkout commands and path patterns; adjusts method signature to remove per-call max_retries.
  • Documentation updates (docs/evaluation/code.md): Documents new/changed parameters (agent framework commit semantics, eval_harness_repo/commit, retry settings); updates sample metric values.
  • Slurm tests: Qwen3-Coder 30B suite (tests/slurm-tests/qwen3coder_30b_swebench/check_results.py, tests/slurm-tests/qwen3coder_30b_swebench/run_test.py): Adds a metric range checker and a Slurm orchestration script running SWE-bench for Qwen3-Coder 30B across OpenHands and SWE-agent, with dependent result checks.
  • Slurm tests: infra/docs (tests/slurm-tests/README.md): Notes the default SWE-bench images path, mount requirements, and container formatter guidance.
  • Slurm tests: orchestrator (tests/slurm-tests/run_all.sh): Invokes the new qwen3coder_30b_swebench test, adding a delay before launch and integrating it into the existing run sequence.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Task as SweBenchGenerationTask
  participant C as Container
  participant Git as Git Repos

  rect rgba(230,245,255,0.5)
    note over Task: Setup repos per cfg
    Task->>Git: clone SWE-bench / agent / eval harness (cfg repos)
    Task->>Git: checkout cfg commits/tags/branches
  end

  loop For each command
    rect rgba(240,255,240,0.6)
      note over Task: Retry with randomized backoff
      Task->>C: Execute command (timeout=T)
      alt Success
        C-->>Task: Expected file(s) match pattern
      else Failure or Missing output
        Task->>Task: Sleep U[min_retry_interval, max_retry_interval]
        note over Task: Retry until cfg.max_retries exhausted
      end
    end
  end

  note over Task: Proceed to evaluation using cfg harness and paths
sequenceDiagram
  autonumber
  participant CLI as run_test.py
  participant Prep as prepare_data
  participant Slurm as Slurm Scheduler
  participant Eval as eval (per framework)
  participant Check as check_results.py

  CLI->>Prep: prepare_data (container_formatter)
  loop For agent_framework in [openhands, swe_agent]
    CLI->>Slurm: submit Eval job (expname_prefix, model/server args)
    Slurm-->>Eval: run
    CLI->>Slurm: submit Check job (after: Eval)
    Slurm-->>Check: run (loads metrics.json, assert ranges)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A hop, a skip, retries in tow,
I nibble logs where failures grow.
New harness paths, a gentle nudge,
Slurm queues hum—results won’t budge.
Qwen’s code sings, benchmarks gleam,
I thump: “All green!”—a rabbit’s dream. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title succinctly highlights the addition of Slurm tests for SWE-bench and acknowledges miscellaneous improvements, clearly reflecting the PR's primary purpose and scope.
  • Docstring Coverage — ✅ Passed: no functions found in the changes; docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4310d7 and b7fb5a2.

📒 Files selected for processing (1)
  • tests/slurm-tests/run_all.sh (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (1)
tests/slurm-tests/run_all.sh (1)

15-15: Container placeholder preserved

Quoting --container_formatter keeps {instance_id} intact, so each SWE-bench job resolves the proper image path. Nice catch.



coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (2)

36-36: Clarify the comment about dependent jobs.

The comment states "automatically rerun 2 times," but dependent_jobs=2 schedules 2 dependent jobs, resulting in 3 total runs (1 original + 2 dependents). Consider revising the comment to: "automatically schedule 2 dependent reruns (3 total runs)."


37-37: Update comment for clarity.

The comment mentions "second run (swe-agent)" but this function is invoked for both openhands and swe_agent frameworks. The described issue with reading config files from absolute cluster paths applies to both frameworks, not specifically the second framework.

Apply this diff:

-        reuse_code=False,  # otherwise the second run (swe-agent) tries to read the config file from the absolute cluster path and fails
+        reuse_code=False,  # otherwise subsequent runs try to read the config file from the absolute cluster path and fail
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e5877c and d4310d7.

📒 Files selected for processing (6)
  • docs/evaluation/code.md (3 hunks)
  • nemo_skills/inference/eval/swebench.py (10 hunks)
  • tests/slurm-tests/README.md (1 hunks)
  • tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (1 hunks)
  • tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1 hunks)
  • tests/slurm-tests/run_all.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (2)
tests/slurm-tests/utils.py (4)
  • assert_all (46-58)
  • get_nested_value (24-30)
  • load_json (18-21)
  • soft_assert (36-43)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)
  • main (45-86)
nemo_skills/inference/eval/swebench.py (1)
nemo_skills/inference/chat_interface/core.py (1)
  • cfg (181-182)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (3)
nemo_skills/pipeline/run_cmd.py (1)
  • run_cmd (40-211)
nemo_skills/pipeline/cli.py (1)
  • wrap_arguments (45-54)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (1)
  • main (44-52)
🪛 Ruff (0.13.3)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py

22-22: Unused noqa directive (non-enabled: E402)

Remove unused noqa directive

(RUF100)

nemo_skills/inference/eval/swebench.py

253-253: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


264-266: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🪛 Shellcheck (0.11.0)
tests/slurm-tests/run_all.sh

[warning] 15-15: This { is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)


[warning] 15-15: This } is literal. Check expression (missing ;/\n?) or quote it.

(SC1083)

🔇 Additional comments (1)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)

85-85: To inspect how add_task handles run_after, run:

#!/bin/bash
rg -nP "def add_task" -C3 nemo_skills/pipeline

Comment on lines +38 to +41
for category_tuple, expected_range in METRIC_RANGES[agent_framework].items():
    val = float(get_nested_value(data, category_tuple))
    lo, hi = expected_range
    soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
coderabbitai bot commented Oct 9, 2025

⚠️ Potential issue | 🟠 Major

Handle missing or non-numeric metrics without crashing

get_nested_value returns None when the metric is absent (e.g., evaluation failed), so float(None) raises and the script aborts before calling assert_all(). We should soft-assert in that case to report a clear failure instead of throwing a traceback.

-        val = float(get_nested_value(data, category_tuple))
-        lo, hi = expected_range
-        soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
+        raw_val = get_nested_value(data, category_tuple)
+        if raw_val is None:
+            soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple} missing")
+            continue
+        try:
+            val = float(raw_val)
+        except (TypeError, ValueError):
+            soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple}={raw_val} not numeric")
+            continue
+        lo, hi = expected_range
+        soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")

ludwig-n (Author) replied:
doesn't really matter here, we're only checking one metric value, so it can only fail once regardless.

also we don't bother with this in the other check_results.py scripts...

coderabbitai bot replied:

@ludwig-n, that's fair. Given there's only one metric being checked and the existing pattern in other check_results.py scripts, the additional error handling isn't necessary. I appreciate the clarification!

Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
@ludwig-n ludwig-n requested a review from Kipok October 9, 2025 17:01
benchmarks="swe-bench",
num_chunks=8,
dependent_jobs=2, # automatically rerun 2 times because it's common for some instances to fail
reuse_code=False, # otherwise the second run (swe-agent) tries to read the config file from the absolute cluster path and fails
A collaborator commented:

could you share an error please? This shouldn't be happening

The collaborator replied:

Merging for now, but we should address this in another PR

@Kipok Kipok merged commit 6ba7655 into main Oct 9, 2025
7 checks passed
@Kipok Kipok deleted the ludwig-n/swe-bench-slurm-tests branch October 9, 2025 21:45
Glorf pushed a commit that referenced this pull request Oct 10, 2025
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>