SWE-bench: add Slurm tests & other improvements #920 (Merged)
Changes from all commits (15 commits):

- 33c43c5 Add options to customize eval harness repo & commit (ludwig-n)
- 21478af Use uploaded dataset for eval (ludwig-n)
- a75396b Add configurable interval between Apptainer retries (ludwig-n)
- 7a10ba0 Don't use ** in search paths for efficiency (ludwig-n)
- e1982f3 Fix comment (ludwig-n)
- 3059601 Add Slurm tests for SWE-bench (ludwig-n)
- 77d9b68 Add docs for new params (ludwig-n)
- a1c36f9 Add test to run_all.sh (ludwig-n)
- 8c1238c Rename test (ludwig-n)
- 6bfb04c Adjust metric ranges for new envs (ludwig-n)
- 1b19ab2 Update expected score for sample run in docs (ludwig-n)
- 1664717 Add agent framework name to expname (ludwig-n)
- 89b471f reuse_code=False in test (ludwig-n)
- d4310d7 Merge branch 'main' into ludwig-n/swe-bench-slurm-tests (ludwig-n)
- b7fb5a2 Put container_formatter in quotes (ludwig-n)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (56 additions, 0 deletions):
```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import sys
from pathlib import Path

sys.path.append(str(Path(__file__).resolve().parent.parent))  # for utils.py
from utils import assert_all, get_nested_value, load_json, soft_assert  # noqa: E402

METRIC_RANGES = {
    # +/- 3 pts from scores measured on 2025-10-08 (avg of 6 runs for OpenHands, 3 for SWE-agent)
    "openhands": {
        ("swe-bench", "pass@1", "issues_resolved"): (41.9, 47.9),
    },
    "swe_agent": {
        ("swe-bench", "pass@1", "issues_resolved"): (46.4, 52.4),
    },
}


def check_results(eval_dir: str, agent_framework: str):
    f = os.path.join(eval_dir, "eval-results", "swe-bench", "metrics.json")
    data = load_json(f)
    for category_tuple, expected_range in METRIC_RANGES[agent_framework].items():
        val = float(get_nested_value(data, category_tuple))
        lo, hi = expected_range
        soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--workspace", required=True, help="Workspace directory containing eval results")
    ap.add_argument("--agent_framework", required=True, help="Agent framework used for the run")
    args = ap.parse_args()

    check_results(args.workspace, args.agent_framework)

    assert_all()


if __name__ == "__main__":
    main()
```
Second new file (90 additions):

```python
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse

from nemo_skills.pipeline.cli import eval, prepare_data, run_cmd, wrap_arguments


def eval_qwen3coder(workspace, cluster, expname_prefix, wandb_project, agent_framework):
    eval(
        ctx=wrap_arguments(
            f"++agent_framework={agent_framework} "
            f"++inference.temperature=0.7 "
            f"++inference.top_p=0.8 "
            f"++inference.top_k=20 "
        ),
        cluster=cluster,
        model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
        server_type="vllm",
        server_args="--enable-auto-tool-choice --tool-call-parser qwen3_coder",
        server_nodes=1,
        server_gpus=8,
        benchmarks="swe-bench",
        num_chunks=8,
        dependent_jobs=2,  # automatically rerun 2 times because it's common for some instances to fail
        reuse_code=False,  # otherwise the second run (swe-agent) tries to read the config file from the absolute cluster path and fails
        output_dir=workspace,
        expname=expname_prefix,
        wandb_project=wandb_project,
        wandb_name=expname_prefix,
    )


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--workspace", required=True, help="Workspace directory containing all experiment data")
    parser.add_argument("--cluster", required=True, help="Cluster name, e.g. oci")
    parser.add_argument("--expname_prefix", required=True, help="Experiment name prefix")
    parser.add_argument("--wandb_project", default="nemo-skills-slurm-ci", help="W&B project name")
    parser.add_argument("--container_formatter", default=None, help="Container formatter for SWE-bench")

    args = parser.parse_args()

    if args.container_formatter is None:
        prepare_data_args = "swe-bench"
    else:
        prepare_data_args = f"swe-bench --container_formatter {args.container_formatter}"
    prepare_data(ctx=wrap_arguments(prepare_data_args))

    for agent_framework in ["openhands", "swe_agent"]:
        workspace = f"{args.workspace}/{agent_framework}"
        expname_prefix = f"{args.expname_prefix}_{agent_framework}"

        eval_qwen3coder(
            workspace=workspace,
            cluster=args.cluster,
            expname_prefix=expname_prefix,
            wandb_project=args.wandb_project,
            agent_framework=agent_framework,
        )

        # schedule a dependent check job on the cluster and check if the results are as expected
        checker_cmd = (
            f"python tests/slurm-tests/qwen3coder_30b_swebench/check_results.py "
            f" --workspace {workspace} "
            f" --agent_framework {agent_framework} "
        )

        run_cmd(
            ctx=wrap_arguments(checker_cmd),
            cluster=args.cluster,
            expname=f"{expname_prefix}-check-results",
            log_dir=f"{workspace}/check-results-logs",
            run_after=expname_prefix,
        )


if __name__ == "__main__":
    main()
```

Review thread on the `reuse_code=False` line:

- Collaborator: Could you share an error please? This shouldn't be happening.
- Collaborator: Merging for now, but we should address this in another PR.
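Commit b7fb5a2 ("Put container_formatter in quotes") points at a common hazard when splicing a user-supplied value into a shell command string: unquoted braces, spaces, or other metacharacters can be mangled by the shell. A generic sketch of the quoting idea using `shlex.quote` (illustrative only, with a hypothetical helper name; this is not necessarily how the PR fixed it):

```python
import shlex


def build_prepare_args(container_formatter=None):
    # Hypothetical helper: build the prepare_data argument string, quoting the
    # formatter so shell metacharacters ({}, spaces, etc.) survive intact.
    if container_formatter is None:
        return "swe-bench"
    return f"swe-bench --container_formatter {shlex.quote(container_formatter)}"
```

Round-tripping through `shlex.split` confirms the value arrives as a single argument even when it contains `{instance_id}`-style placeholders.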
Review comment: Handle missing or non-numeric metrics without crashing

`get_nested_value` returns `None` when the metric is absent (e.g., if evaluation failed), so `float(None)` raises and the script aborts before calling `assert_all()`. We should soft-assert in that case to report a clear failure instead of throwing a traceback.
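The committable suggestion attached to this comment is not shown in the page. A hedged sketch of the kind of guard being proposed, written as a standalone function so the validation logic is testable (names and shape are assumptions, not the reviewer's actual patch):

```python
def check_metric(data, keys, lo, hi):
    """Return an error message if the metric at `keys` is missing, non-numeric,
    or out of [lo, hi]; return None when the metric passes."""
    raw = data
    for key in keys:
        # Walk the nested dict manually; bail out with a clear message if absent.
        if not isinstance(raw, dict) or key not in raw:
            return f"metric {keys} missing from metrics.json"
        raw = raw[key]
    try:
        val = float(raw)
    except (TypeError, ValueError):
        return f"metric {keys} is not numeric: {raw!r}"
    if not lo <= val <= hi:
        return f"{keys}={val} out of range [{lo},{hi}]"
    return None
```

The caller would pass any non-None message to `soft_assert(False, message)`, so a missing metric is reported like any other failed check instead of crashing the script.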
ludwig-n: Doesn't really matter here; we're only checking one metric value, so it can only fail once regardless. Also, we don't bother with this in the other check_results.py scripts.
@ludwig-n, that's fair. Given there's only one metric being checked and the existing pattern in other check_results.py scripts, the additional error handling isn't necessary. I appreciate the clarification!