Merged
72 commits
- 9020605 Add swe-bench dataset (Kipok, Jul 25, 2025)
- cdad007 Support for multiple sandbox containers (Kipok, Jul 25, 2025)
- 6a3f14a Initial implementation for swe-agnet (Kipok, Jul 25, 2025)
- c40a333 Switching to apptainer (Kipok, Jul 25, 2025)
- 9c3a432 Switch to mounted trajectories dir (Kipok, Jul 26, 2025)
- 7e268f7 Roll-back sandbox changes (Kipok, Jul 26, 2025)
- 8e84856 Remove output (Kipok, Jul 26, 2025)
- 027fedf Descriptive error (Kipok, Jul 26, 2025)
- 4cba9e0 More logs (Kipok, Jul 26, 2025)
- 0f5e795 Hardcode model name (Kipok, Jul 27, 2025)
- abf3663 Change to sif (Kipok, Jul 27, 2025)
- 59a6efa Fix (Kipok, Jul 27, 2025)
- d2d0e81 Fix and retry (Kipok, Jul 28, 2025)
- f5e1b96 Tmp code for evals (Kipok, Jul 28, 2025)
- c9ea228 Merge branch 'main' into igitman/swe-bench-v2 (Kipok, Jul 28, 2025)
- ffe5590 Add eval (Kipok, Jul 28, 2025)
- 4266903 Remove trajectories dir (Kipok, Jul 28, 2025)
- 71fc8ef Evaluation type (Kipok, Jul 28, 2025)
- 0eebb44 Metrics (Kipok, Jul 28, 2025)
- 4e22208 Fix metrics (Kipok, Jul 28, 2025)
- 1b0890f Correct to .json (Kipok, Jul 28, 2025)
- 4e706c3 Correct to .json (Kipok, Jul 28, 2025)
- 1e5062b More fixes (Kipok, Jul 28, 2025)
- d5776de Clean up logs (Kipok, Jul 28, 2025)
- 8832795 Fixes (Kipok, Jul 28, 2025)
- f406307 Fix (Kipok, Jul 28, 2025)
- e067e5f Fixes (Kipok, Jul 28, 2025)
- ffb4e14 Cleaning up (Kipok, Jul 28, 2025)
- 07d249d More cleanups (Kipok, Jul 28, 2025)
- 5294019 Move PROMPT_CONFIG to generation args (Kipok, Jul 28, 2025)
- aae62d0 More fixes (Kipok, Jul 28, 2025)
- 0c23ef9 More fixes (Kipok, Jul 29, 2025)
- 7c59e9a Fix typo (Kipok, Jul 29, 2025)
- 9be0726 Add timeout parameter, enable logs (Kipok, Jul 29, 2025)
- 579fd28 Stream subprocess logs (Kipok, Jul 29, 2025)
- 5509e8d Properly handle missing patches (Kipok, Jul 29, 2025)
- 3f7cfbd Fix (Kipok, Jul 29, 2025)
- 694a7aa Debugging (Kipok, Jul 29, 2025)
- 8f193c8 Fix asyn gen for tasks (Kipok, Jul 29, 2025)
- 82c69b1 Remove debugging (Kipok, Jul 29, 2025)
- 36710f1 Fix model name? (Kipok, Jul 29, 2025)
- 8e93059 Pass in model to client (Kipok, Jul 29, 2025)
- bf97e26 Add model arg (Kipok, Jul 29, 2025)
- d6cdaef More checks on timeout (Kipok, Jul 29, 2025)
- e61ed0c Fixes (Kipok, Jul 30, 2025)
- 78f30ba Change to server host/port from config (Kipok, Jul 30, 2025)
- 633f2b7 Change to server host/port from config (Kipok, Jul 30, 2025)
- 6bca7d3 Fix typo (Kipok, Jul 30, 2025)
- 4ece1d3 Debug (Kipok, Jul 30, 2025)
- 3ec1f63 Update localhost (Kipok, Jul 30, 2025)
- a47d398 Rollback (Kipok, Jul 30, 2025)
- 702cdc6 Add default config (ludwig-n, Jul 30, 2025)
- b9b84c7 Support more sampling parameters (ludwig-n, Jul 31, 2025)
- 1ee04d5 Update port/host (Kipok, Aug 2, 2025)
- 565ebd9 Merge branch 'main' into igitman/swe-bench-v3 (Kipok, Aug 2, 2025)
- 0f97170 Make async (Kipok, Aug 2, 2025)
- fb6138e Fix prepare.py (Kipok, Aug 2, 2025)
- 54ba57c Update with proper async subprocess (Kipok, Aug 2, 2025)
- 1b045af Fix evaluation issues caused by localhost not resolving to 127.0.0.1 (ludwig-n, Aug 12, 2025)
- 0b02f25 Support OpenHands for SWE-bench (ludwig-n, Aug 14, 2025)
- 9497e5b Add max turns option (ludwig-n, Aug 14, 2025)
- 7d082c3 Merge branch 'main' into ludwig-n/openhands (Kipok, Aug 14, 2025)
- 2069d93 Fix prompt config (Kipok, Aug 14, 2025)
- e60c143 Merge branch 'main' into ludwig-n/openhands (Kipok, Aug 14, 2025)
- 8f09956 Cleanup (Kipok, Aug 14, 2025)
- 2f74137 Set privileged through env var (Kipok, Aug 14, 2025)
- ceeaac3 Add log file path (Kipok, Aug 14, 2025)
- 4c68c7c Update docs (Kipok, Aug 14, 2025)
- e36bf03 Install python from conda-forge (ludwig-n, Aug 15, 2025)
- f1e109b Install everything from conda-forge (ludwig-n, Aug 15, 2025)
- a83386a Rename swe-agent to swe_agent to fix hydra error (ludwig-n, Aug 15, 2025)
- d1c6c7c Merge branch 'main' into ludwig-n/openhands (Kipok, Aug 15, 2025)
2 changes: 1 addition & 1 deletion README.md
@@ -12,7 +12,7 @@ Here are some of the features we support:
 - Evaluate your models on many popular benchmarks.
 - Math problem solving: hmmt_feb25, brumo25, aime24, aime25, omni-math (and many more)
 - Formal proofs in Lean: minif2f, proofnet
-- Coding skills: scicode, livecodebench, human-eval, mbpp
+- Coding skills: swe-bench, scicode, livecodebench, human-eval, mbpp
 - Chat/instruction following: ifbench, ifeval, arena-hard
 - General knowledge: mmlu, mmlu-pro, gpqa
 - Long context: ruler, mrcr
2 changes: 1 addition & 1 deletion dockerfiles/Dockerfile.nemo-skills
@@ -29,4 +29,4 @@ RUN mkdir -p /opt/NeMo-Skills/requirements
 COPY pyproject.toml README.md /opt/NeMo-Skills/
 COPY nemo_skills /opt/NeMo-Skills/nemo_skills/
 COPY requirements /opt/NeMo-Skills/requirements/
-RUN cd /opt/NeMo-Skills && pip install -e .[all]
+RUN cd /opt/NeMo-Skills && pip install -e .[all]
2 changes: 1 addition & 1 deletion docs/index.md
@@ -16,7 +16,7 @@ Here are some of the features we support:
 - Evaluate your models on many popular benchmarks.
 - Math problem solving: hmmt_feb25, brumo25, aime24, aime25, omni-math (and many more)
 - Formal proofs in Lean: minif2f, proofnet
-- Coding skills: scicode, livecodebench, human-eval, mbpp
+- Coding skills: swe-bench, scicode, livecodebench, human-eval, mbpp
 - Chat/instruction following: ifbench, ifeval, arena-hard
 - General knowledge: mmlu, mmlu-pro, gpqa
 - Long context: ruler
3 changes: 1 addition & 2 deletions docs/pipelines/evaluation.md
@@ -182,11 +182,10 @@ Inside [nemo_skills/dataset/gsm8k/\_\_init\_\_.py](https://github.com/NVIDIA/NeM
 
 ```python
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
 ```
 
 The prompt config and default generation arguments are passed to the
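The change above folds the old standalone `PROMPT_CONFIG` constant into a `++prompt_config` override inside `GENERATION_ARGS`. As a sketch of what a dataset `__init__.py` looks like after this PR (values mirror the gsm8k defaults shown in the diff; any other dataset would swap in its own strings):

```python
# Sketch of a dataset __init__.py following this PR's convention: the default
# prompt is selected via a ++prompt_config override inside GENERATION_ARGS
# rather than a separate PROMPT_CONFIG constant.

# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = 'math'
METRICS_TYPE = "math"
EVAL_ARGS = "++eval_type=math"
GENERATION_ARGS = "++prompt_config=generic/math"
```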
3 changes: 1 addition & 2 deletions nemo_skills/dataset/aime24/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = f"++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/aime25/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/algebra222/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/amc23/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
6 changes: 3 additions & 3 deletions nemo_skills/dataset/answer-judge/__init__.py
@@ -13,8 +13,8 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'judge/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "answer-judgement"
-EVAL_ARGS = "++eval_type=answer_judgement ++generation_key=judgement"
-GENERATION_ARGS = "++generation_key=judgement"
+# using judgement directly in metrics, no need for special evaluation
+EVAL_ARGS = "++eval_type=no-op ++generation_key=judgement"
+GENERATION_ARGS = "++prompt_config=judge/math ++generation_key=judgement"
5 changes: 2 additions & 3 deletions nemo_skills/dataset/arena-hard/__init__.py
@@ -14,11 +14,10 @@
 
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/default'
 DATASET_GROUP = 'chat'
 METRICS_TYPE = "arena"
-EVAL_ARGS = "++eval_type=arena"
-GENERATION_ARGS = ""
+EVAL_ARGS = "++eval_type=no-op"  # using judgement directly in metrics, no need for special evaluation
+GENERATION_ARGS = "++prompt_config=generic/default"
 
 JUDGE_PIPELINE_ARGS = {
     "generation_module": "nemo_skills.inference.eval.arena_judge",
3 changes: 1 addition & 2 deletions nemo_skills/dataset/asdiv/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
72 changes: 39 additions & 33 deletions nemo_skills/dataset/bfcl_v3/prepare.py
@@ -12,17 +12,27 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import subprocess
-import os
+import argparse
 import glob
-import tempfile
 import json
+import logging
+import os
 import shutil
-from nemo_skills.dataset.bfcl_v3.utils import func_doc_language_specific_pre_processing, convert_to_tool, is_multi_turn, load_file
+import subprocess
+import tempfile
 from pathlib import Path
-from nemo_skills.dataset.bfcl_v3.constants import DATA_FOLDER_PATH, MULTI_TURN_FUNC_DOC_PATH, MULTI_TURN_FUNC_DOC_FILE_MAPPING
-import argparse
-import logging
 
+from nemo_skills.dataset.bfcl_v3.constants import (
+    DATA_FOLDER_PATH,
+    MULTI_TURN_FUNC_DOC_FILE_MAPPING,
+    MULTI_TURN_FUNC_DOC_PATH,
+)
+from nemo_skills.dataset.bfcl_v3.utils import (
+    convert_to_tool,
+    func_doc_language_specific_pre_processing,
+    is_multi_turn,
+    load_file,
+)
 from nemo_skills.utils import get_logger_name
 
 LOG = logging.getLogger(get_logger_name(__file__))
@@ -34,7 +44,6 @@
 
 # Define the configuration as a dictionary
 DEFAULT_SETTINGS = """
-PROMPT_CONFIG = "null"
 DATASET_GROUP = "tool"
 METRICS_TYPE = "bfcl"
 EVAL_ARGS = "++eval_type=bfcl"
@@ -48,7 +57,7 @@ def process_multi_turn_test_case(instance, repo_root_dir):
    """
    Multi-turn test cases don't have the function doc in the prompt. We need to add them here.
    """
-    # Mark whether the instance is single-turn or multi-turn.
+    # Mark whether the instance is single-turn or multi-turn.
    # This is used to determine if the inference should be done in a single turn or multiple turns.
    if not is_multi_turn(instance["id"]):
        instance["single_turn"] = True
@@ -92,54 +101,54 @@ def process_file(repo_root_dir, input_file, output_file, model_type="llama-nemot
            test_category = instance["id"].rsplit("_", 1)[0]
            if idx == 0:
                LOG.info(f"Processing {test_category}")
 
            # TODO: Current preprocessing can be model dependent. This could be moved to inference time as well
            # Convert class-based method calls to function calls
            instance = process_multi_turn_test_case(instance, repo_root_dir)
 
            # Convert function calls to tools format and add them to the system prompt
            if "function" in instance:
                # Add the tools to the system prompt
                instance["function"] = func_doc_language_specific_pre_processing(instance["function"], test_category)
                instance["tools"] = convert_to_tool(instance["function"])
 
            f_out.write(json.dumps(instance) + "\n")
 
 
 def download_and_process_bfcl_data(repo_url, subfolder_path, output_dir, file_prefix="BFCL_v3", model_type="nemotron"):
    """
    Download JSON files from the BFCL GitHub repo via cloning
 
    Args:
        repo_url: GitHub repository URL
        subfolder_path: Path to the data subfolder in case of BFCL
        output_dir: Directory to save the processed JSONL files
        file_prefix: Only process files starting with this prefix
-        model_type: Formatting of functions and tools can be model dependent.
+        model_type: Formatting of functions and tools can be model dependent.
    """
    with tempfile.TemporaryDirectory() as temp_dir:
        try:
            # Clone repository with minimal depth
            print(f"Cloning repository {repo_url} to {temp_dir}")
-            subprocess.run([
-                "git", "clone", "--depth=1", repo_url, temp_dir
-            ], check=True, capture_output=True)
+            subprocess.run(["git", "clone", "--depth=1", repo_url, temp_dir], check=True, capture_output=True)
 
            # Find the target folder
            target_folder = Path(temp_dir) / subfolder_path
 
            if not os.path.exists(target_folder):
-                print(f"Folder {subfolder_path} not found in repository")
-                raise FileNotFoundError(f"Folder {subfolder_path} not found in {repo_url} cloned to {temp_dir}. The structure of BFCL has changed!")
+                raise FileNotFoundError(
+                    f"Folder {subfolder_path} not found in {repo_url} cloned to {temp_dir}. The structure of BFCL has changed!"
+                )
 
            # Find JSON files matching criteria
            json_pattern = os.path.join(target_folder, f"{file_prefix}*.json")
            json_files = glob.glob(json_pattern)
 
            print(f"Found {len(json_files)} JSON files matching pattern")
 
            if not os.path.exists(output_dir):
-                os.makedirs(output_dir)
+                os.makedirs(output_dir)
 
            processed_files = 0
            for input_file in json_files:
@@ -157,21 +166,21 @@
                # Copy the original json file to the split directory
                shutil.copy(input_file, os.path.join(split_dirname, filename))
                processed_files += 1
 
            print(f"Successfully processed {processed_files} JSON files to {output_dir}")
 
        except subprocess.CalledProcessError as e:
            print(f"Git command failed: {e}")
            print("Make sure git is installed and the repository URL is correct")
 
 
 def main(args):
-    LOG.warning("Currently processing according to the OpenAI model style which works for most models, including Qwen/Llama-Nemotron/DeepSeek.")
+    LOG.warning(
+        "Currently processing according to the OpenAI model style which works for most models, including Qwen/Llama-Nemotron/DeepSeek."
+    )
 
    download_and_process_bfcl_data(
-        REPO_URL, DATA_FOLDER_PATH,
-        output_dir=os.path.join(os.path.dirname(__file__)),
-        model_type=args.model_type
+        REPO_URL, DATA_FOLDER_PATH, output_dir=os.path.join(os.path.dirname(__file__)), model_type=args.model_type
    )
 
 
@@ -181,6 +190,3 @@ def main(args):
    args = parser.parse_args()
 
    main(args)
-
-
-
3 changes: 1 addition & 2 deletions nemo_skills/dataset/brumo25/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/college_math/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/comp-math-24-25/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/gaokao2023en/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/gpqa/__init__.py
@@ -15,9 +15,8 @@
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
 
-PROMPT_CONFIG = "eval/aai/mcq-4choices-boxed"
 DATASET_GROUP = "multichoice"
 METRICS_TYPE = "multichoice"
 EVAL_ARGS = "++eval_type=multichoice"
 EVAL_SPLIT = "diamond"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=eval/aai/mcq-4choices-boxed"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/gsm-plus/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/gsm8k/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/hle/__init__.py
@@ -13,11 +13,10 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/hle'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/hle"
 EVAL_SPLIT = "text"
 
 # Some answers are not possible to compare symbolically, so have to use a judge model
3 changes: 1 addition & 2 deletions nemo_skills/dataset/hmmt_feb25/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/math'
 DATASET_GROUP = 'math'
 METRICS_TYPE = "math"
 EVAL_ARGS = "++eval_type=math"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/math"
5 changes: 2 additions & 3 deletions nemo_skills/dataset/human-eval/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/codegen'
 DATASET_GROUP = 'code'
-METRICS_TYPE = "code"
+METRICS_TYPE = "evalplus"
 EVAL_ARGS = "++eval_type=evalplus ++eval_config.dataset=humaneval"
-GENERATION_ARGS = ""
+GENERATION_ARGS = "++prompt_config=generic/codegen"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/ifbench/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/default'
 DATASET_GROUP = 'chat'
 METRICS_TYPE = "if"
 EVAL_ARGS = "++eval_type=ifbench ++generation_key=response"
-GENERATION_ARGS = "++generation_key=response"
+GENERATION_ARGS = "++generation_key=response ++prompt_config=generic/default"
3 changes: 1 addition & 2 deletions nemo_skills/dataset/ifeval/__init__.py
@@ -13,8 +13,7 @@
 # limitations under the License.
 
 # settings that define how evaluation should be done by default (all can be changed from cmdline)
-PROMPT_CONFIG = 'generic/default'
 DATASET_GROUP = 'chat'
 METRICS_TYPE = "if"
 EVAL_ARGS = "++eval_type=if ++generation_key=response"
-GENERATION_ARGS = "++generation_key=response"
+GENERATION_ARGS = "++prompt_config=generic/default ++generation_key=response"
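Throughout these files, `EVAL_ARGS` and `GENERATION_ARGS` are space-separated Hydra-style `++key=value` override strings. A toy parser (not the pipeline's actual Hydra handling, where the `++` prefix means force-add) showing how such a string splits into individual overrides:

```python
def parse_overrides(args):
    """Split a space-separated '++key=value' string into a dict.

    Toy illustration only: real Hydra distinguishes +, ++, and plain
    overrides and supports a richer value grammar.
    """
    overrides = {}
    for token in args.split():
        # strip the '++' force-add prefix, then split on the first '='
        key, _, value = token.removeprefix("++").partition("=")
        overrides[key] = value
    return overrides
```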