Evaluation on OJBench #848
Merged
60 commits
1d09bbf init commit for adding data (wasiahmad)
dc8a6eb init commit for adding data (wasiahmad)
c4d8084 init commit for adding data (wasiahmad)
7f03e2a init commit for adding data (wasiahmad)
4349f8d init commit for adding data (wasiahmad)
5ed5198 init commit for adding data (wasiahmad)
06e114f updating docker file (wasiahmad)
26efae6 updating docker file (wasiahmad)
2a94a89 updating docker file (wasiahmad)
45a5faa dataset prep update (wasiahmad)
a3c890d evaluation logic implemented (wasiahmad)
2dd18f9 fixing lcb-pro eval args (wasiahmad)
e2ea44b ojbench eval updated (wasiahmad)
36cd285 fixing syntax Error (wasiahmad)
fa86fb8 modified prepare.py (wasiahmad)
30862b8 removed comments (wasiahmad)
514b457 fixing RuntimeWarning: coroutine 'eval_ojbench_async.<locals>.install… (wasiahmad)
21de2c1 fixing minor bug (wasiahmad)
22cb58f fixing git url for pip install (wasiahmad)
b009ed5 fixing pip install issue (wasiahmad)
9a35a9e update dmoj version in sandbox (wasiahmad)
b2912ca updating num_workers (wasiahmad)
8c59638 Merge remote-tracking branch 'origin/main' into feat/ojbench (wasiahmad)
2abc050 adding subset_for_metrics for ojbench (wasiahmad)
42e13f2 minor fixes (wasiahmad)
096e325 moving ojbench to a separate script (wasiahmad)
fa606cc Merge branch 'main' into feat/ojbench (wasiahmad)
38cf70d fix typo (wasiahmad)
3d40dd6 fix typo (wasiahmad)
4705a38 use repr to pass filepath (wasiahmad)
5d87a9f use repr to pass filepath (wasiahmad)
ee01d3a debugging (wasiahmad)
04917a8 container fails to get paths (wasiahmad)
ee89703 logging filepaths (wasiahmad)
58d4170 support to add mount paths for sandbox (wasiahmad)
f1ef980 support to add mount paths for sandbox (wasiahmad)
c1391bc adding keep_mounts_for_sandbox for dataset init file (wasiahmad)
df9f475 fix: Prevent permission error in OJBench judger (wasiahmad)
24b413f fix: Prevent permission error in OJBench judger (wasiahmad)
bbc5eed fix: Prevent permission error in OJBench judger (wasiahmad)
79ea7fb fix: Prevent permission error in OJBench judger (wasiahmad)
789ba61 final fixes (wasiahmad)
a261851 init file is removed debugging (wasiahmad)
f647002 init file is removed fixed (wasiahmad)
7b07d0f Merge remote-tracking branch 'origin/main' into feat/ojbench (wasiahmad)
ec7ca31 splitting data into python/c++ (wasiahmad)
e19d7bf splitting data into python/c++ (wasiahmad)
de322f6 splitting data into python/c++ (wasiahmad)
807503e subset_for_metrics calculation is roll backed (wasiahmad)
9741f2d splitting data into python/c++ (wasiahmad)
1f492d8 added to docs (wasiahmad)
6e4fc58 replacing .jsonl with .json (wasiahmad)
d527518 Merge branch 'main' into feat/ojbench (wasiahmad)
730efbf updating eval docs (wasiahmad)
a2b86b6 addressing comments raised in PR (wasiahmad)
bff6d70 addressing comments raised in PR (wasiahmad)
3d310d2 addressing comments raised in PR (wasiahmad)
58037b5 addressed all comments (wasiahmad)
7709ed4 Merge branch 'main' into feat/ojbench (wasiahmad)
cc803a9 fixing a minor error (wasiahmad)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # settings that define how evaluation should be done by default (all can be changed from cmdline) | ||
| DATASET_GROUP = "code" | ||
| METRICS_TYPE = "ojbench" | ||
| EVAL_SPLIT = "test_python" | ||
| EVAL_ARGS = "++eval_type=ojbench" | ||
| REQUIRES_SANDBOX = True | ||
| KEEP_MOUNTS_FOR_SANDBOX = True | ||
| GENERATION_ARGS = "++prompt_config=generic/default" |
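
The constants above are per-dataset defaults that, per the comment in the file, can all be changed from the command line. This diff does not show how the framework reads them, so the following is only a minimal sketch of one plausible default-with-override pattern. `dataset_defaults` is a stand-in for the settings module, and `resolve_setting` is a hypothetical helper, not part of the project.

```python
from types import SimpleNamespace

# Stand-in for the settings module shown above; the values are copied from the diff.
dataset_defaults = SimpleNamespace(
    DATASET_GROUP="code",
    METRICS_TYPE="ojbench",
    EVAL_SPLIT="test_python",
    EVAL_ARGS="++eval_type=ojbench",
    REQUIRES_SANDBOX=True,
    KEEP_MOUNTS_FOR_SANDBOX=True,
    GENERATION_ARGS="++prompt_config=generic/default",
)


def resolve_setting(defaults, name, override=None, fallback=None):
    """Hypothetical helper: explicit override first, then the module default, then a fallback."""
    if override is not None:
        return override
    return getattr(defaults, name, fallback)


# With no override the dataset default is used; an explicit value wins.
default_split = resolve_setting(dataset_defaults, "EVAL_SPLIT")
cpp_split = resolve_setting(dataset_defaults, "EVAL_SPLIT", override="test_cpp")
```

The `fallback` argument covers datasets whose settings module omits an optional flag such as `KEEP_MOUNTS_FOR_SANDBOX`.
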
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,91 @@ | ||
| # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. | ||
|
||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| import json | ||
| import os | ||
| import shutil | ||
| import subprocess | ||
| import sys | ||
| from pathlib import Path | ||
|
|
||
| REPO_URL = "https://huggingface.co/datasets/He-Ren/OJBench_testdata" | ||
| HF_TOKEN = os.environ.get("HF_TOKEN") | ||
|
||
| if not HF_TOKEN: | ||
| print("❌ Error: Hugging Face token not found.", file=sys.stderr) | ||
| print(" Please set the HF_TOKEN environment variable with your access token.", file=sys.stderr) | ||
| print(" You can create a token at: https://huggingface.co/settings/tokens", file=sys.stderr) | ||
| sys.exit(1) | ||
|
|
||
|
|
||
| def clone_dataset_repo(url, destination): | ||
| if not shutil.which("git"): | ||
| print("❌ Error: Git executable not found. Please install Git.", file=sys.stderr) | ||
| sys.exit(1) | ||
|
|
||
| try: | ||
| if destination.exists() or destination.is_symlink(): | ||
| print(f"Destination '{destination}' already exists. Removing it...") | ||
| if destination.is_dir(): | ||
| shutil.rmtree(destination) | ||
| else: | ||
| destination.unlink() | ||
|
|
||
| auth_url = url.replace("https://huggingface.co/", f"https://user:{HF_TOKEN}@huggingface.co/", 1) | ||
|
||
| print(f"Cloning {url} into {destination}...") | ||
| subprocess.run(["git", "clone", auth_url, destination], check=True, capture_output=True) | ||
|
|
||
| print("✅ Git clone is successful.") | ||
|
|
||
| except subprocess.CalledProcessError as e: | ||
| print("❌ Git command failed:", file=sys.stderr) | ||
| cmd = [url if i == 2 else arg for i, arg in enumerate(e.cmd)] | ||
| print(f" Command: {' '.join(map(str, cmd))}", file=sys.stderr) | ||
| stderr = e.stderr.decode().strip() | ||
| stderr = stderr.replace(HF_TOKEN, "***") if HF_TOKEN else stderr | ||
| print(f" Stderr: {stderr}", file=sys.stderr) | ||
| sys.exit(1) | ||
|
||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| data_dir = Path(__file__).absolute().parent | ||
| data_dir.mkdir(exist_ok=True) | ||
| destination = data_dir / "OJBench_testdata" | ||
| clone_dataset_repo(REPO_URL, destination) | ||
|
|
||
| source_file = destination / "prompts" / "full.jsonl" | ||
| python_target_file = data_dir / "test_python.jsonl" | ||
| cpp_target_file = data_dir / "test_cpp.jsonl" | ||
|
|
||
| print(f"Processing '{source_file}' and splitting into Python and C++ subsets...") | ||
| processed_lines = 0 | ||
| try: | ||
| with ( | ||
| source_file.open("r", encoding="utf-8") as infile, | ||
| python_target_file.open("w", encoding="utf-8") as outfile_py, | ||
| cpp_target_file.open("w", encoding="utf-8") as outfile_cpp, | ||
| ): | ||
| for line in infile: | ||
| data = json.loads(line) | ||
| data["question"] = data.pop("prompt") | ||
| data["subset_for_metrics"] = data["difficulty"] | ||
| if data["language"] == "python": | ||
| outfile_py.write(json.dumps(data) + "\n") | ||
| elif data["language"] == "cpp": | ||
| outfile_cpp.write(json.dumps(data) + "\n") | ||
| processed_lines += 1 | ||
| print(f"✅ Successfully processed {processed_lines} lines.") | ||
|
|
||
| except (FileNotFoundError, json.JSONDecodeError, OSError) as e: | ||
| print(f"❌ Error during file processing: {e}", file=sys.stderr) | ||
| sys.exit(1) | ||
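
The per-record transformation in the loop above can be tried without cloning the dataset. The sketch below pulls it into a function that works on an in-memory list of JSON lines. The function name `split_by_language`, the `id` field, and the sample records are illustrative assumptions; the `prompt`, `difficulty`, and `language` keys come from the script.

```python
import json


def split_by_language(lines):
    """Rename 'prompt' to 'question', copy the difficulty tag, and bucket records by language."""
    buckets = {"python": [], "cpp": []}
    for line in lines:
        record = json.loads(line)
        record["question"] = record.pop("prompt")
        record["subset_for_metrics"] = record["difficulty"]
        # Records in any other language are dropped, as in the script's if/elif.
        if record["language"] in buckets:
            buckets[record["language"]].append(record)
    return buckets


# Made-up records that only mimic the fields the script touches.
sample_lines = [
    json.dumps({"id": 1, "prompt": "Add two numbers.", "difficulty": "easy", "language": "python"}),
    json.dumps({"id": 2, "prompt": "Add two numbers.", "difficulty": "easy", "language": "cpp"}),
]
buckets = split_by_language(sample_lines)
```

Returning the buckets instead of writing files directly keeps the transformation testable; the script's file handling can wrap it unchanged.
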
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,138 @@ | ||
| # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
|
|
||
| import asyncio | ||
| import json | ||
| import logging | ||
| import shlex | ||
| import textwrap | ||
| from contextlib import asynccontextmanager | ||
| from dataclasses import field | ||
| from pathlib import Path | ||
|
|
||
| from nemo_skills.code_execution.sandbox import get_sandbox | ||
| from nemo_skills.evaluation.evaluator.code import preprocess_code | ||
| from nemo_skills.utils import get_logger_name, nested_dataclass, unroll_files | ||
|
|
||
| LOG = logging.getLogger(get_logger_name(__file__)) | ||
|
|
||
|
|
||
| @nested_dataclass(kw_only=True) | ||
| class OJBenchConfig: | ||
| sandbox: dict = field(default_factory=lambda: {"sandbox_type": "local"}) | ||
| timeout: int = 6 | ||
|
|
||
|
|
||
| @asynccontextmanager | ||
| async def sandbox_context(config: dict): | ||
| sandbox = get_sandbox(**config) | ||
| try: | ||
| yield sandbox | ||
| finally: | ||
| LOG.info("Closing sandbox...") | ||
| await sandbox.close() | ||
|
|
||
|
|
||
| async def install_packages(eval_config: OJBenchConfig) -> None: | ||
| """Helper to install packages inside the sandbox.""" | ||
|
|
||
| async with sandbox_context(eval_config.sandbox) as sandbox: | ||
| LOG.info("Installing required packages for ojbench evaluation...") | ||
|
|
||
| clone_cmd = "git clone https://github.com/He-Ren/OJBench.git" | ||
| result, _ = await sandbox.execute_code(clone_cmd, language="shell", timeout=300) | ||
| if result["process_status"] != "completed": | ||
| stderr = result.get("stderr", "Unknown error") | ||
| raise RuntimeError(f"Failed to clone OJBench repo: {stderr}") | ||
|
|
||
| install_cmd = "pip install -e OJBench" | ||
| result, _ = await sandbox.execute_code(install_cmd, language="shell", timeout=300) | ||
| if result["process_status"] != "completed": | ||
| stderr = result.get("stderr", "Unknown error") | ||
| raise RuntimeError(f"Failed to install ojbench. Stderr: {stderr}") | ||
|
|
||
| LOG.info("Successfully installed ojbench.") | ||
|
|
||
|
|
||
| async def eval_ojbench_async(cfg): | ||
| eval_config = OJBenchConfig(**cfg.eval_config) | ||
| problem_dirs = [ | ||
| Path(cfg.data_dir, "ojbench/OJBench_testdata/NOI"), | ||
| Path(cfg.data_dir, "ojbench/OJBench_testdata/ICPC"), | ||
| ] | ||
|
|
||
| await install_packages(eval_config) | ||
|
|
||
| async with sandbox_context(eval_config.sandbox) as sandbox: | ||
| for jsonl_file_str in unroll_files(cfg.input_files): | ||
| jsonl_file = Path(jsonl_file_str) | ||
| with open(jsonl_file, encoding="utf-8") as f_in: | ||
| samples = [] | ||
| for line in f_in: | ||
| sample = json.loads(line) | ||
| sample = preprocess_code(sample, sample["language"], strip_whitespace=True) | ||
| sample["prompt"] = sample.pop("question") | ||
| sample["content"] = f"```{sample['language']}\n{sample['completion']}\n```" | ||
| sample.pop("completion") | ||
| samples.append(sample) | ||
|
|
||
| input_filename = jsonl_file.name.replace("output-", "eval-input-", 1) | ||
| eval_input_file = jsonl_file.with_name(input_filename) | ||
| results_filename = jsonl_file.name.replace("output-", "eval-results-", 1) | ||
| eval_results_file = jsonl_file.with_name(results_filename) | ||
|
|
||
| with open(eval_input_file, "w", encoding="utf-8") as f_out: | ||
| f_out.writelines(json.dumps(sample) + "\n" for sample in samples) | ||
|
||
|
|
||
| eval_code = textwrap.dedent(f""" | ||
| import ojbench | ||
| ojbench.init(problem_dirs={repr([str(p) for p in problem_dirs])}) | ||
| ojbench.judge_jsonl( | ||
| input_path={repr(str(eval_input_file))}, | ||
| output_path={repr(str(eval_results_file))}, | ||
| num_workers=16 | ||
| ) | ||
| """) | ||
|
|
||
| cmd = f'env -i PATH="/usr/local/bin:/usr/bin:/bin" python3 -c {shlex.quote(eval_code)}' | ||
| output, _ = await sandbox.execute_code( | ||
| cmd, | ||
| language="shell", | ||
| timeout=eval_config.timeout * len(samples) + 60, | ||
| max_output_characters=100_000, | ||
| ) | ||
|
|
||
| if output.get("process_status") != "completed": | ||
| raise RuntimeError(f"Evaluation failed for {jsonl_file}. Stderr: {output.get('stderr')}") | ||
|
|
||
| with open(eval_results_file, "rt", encoding="utf-8") as fin: | ||
| results = [json.loads(line) for line in fin] | ||
|
|
||
| if len(results) != len(samples): | ||
| LOG.error(f"Result count mismatch for {jsonl_file}: {len(results)} results vs {len(samples)} samples") | ||
| continue | ||
|
|
||
| for sample, result in zip(samples, results, strict=True): | ||
| sample["verdict"] = result["verdict"] | ||
| sample["is_passed"] = result["is_passed"] | ||
|
|
||
| with open(jsonl_file, "w", encoding="utf-8") as f: | ||
| for sample in samples: | ||
| f.write(json.dumps(sample) + "\n") | ||
|
|
||
|
|
||
| def eval_ojbench(cfg): | ||
| """Synchronous wrapper to run the async evaluation.""" | ||
| asyncio.run(eval_ojbench_async(cfg)) | ||
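
Three steps of the evaluator above can be checked without a sandbox or the `ojbench` package: deriving the sibling file names, building the quoted `python3 -c` command, and copying verdicts back onto the samples. The sketch below isolates them. The function names and the sample file name are hypothetical, and unlike the evaluator, which logs a count mismatch and skips the file, `merge_verdicts` raises instead.

```python
import shlex
import textwrap
from pathlib import Path


def derive_eval_paths(output_file):
    """Map output-*.jsonl to its sibling eval-input-* and eval-results-* files."""
    output_file = Path(output_file)
    input_name = output_file.name.replace("output-", "eval-input-", 1)
    results_name = output_file.name.replace("output-", "eval-results-", 1)
    return output_file.with_name(input_name), output_file.with_name(results_name)


def build_judge_command(problem_dirs, input_path, results_path, num_workers=16):
    """Build the shell command that runs the judging snippet with a minimal environment."""
    dirs = repr([str(p) for p in problem_dirs])
    # repr() turns each path into a valid Python string literal inside the snippet.
    code = textwrap.dedent(f"""
        import ojbench
        ojbench.init(problem_dirs={dirs})
        ojbench.judge_jsonl(
            input_path={str(input_path)!r},
            output_path={str(results_path)!r},
            num_workers={num_workers}
        )
    """)
    # shlex.quote keeps the multi-line snippet as a single shell argument.
    return f'env -i PATH="/usr/local/bin:/usr/bin:/bin" python3 -c {shlex.quote(code)}'


def merge_verdicts(samples, results):
    """Copy 'verdict' and 'is_passed' from each result onto the sample at the same position."""
    if len(samples) != len(results):
        raise ValueError(f"{len(results)} results vs {len(samples)} samples")
    for sample, result in zip(samples, results):
        sample["verdict"] = result["verdict"]
        sample["is_passed"] = result["is_passed"]
    return samples


eval_input, eval_results = derive_eval_paths("/tmp/run/output-rs0.jsonl")
command = build_judge_command(["/data/NOI", "/data/ICPC"], eval_input, eval_results)
```

Splitting the command back apart with `shlex.split` and compiling its last token is a cheap way to confirm the quoting survives a shell round trip before anything is sent to the sandbox.
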