# PyPy3 execution support for LiveCodeBench evaluation (#614)
- Benchmark is defined in [`nemo_skills/dataset/livecodebench/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/livecodebench/__init__.py)
- Original benchmark source is [here](https://github.com/LiveCodeBench/LiveCodeBench).

#### Data Preparation

First, prepare the dataset by running the `ns prepare_data` command. The arguments below will generate `test_v6_2408_2505.jsonl`.

```bash
ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05
```
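As a quick sanity check after preparation, you can count the entries in the generated split. The snippet below is a minimal sketch, not part of the NeMo-Skills CLI; it writes a stand-in file for the demo, but in practice you would point `path` at `<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl`.

```python
import json
import tempfile

def count_entries(path):
    """Count non-empty JSON lines in a prepared .jsonl split."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())

# Demo on a stand-in file; replace `path` with the real split location.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for i in range(3):
        f.write(json.dumps({"task_id": f"demo_{i}"}) + "\n")
    path = f.name

print(count_entries(path))  # → 3
```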

##### For PyPy3 Evaluation

If you plan to evaluate using the PyPy3 interpreter, you must add the `--keep_all_columns` flag during data preparation. This downloads a larger dataset (~1.9 GB) containing the necessary test cases, so we recommend downloading it directly to a Slurm cluster location.

```bash
ns prepare_data livecodebench --release_version v6 --start_date 2024-08 --end_date 2025-05 --keep_all_columns --cluster=<CLUSTER_NAME> --data_dir=<DATA_DIR>
```

#### Running the Evaluation

Once the data is prepared, you can run the evaluation. Replace the `<...>` placeholders with your cluster and directory paths.

##### Standard Python Evaluation

This command runs an evaluation of [OpenReasoning-Nemotron-32B](https://huggingface.co/nvidia/OpenReasoning-Nemotron-32B) on a Slurm cluster.

```bash
ns eval \
    --cluster=<CLUSTER_NAME> \
    --model=nvidia/OpenReasoning-Nemotron-32B \
    --server_type=vllm \
    --server_args="--async-scheduling" \
    --server_nodes=1 \
    --server_gpus=8 \
    --benchmarks=livecodebench \
    --split=test_v6_2408_2505 \
    --data_dir=<DATA_DIR> \
    --output_dir=<OUTPUT_DIR> \
    --extra_eval_args="++eval_config.interpreter=python" \
    --with_sandbox \
    ++inference.temperature=0.6 \
    ++inference.top_p=0.95 \
    ++inference.tokens_to_generate=65536
```

##### PyPy3 Evaluation

To run with the PyPy3 interpreter, modify the `--extra_eval_args` flag as shown below.

```bash
--extra_eval_args="++eval_config.interpreter=pypy3 ++eval_config.test_file=<DATA_DIR>/livecodebench/test_v6_2408_2505.jsonl"
```
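Conceptually, switching the interpreter means running the same candidate source under a different binary. The sketch below is a hypothetical illustration of that idea, not NeMo-Skills' actual sandbox code; it defaults to the current Python so it runs anywhere, with `pypy3` as a drop-in value on machines where PyPy is installed.

```python
import subprocess
import sys

def run_solution(source: str, stdin_data: str,
                 interpreter: str = sys.executable,
                 timeout: float = 10.0) -> str:
    """Run candidate code under a chosen interpreter binary and capture stdout.

    `interpreter` could be "pypy3" where PyPy is installed; we default to
    the current Python interpreter for portability.
    """
    result = subprocess.run(
        [interpreter, "-c", source],
        input=stdin_data,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout

# A toy stdin/stdout problem: double the input number.
solution = "print(int(input()) * 2)"
print(run_solution(solution, "21").strip())  # → 42
```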

##### Verifying Results

After all jobs are complete, you can check the results in `<OUTPUT_DIR>/eval-results/livecodebench/metrics.json`. You can also take a look at `<OUTPUT_DIR>/eval-results/livecodebench/summarized-results/main_*`. They should look something like this:

```text
-------------------------- livecodebench --------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
pass@1          | 454         | 15995      | 2188        | 71.15%


------------------------ livecodebench-easy -----------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
pass@1          | 110         | 5338       | 1806        | 99.09%


------------------------ livecodebench-hard -----------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
pass@1          | 203         | 23031      | 2188        | 46.31%


----------------------- livecodebench-medium ----------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy
pass@1          | 141         | 14178      | 1889        | 85.11%
```
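For background, `pass@k` metrics like the `pass@1` rows above are commonly computed with the unbiased estimator popularized by the HumanEval paper. The helper below is an illustrative sketch of that formula, not necessarily the exact code path NeMo-Skills uses.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples per problem, c of them correct."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some correct sample is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem (n=1), pass@1 is just whether it passed:
print(pass_at_k(1, 1, 1))  # → 1.0
print(pass_at_k(1, 0, 1))  # → 0.0
# With several samples, e.g. 2 correct out of 5, pass@1 reduces to c/n:
print(pass_at_k(5, 2, 1))  # → 0.4
```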

##### Advanced: Averaging Multiple Runs

Due to variance between runs, you can automatically repeat the evaluation and average the results. To run the evaluation 3 times, for example, set the `--benchmarks` flag as follows:

```bash
--benchmarks=livecodebench:3
```
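Conceptually, the aggregation is just a mean over per-run metrics; the pipeline does this for you, and the sketch below only illustrates the arithmetic with hypothetical per-run accuracies.

```python
# Hypothetical accuracies (percent) from three repeated evaluation runs.
run_accuracies = [71.15, 70.48, 71.60]

mean_accuracy = sum(run_accuracies) / len(run_accuracies)
print(f"{mean_accuracy:.2f}%")  # → 71.08%
```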

### livecodebench-pro

- Benchmark is defined in [`nemo_skills/dataset/livecodebench-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/livecodebench-pro/__init__.py)