Conversation

@lewtun (Member) commented on Apr 7, 2025

This PR bumps lighteval to give us access to a new pass@1 metric that estimates accuracy from N samples per prompt.

Across 10 different seeds, I found that N=16 gave the best trade-off between variance and compute cost: huggingface/lighteval#661
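For reference, my understanding (an assumption based on the linked lighteval PR, not spelled out here) is that this is the standard unbiased pass@k estimator from the Codex paper, which for k=1 reduces to the fraction of correct samples: with n samples per prompt, of which c are correct,

$$\widehat{\text{pass@}k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}, \qquad \widehat{\text{pass@}1} = \frac{c}{n},$$

averaged over prompts. Drawing n=16 completions per prompt reduces the variance of this estimate relative to scoring a single completion.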

I am running the evals to sanity-check that nothing has changed dramatically, but the code should be ready for review.

TODO

  • Run all AIME evals as a sanity check, using the setup below

```diff
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
```
@lewtun (Member, Author) commented on the added max_num_batched_tokens setting:

This is now needed to avoid lighteval errors.
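For context, MODEL_ARGS feeds into a launch command along these lines (a sketch following the open-r1 README conventions; the task spec and output path are assumed placeholders, not taken from this PR):

```shell
# Hypothetical invocation sketch: evaluate AIME 2024 with lighteval's vLLM backend.
# "lighteval|aime24|0|0" and OUTPUT_DIR are assumptions; adjust to the tasks you actually run.
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm "$MODEL_ARGS" "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```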

echo "Running lighteval script ..."
echo "Eval results will be saved to $OUTPUT_DIR"
# Check if "custom" is a substring of TASKS
if [[ $TASKS == *"custom"* ]]; then
@lewtun (Member, Author) commented on this check:

We no longer have custom evals, so this can be removed.

@lewtun requested a review from @edbeeching on April 8, 2025 at 08:06
@lewtun merged commit bf08f56 into main on April 8, 2025 (1 check passed)
@lewtun deleted the lighteval-pass@1 branch on April 8, 2025 at 18:53