Conversation

@lewtun (Member) commented on Apr 7, 2025

This PR bumps lighteval to give us access to a new pass@1 metric that estimates accuracy from N samples per prompt.

Across 10 different seeds, I found that N=16 gave the best trade-off between variance and compute cost: huggingface/lighteval#661
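For reference, my understanding (an assumption based on the linked lighteval PR, not spelled out here) is that this is the standard unbiased pass@k estimator from the Codex paper, which for k=1 reduces to the fraction of correct samples: with n samples per prompt, of which c are correct,

$$\widehat{\text{pass@}k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}, \qquad \widehat{\text{pass@}1} = \frac{c}{n},$$

averaged over prompts. Drawing n=16 completions per prompt reduces the variance of this estimate relative to scoring a single completion.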

I am running the evals to sanity-check that nothing has changed dramatically, but the code should be ready for review.

TODO

  • Run all AIME evals as a sanity check, using the setup below

```diff
 MODEL=deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
-MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
+MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
```
@lewtun (Member, Author) commented on the added max_num_batched_tokens setting:

This is now needed to avoid lighteval errors.
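For context, MODEL_ARGS feeds into a launch command along these lines (a sketch following the open-r1 README conventions; the task spec and output path are assumed placeholders, not taken from this PR):

```shell
# Hypothetical invocation sketch: evaluate AIME 2024 with lighteval's vLLM backend.
# "lighteval|aime24|0|0" and OUTPUT_DIR are assumptions; adjust to the tasks you actually run.
OUTPUT_DIR=data/evals/$MODEL

lighteval vllm "$MODEL_ARGS" "lighteval|aime24|0|0" \
    --use-chat-template \
    --output-dir "$OUTPUT_DIR"
```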

echo "Running lighteval script ..."
echo "Eval results will be saved to $OUTPUT_DIR"
# Check if "custom" is a substring of TASKS
if [[ $TASKS == *"custom"* ]]; then
@lewtun (Member, Author) commented on this check:

We no longer have custom evals, so this can be removed.

@lewtun requested a review from @edbeeching on April 8, 2025 at 08:06
@lewtun merged commit bf08f56 into main on April 8, 2025 (1 check passed)
@lewtun deleted the lighteval-pass@1 branch on April 8, 2025 at 18:53