Expose a few model predictions / gold answers in the logs #164

For generative benchmarks like MATH / GSM8K / IFEval, it would be great to have some visibility in the logs into how the prompts are formatted, what the generations look like, what the gold answer is, etc.

Currently, the best approach I've found is to first run the benchmark with `--max_samples` and then manually inspect the details Parquet file. However, this is rather cumbersome, especially when launching many evals in parallel :)

Perhaps we can store the first N examples in the logs?
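
To make the idea concrete, here is a minimal sketch of what logging the first N examples could look like. It is not tied to the library's actual internals: the function name `log_first_n_samples`, the logger name `eval_samples`, and the sample keys `prompt`, `prediction`, and `gold` are all hypothetical, and the demo sample is made up.

```python
import logging
from itertools import islice
from typing import Iterable, List, Mapping

# Hypothetical logger name; a real integration would reuse the project's logger.
logger = logging.getLogger("eval_samples")


def _truncate(text: str, max_chars: int) -> str:
    """Shorten long prompts/generations so the log stays readable."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "...[truncated]"


def log_first_n_samples(
    samples: Iterable[Mapping[str, str]], n: int = 3, max_chars: int = 500
) -> List[str]:
    """Log the first n samples and return the formatted entries.

    Each sample is assumed to be a mapping with the hypothetical keys
    "prompt", "prediction" and "gold".
    """
    entries = []
    # islice stops after n items, so the remaining samples are never touched.
    for index, sample in enumerate(islice(samples, n)):
        entry = (
            f"--- sample {index} ---\n"
            f"PROMPT:     {_truncate(sample['prompt'], max_chars)}\n"
            f"PREDICTION: {_truncate(sample['prediction'], max_chars)}\n"
            f"GOLD:       {_truncate(sample['gold'], max_chars)}"
        )
        logger.info(entry)
        entries.append(entry)
    return entries


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Made-up sample, for illustration only.
    demo = [{"prompt": "What is 2 + 2?", "prediction": "4", "gold": "4"}]
    log_first_n_samples(demo, n=1)
```

Truncating each field keeps long few-shot prompts from flooding the logs, and capping at N keeps the output manageable when many evals run in parallel.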
