diff --git a/README.md b/README.md
index ba5f698b8..d5200f557 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ Documentation
-
+Open Benchmark Index

@@ -44,7 +44,7 @@ sample-by-sample results* to debug and see how your models stack-up.
 
 Lighteval supports **1000+ evaluation tasks** across multiple domains and
 languages. Use [this
-space](https://huggingface.co/spaces/SaylorTwift/benchmark_finder) to find what
+space](https://huggingface.co/spaces/OpenEvals/open_benchmark_index) to find what
 you need, or, here's an overview of some *popular benchmarks*:
@@ -107,6 +107,7 @@ huggingface-cli login
 
 Lighteval offers the following entry points for model evaluation:
 
+- `lighteval eval`: Evaluate models using [inspect-ai](https://inspect.aisi.org.uk/) as a backend (preferred).
 - `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
   Accelerate](https://github.com/huggingface/accelerate)
 - `lighteval nanotron`: Evaluate models in distributed settings using [⚡️
@@ -126,9 +127,7 @@ Did not find what you need ? You can always make your custom model API by follow
 Here's a **quick command** to evaluate using the *Accelerate backend*:
 
 ```shell
-lighteval accelerate \
-    "model_name=gpt2" \
-    "leaderboard|truthfulqa:mc|0"
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
 ```
 
 Or use the **Python API** to run a model *already loaded in memory*!
diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index d3b9c9d9b..d3c33cdab 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -7,6 +7,8 @@
     title: Quicktour
   title: Getting started
 - sections:
+  - local: inspect-ai
+    title: Examples using Inspect-AI
   - local: saving-and-reading-results
     title: Save and read results
   - local: caching
diff --git a/docs/source/available-tasks.mdx b/docs/source/available-tasks.mdx
index 450b7ed49..57605577a 100644
--- a/docs/source/available-tasks.mdx
+++ b/docs/source/available-tasks.mdx
@@ -1,6 +1,8 @@
+# Available tasks
+Browse and inspect tasks available in LightEval.
-### Save Results
+#### Run your benchmark and push details to the Hub
 
 ```bash
-# Save locally
-lighteval accelerate \
-    "model_name=openai-community/gpt2" \
-    "leaderboard|truthfulqa:mc|0" \
-    --output-dir ./results
-
-# Push to Hugging Face Hub
-lighteval accelerate \
-    "model_name=openai-community/gpt2" \
-    "leaderboard|truthfulqa:mc|0" \
-    --push-to-hub \
-    --results-org your-username
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" \
+    "lighteval|gpqa:diamond|0" \
+    --bundle-dir gpt-oss-bundle \
+    --repo-id OpenEvals/evals
 ```
+
+Resulting Space:
+
diff --git a/docs/source/inspect-ai.mdx b/docs/source/inspect-ai.mdx
new file mode 100644
index 000000000..9cdeb8802
--- /dev/null
+++ b/docs/source/inspect-ai.mdx
@@ -0,0 +1,120 @@
+# Evaluate your model with Inspect-AI
+
+Pick the right benchmarks with our benchmark finder:
+Search by language, task type, dataset name, or keywords.
+
+> [!WARNING]
+> Not all tasks are compatible with inspect-ai's API yet; we are working on converting all of them!
+
+Once you've chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.
+
+### Examples
+
+1. Evaluate a model via Hugging Face Inference Providers.
+
+```bash
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
+```
+
+2. Run multiple evals at the same time.
+
+```bash
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0,lighteval|aime25|0"
+```
+
+3. Compare providers for the same model.
+
+```bash
+lighteval eval \
+    hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
+    hf-inference-providers/openai/gpt-oss-20b:together \
+    hf-inference-providers/openai/gpt-oss-20b:nebius \
+    "lighteval|gpqa:diamond|0"
+```
+
+4. Evaluate a vLLM or SGLang model.
+
+```bash
+lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
+```
+
+5. See the impact of few-shot examples on your model.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0,lighteval|gsm8k|5"
+```
+
+6. Optimize custom server connections.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
+    --max-connections 50 \
+    --timeout 30 \
+    --retry-on-error 1 \
+    --max-retries 1 \
+    --max-samples 10
+```
+
+7. Use multiple epochs for more reliable results.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --epochs 16 --epochs-reducer "pass_at_4"
+```
+
+8. Push to the Hub to share results.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|hle|0" \
+    --bundle-dir gpt-oss-bundle \
+    --repo-id OpenEvals/evals \
+    --max-samples 100
+```
+
+Resulting Space:
+
+9. Change model behaviour.
+
+You can use any argument defined in inspect-ai's API.
+
+```bash
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --temperature 0.1
+```
+
+10. Use `--model-args` to pass any provider-specific argument.
+
+```bash
+lighteval eval google/gemini-2.5-pro "lighteval|aime25|0" --model-args location=us-east5
+```
+
+```bash
+lighteval eval openai/gpt-4o "lighteval|gpqa:diamond|0" --model-args service_tier=flex,client_timeout=1200
+```
+
+LightEval prints a per-model results table:
+
+```
+Completed all tasks in 'lighteval-logs' successfully
+
+| Model                                 |gpqa|gpqa:diamond|
+|---------------------------------------|---:|-----------:|
+|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01|        0.01|
+
+results saved to lighteval-logs
+run "inspect view --log-dir lighteval-logs" to view the results
+```
diff --git a/docs/source/quicktour.mdx b/docs/source/quicktour.mdx
index e22ed3223..a8cf504b5 100644
--- a/docs/source/quicktour.mdx
+++ b/docs/source/quicktour.mdx
@@ -11,7 +11,7 @@ Lighteval can be used with several different commands, each optimized for differ
 
 ## Find your benchmark