This repository provides an evaluation system for the SimpleQA benchmark, comparing different search providers.
- Evaluation of different search providers
- Customizable configuration for each provider
- Parallel independent evaluation
- Resume an interrupted evaluation from the point of failure
The table below presents evaluation results across various search providers and LLMs on the SimpleQA benchmark. NOTE: For transparency and accuracy, for supported providers we report the higher of our internal evaluation results and the officially reported scores; for other providers, we display their publicly reported results.
| Provider | Accuracy |
|---|---|
| Tavily | 93.3% |
| Perplexity Sonar-Pro | 88.8% |
| Serper Search | 82.2% |
| Brave Search | 76.1% |
| Exa Search (link) | 90.04% |
| OpenAI Web Search (link) | 90% |
| GPT 4.5 (link) | 62.5% |
| Gemini 2.5 Pro (link) | 50.8% |
- Clone the repository:

  ```bash
  git clone https://github.com/simpleQA-eval.git
  cd simpleQA-eval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file in the root directory and add the following:

  ```
  TAVILY_API_KEY=XXX
  OPENAI_API_KEY=XXX
  EXA_API_KEY=XXX
  PERPLEXITY_API_KEY=XXX
  SERPER_API_KEY=XXX
  BRAVE_API_KEY=XXX
  ```

- Run:

  ```bash
  python run_evaluation.py
  ```
The following command-line arguments are supported:

- `--csv_path`: Path to the CSV file with questions and answers (default: `datasets/simple_qa_test_set.csv`)
- `--config`: Path to the JSON config file with provider parameters (default: `configs/config.json`)
- `--start_index`: Starting index for examples (inclusive, default: 0)
- `--end_index`: Ending index for examples (exclusive, default: all examples)
- `--random_sample`: Number of random samples to select (overrides start/end index)
- `--post_process_model`: Model used for post-processing (default: `gpt-4.1-mini`)
- `--output_dir`: Directory to save results (default: `results`)
- `--sequential`: Run providers sequentially instead of in parallel
- `--rerun`: Continue evaluation on an existing results directory (the `--output_dir` folder must exist)
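For example, an illustrative run that evaluates 50 randomly sampled questions and writes results to a custom directory (the directory name here is just a placeholder):

```bash
python run_evaluation.py --random_sample 50 --output_dir results/sample_run
```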
A configuration file (`config.json`) might look like:

```json
{
  "tavily": {
    "search_depth": "advanced",
    "include_raw_content": true,
    "max_results": 10
  },
  "perplexity": {
    "model": "sonar-pro"
  }
}
```

Each top-level key names a supported provider, and its value contains the parameters used for that provider.
The script generates two types of output files in the specified output directory:
- Detailed results CSV for each provider (questions, answers, and evaluation grades)
- Summary CSV with accuracy metrics for all providers
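A quick way to compare providers afterwards is to load the summary CSV with pandas. The filename and column name below are assumptions for illustration; check the files actually written to your output directory:

```python
import pandas as pd

# NOTE: "summary.csv" and the "accuracy" column are assumptions for illustration;
# inspect your output directory for the actual file and column names.
summary = pd.read_csv("results/summary.csv")
print(summary.sort_values("accuracy", ascending=False))
```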
If your evaluation is interrupted, you can continue from where it stopped using the `--rerun` flag (the `output_dir` folder must exist and contain the previous run's partial results):

```bash
python run_evaluation.py --output_dir results/my_evaluation --rerun
```
This will:
- Load existing results from the specified output directory
- Skip questions that have already been evaluated
- Continue with the remaining questions in the dataset
- Update the summary statistics with all results when complete
The currently supported search providers are:

- `tavily`
- `perplexity`
- `gptr`
- `exa`
- `serper`
- `brave`
You can extend the system to evaluate additional search providers by following these steps:
- Create a new handler file in the `handlers` directory (e.g., `handlers/new_provider_handler.py`); see the sketch at the end of this section.
- Add your provider to the handler registry:
  - Update `handlers/__init__.py` to import and expose your new handler.
  - Update the `get_search_handlers` function in `app.py` and `run_benchmark.py` to include your new provider.
- Update environment variables: add your provider's API key to the `.env` file:

  ```
  NEW_PROVIDER_API_KEY=your_api_key_here
  ```

- Use your provider in the evaluation config:

  ```json
  {
    "new_provider": {
      "custom_param1": "value1",
      "custom_param2": "value2"
    }
  }
  ```
Remember to implement appropriate error handling and respect any rate limits or API constraints for your new provider.
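For orientation, below is a minimal sketch of what a new `handlers/new_provider_handler.py` could contain. The function name, signature, return format, and API endpoint are all assumptions for illustration, not the repository's actual handler interface; mirror whatever the existing handlers in `handlers/` do.

```python
import os

import requests


def new_provider_search(query: str, **params) -> str:
    """Hypothetical handler: query the provider's search API and return text
    that the evaluation pipeline can grade. Adapt the request/response handling
    to your provider's real API and to the interface used by existing handlers."""
    api_key = os.environ["NEW_PROVIDER_API_KEY"]
    try:
        response = requests.post(
            "https://api.new-provider.example/search",  # placeholder endpoint
            headers={"Authorization": f"Bearer {api_key}"},
            json={"query": query, **params},  # params come from the provider's config block
            timeout=30,
        )
        response.raise_for_status()
        results = response.json().get("results", [])
        # Concatenate result snippets into a single context string.
        return "\n\n".join(r.get("content", "") for r in results)
    except requests.RequestException as exc:
        # Basic error handling; consider retries/backoff to respect rate limits.
        return f"ERROR: search request failed ({exc})"
```

After adding the file, register it in `handlers/__init__.py` and `get_search_handlers` as described in the steps above.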