This repository provides an evaluation system for the SimpleQA benchmark, comparing different search providers.
- Evaluation of different search providers
- Customizable configuration for each provider
- Parallel independent evaluation
- Resume an interrupted evaluation from the point of failure
The table below presents evaluation results across various search providers and LLMs on the SimpleQA benchmark. NOTE: For transparency and accuracy, for supported providers we report the higher of our internal evaluation results and the officially reported scores; for other providers, we display their publicly reported results.
| Provider | Accuracy |
|---|---|
| Tavily | 93.3% |
| Perplexity Sonar-Pro | 88.8% |
| Serper Search | 82.2% |
| Brave Search | 76.1% |
| Exa Search (link) | 90.04% |
| OpenAI Web Search (link) | 90% |
| GPT 4.5 (link) | 62.5% |
| Gemini 2.5 Pro (link) | 50.8% |
- Clone the repository:

  ```bash
  git clone https://github.com/simpleQA-eval.git
  cd simpleQA-eval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file in the root directory and add the following:

  ```
  TAVILY_API_KEY=XXX
  OPENAI_API_KEY=XXX
  EXA_API_KEY=XXX
  PERPLEXITY_API_KEY=XXX
  SERPER_API_KEY=XXX
  BRAVE_API_KEY=XXX
  ```

- Run:

  ```bash
  python run_evaluation.py
  ```
The following command-line arguments are supported:

- `--csv_path`: Path to the CSV file with questions and answers (default: `datasets/simple_qa_test_set.csv`)
- `--config`: Path to the JSON config file with provider parameters (default: `configs/config.json`)
- `--start_index`: Starting index for examples (inclusive, default: 0)
- `--end_index`: Ending index for examples (exclusive, default: all examples)
- `--random_sample`: Number of random samples to select (overrides start/end index)
- `--post_process_model`: Model used for post-processing (default: `gpt-4.1-mini`)
- `--output_dir`: Directory to save results (default: `results`)
- `--sequential`: Run providers sequentially instead of in parallel
- `--rerun`: Continue evaluation on an existing results directory (the `--output_dir` folder must exist)
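For example, an illustrative run that evaluates 50 randomly sampled questions and writes results to a custom directory (the directory name here is just a placeholder):

```bash
python run_evaluation.py --random_sample 50 --output_dir results/sample_run
```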
A configuration file (`config.json`) might look like:

```json
{
  "tavily": {
    "search_depth": "advanced",
    "include_raw_content": true,
    "max_results": 10
  },
  "perplexity": {
    "model": "sonar-pro"
  }
}
```

Each top-level key names a supported provider, and its value contains the parameters used for that provider.
The script generates two types of output files in the specified output directory:
- Detailed results CSV for each provider (questions, answers, and evaluation grades)
- Summary CSV with accuracy metrics for all providers
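A quick way to compare providers afterwards is to load the summary CSV with pandas. The filename and column name below are assumptions for illustration; check the files actually written to your output directory:

```python
import pandas as pd

# NOTE: "summary.csv" and the "accuracy" column are assumptions for illustration;
# inspect your output directory for the actual file and column names.
summary = pd.read_csv("results/summary.csv")
print(summary.sort_values("accuracy", ascending=False))
```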
If your evaluation is interrupted, you can continue from where it stopped using the `--rerun` flag (the `output_dir` folder must exist and contain the previous run's partial results):

```bash
python run_evaluation.py --output_dir results/my_evaluation --rerun
```
This will:
- Load existing results from the specified output directory
- Skip questions that have already been evaluated
- Continue with the remaining questions in the dataset
- Update the summary statistics with all results when complete
The currently supported search providers are:

- `tavily`
- `perplexity`
- `gptr`
- `exa`
- `serper`
- `brave`
You can extend the system to evaluate additional search providers by following these steps:
- Create a new handler file in the `handlers` directory (e.g., `handlers/new_provider_handler.py`); see the sketch at the end of this section.
- Add your provider to the handler registry:
  - Update `handlers/__init__.py` to import and expose your new handler.
  - Update the `get_search_handlers` function in `app.py` and `run_benchmark.py` to include your new provider.
- Update environment variables: add your provider's API key to the `.env` file:

  ```
  NEW_PROVIDER_API_KEY=your_api_key_here
  ```

- Use your provider in the evaluation config:

  ```json
  {
    "new_provider": {
      "custom_param1": "value1",
      "custom_param2": "value2"
    }
  }
  ```
Remember to implement appropriate error handling and respect any rate limits or API constraints for your new provider.
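For orientation, below is a minimal sketch of what a new `handlers/new_provider_handler.py` could contain. The function name, signature, return format, and API endpoint are all assumptions for illustration, not the repository's actual handler interface; mirror whatever the existing handlers in `handlers/` do.

```python
import os

import requests


def new_provider_search(query: str, **params) -> str:
    """Hypothetical handler: query the provider's search API and return text
    that the evaluation pipeline can grade. Adapt the request/response handling
    to your provider's real API and to the interface used by existing handlers."""
    api_key = os.environ["NEW_PROVIDER_API_KEY"]
    try:
        response = requests.post(
            "https://api.new-provider.example/search",  # placeholder endpoint
            headers={"Authorization": f"Bearer {api_key}"},
            json={"query": query, **params},  # params come from the provider's config block
            timeout=30,
        )
        response.raise_for_status()
        results = response.json().get("results", [])
        # Concatenate result snippets into a single context string.
        return "\n\n".join(r.get("content", "") for r in results)
    except requests.RequestException as exc:
        # Basic error handling; consider retries/backoff to respect rate limits.
        return f"ERROR: search request failed ({exc})"
```

After adding the file, register it in `handlers/__init__.py` and `get_search_handlers` as described in the steps above.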