Important
This repository is forked and modified from OpenAI's simple-evals GitHub repository and is used to run evaluations for projects such as Agent TARS.
This repository contains the implementation of the BrowseComp benchmark for evaluating various AI systems, including language models, Python-based agents, and executable agents shipped in binary format.
BrowseComp is designed to evaluate the browsing capabilities of language models. It tests a model's ability to understand web content, answer questions accurately, and provide a confidence score for each answer.
By default, this benchmark evaluates examples in parallel using multiple threads. This significantly speeds up evaluation when dealing with many examples, especially for API-based models.
- Default: Parallel execution with thread count equal to your system's CPU core count
- Debug Mode: To switch to sequential (serial) execution for debugging purposes:
# Run in sequential mode for debugging
debug=True python browsecomp.py [other arguments]
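The switch between parallel and serial dispatch typically looks like the sketch below; `run_all`, `run_example`, and the flag handling are illustrative stand-ins, not necessarily the exact names used in `browsecomp.py`.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_all(examples, run_example):
    """Dispatch examples in parallel by default, serially when debugging.

    Illustrative sketch only: the helper names and flag handling are
    stand-ins for whatever browsecomp.py uses internally.
    """
    if os.environ.get("debug", "").lower() in ("1", "true"):
        # Sequential mode: simpler stack traces, easier to step through.
        return [run_example(example) for example in examples]
    # Parallel mode: one worker thread per CPU core.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(run_example, examples))
```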
# Install dependencies
pip install -r requirements.txt
You also need to set your OpenAI API key:
export OPENAI_API_KEY=your_openai_api_key_here
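The OpenAI Python client reads `OPENAI_API_KEY` from the environment, so an optional sanity check from Python (shown here purely as a convenience) looks like this:

```python
import os
from openai import OpenAI

# Fail fast if the key was not exported in the current shell.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it before running the benchmark")

client = OpenAI()  # picks up the exported key automatically
```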
# For Windows
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# When you're done, deactivate the environment
deactivate
# For macOS/Linux
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# When you're done, deactivate the environment
deactivate
# Using Conda (cross-platform)
# Create a conda environment
conda create -n browsecomp python=3.10
# Activate the conda environment
conda activate browsecomp
# Install dependencies
pip install -r requirements.txt
# When you're done, deactivate the environment
conda deactivate
# Evaluate a specific OpenAI model
python browsecomp.py --model-name gpt-4 --examples 5
# Using a custom Python script implementation
python browsecomp.py --python-script /path/to/your/script.py --model-name gpt-4-turbo --examples 5
# Using a custom CLI command
python browsecomp.py --command "your-cli-tool run" --model-name your-model-id --examples 5
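With `--command`, the benchmark shells out to your tool once per example and captures its stdout as the answer. The snippet below is a rough sketch of that call pattern; it is not the exact subprocess handling in `browsecomp.py`.

```python
import shlex
import subprocess

def run_cli_runner(command: str, model_name: str, prompt: str) -> str:
    """Invoke an external CLI runner and return its stdout as the model's response.

    `command` is the format string passed via --command, e.g. "agent-tars run".
    Illustrative sketch only; the real harness may build the call differently.
    """
    argv = shlex.split(command) + ["--model", model_name, "--input", prompt]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```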
- `--python-script`: Path to the Python script runner (default: model_runner.py)
- `--command`: CLI command format string (e.g., "agent-tars run")
- `--model-name`: Model name to pass to the runner (optional; required for certain configurations)
- `--examples`: Number of examples to evaluate (default: 10)
- `--grader-model-name`: Model name to use for grading (default: gpt-4)
- `--grader-api-key`: Custom API key for the grader model
- `--grader-base-url`: Custom base URL for the grader API endpoint
Note: `--python-script` and `--command` are mutually exclusive execution modes (a simplified argparse sketch follows below):
- `--python-script`: Executes a Python script with your Python interpreter
- `--command`: Executes a shell command directly, which is useful for compiled programs or complex CLI tools
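A mutually exclusive argparse group is the usual way to express this constraint. The sketch below mirrors the flags and defaults listed above, but it is a simplified illustration, not the repository's actual parser.

```python
import argparse

parser = argparse.ArgumentParser(description="BrowseComp evaluation (simplified sketch)")
mode = parser.add_mutually_exclusive_group()
mode.add_argument("--python-script", default="model_runner.py",
                  help="Path to a Python script runner")
mode.add_argument("--command",
                  help='CLI command format string, e.g. "agent-tars run"')
parser.add_argument("--model-name", help="Model name to pass to the runner")
parser.add_argument("--examples", type=int, default=10,
                    help="Number of examples to evaluate")
parser.add_argument("--grader-model-name", default="gpt-4",
                    help="Model used for grading")
parser.add_argument("--grader-api-key", help="Custom API key for the grader model")
parser.add_argument("--grader-base-url", help="Custom base URL for the grader API")
args = parser.parse_args()  # passing both --python-script and --command is an error
```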
The evaluation will generate an HTML report with the results in the current directory.
To create a custom runner, you need to create either:
- A Python script that:
  - Accepts an `--input` argument with the prompt text
  - Outputs the model's response to stdout
- Or a CLI command that:
  - Accepts `--input "prompt"` and optionally `--model "model_name"` parameters
  - Outputs the model's response to stdout
See `model_runner.py` for a reference implementation of a Python script runner.
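If `model_runner.py` does not fit your setup, a custom Python runner only needs to honour that small contract: parse `--input` (and optionally `--model`), then print the answer to stdout. Below is a minimal sketch of such a runner, using the OpenAI client purely as an example backend; it is not the bundled `model_runner.py`, and the default model name is an assumption.

```python
#!/usr/bin/env python3
"""Minimal custom runner sketch; not the bundled model_runner.py."""
import argparse
from openai import OpenAI  # example backend; any model or agent works here

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Prompt text supplied by browsecomp.py")
    parser.add_argument("--model", default="gpt-4", help="Model name forwarded via --model-name")
    args = parser.parse_args()

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=args.model,
        messages=[{"role": "user", "content": args.input}],
    )
    # browsecomp.py captures stdout as the model's answer, so print it and nothing else.
    print(response.choices[0].message.content)

if __name__ == "__main__":
    main()
```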
Your CLI tool should accept arguments in this format:
your-cli-tool run --model "model-name" --input "prompt text here"
The evaluation will output:
- Accuracy score (percentage of correctly answered questions)
- Detailed metrics on correct and incorrect answers
- An HTML report of the results
The `browsecomp.py` implementation includes:
- Encryption/decryption of test data for security
- Template-based querying of language models
- Automatic grading of responses using a grader model
- Result aggregation and reporting (see the flow sketch below)
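Put together, each example roughly flows through a templated query, grading by the grader model, and aggregation. The outline below illustrates that flow with stand-in templates and callables; it is not the actual code in `browsecomp.py`.

```python
from typing import Callable, Dict, List

# Stand-in templates for illustration; the real prompts live in browsecomp.py.
QUERY_TEMPLATE = "{question}\n\nGive your final answer and a confidence score."
GRADER_TEMPLATE = ("Question: {question}\nCorrect answer: {correct}\n"
                   "Response: {response}\nIs the response correct? Answer yes or no.")

def evaluate(examples: List[Dict[str, str]],
             runner: Callable[[str], str],
             grader: Callable[[str], str]) -> float:
    """Return accuracy over decrypted examples (illustrative sketch only)."""
    correct = 0
    for row in examples:  # assumes question/answer fields were already decrypted
        answer = runner(QUERY_TEMPLATE.format(question=row["question"]))
        verdict = grader(GRADER_TEMPLATE.format(
            question=row["question"], correct=row["answer"], response=answer))
        correct += verdict.strip().lower().startswith("yes")
    return correct / len(examples)
```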
If you're experiencing rate limits or need to use a different OpenAI API endpoint for grading:
python browsecomp.py --command "your-cli-tool run" --model-name your-model-id \
--grader-model-name gpt-4 \
--grader-api-key "your-api-key" \
--grader-base-url "https://your-custom-endpoint/v1"
This allows you to use alternative API endpoints or different API keys for the grader model.
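Under the hood this corresponds to building the grader's client with an explicit key and base URL. A minimal sketch with the OpenAI Python client is shown below; the variable names are illustrative, not those used in `browsecomp.py`.

```python
from openai import OpenAI

# Values correspond to the --grader-api-key / --grader-base-url flags; when they
# are omitted the client falls back to OPENAI_API_KEY and the default endpoint.
grader_client = OpenAI(
    api_key="your-api-key",                      # --grader-api-key
    base_url="https://your-custom-endpoint/v1",  # --grader-base-url
)
grade = grader_client.chat.completions.create(
    model="gpt-4",                               # --grader-model-name
    messages=[{"role": "user", "content": "grading prompt goes here"}],
)
```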