Important
This repository is forked and modified from OpenAI's simple-evals GitHub repository and is used to run evaluations for projects such as Agent TARS.
This repository contains the implementation of the BrowseComp benchmark for evaluating various AI systems, including language models, Python-based agents, and executable agents shipped in binary format.
BrowseComp is designed to evaluate the browsing capabilities of language models. It tests a model's ability to understand web content, answer questions accurately, and provide a confidence score for each answer.
By default, this benchmark evaluates examples in parallel using multiple threads. This significantly speeds up evaluation when dealing with many examples, especially for API-based models.
- Default: Parallel execution with thread count equal to your system's CPU core count
- Debug Mode: To switch to sequential (serial) execution for debugging purposes:
# Run in sequential mode for debugging
debug=True python browsecomp.py [other arguments]
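The switch between parallel and serial dispatch typically looks like the sketch below; `run_all`, `run_example`, and the flag handling are illustrative stand-ins, not necessarily the exact names used in `browsecomp.py`.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_all(examples, run_example):
    """Dispatch examples in parallel by default, serially when debugging.

    Illustrative sketch only: the helper names and flag handling are
    stand-ins for whatever browsecomp.py uses internally.
    """
    if os.environ.get("debug", "").lower() in ("1", "true"):
        # Sequential mode: simpler stack traces, easier to step through.
        return [run_example(example) for example in examples]
    # Parallel mode: one worker thread per CPU core.
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(run_example, examples))
```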
# Install dependencies
pip install -r requirements.txt
You also need to set your OpenAI API key:
export OPENAI_API_KEY=your_openai_api_key_here
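The OpenAI Python client reads `OPENAI_API_KEY` from the environment, so an optional sanity check from Python (shown here purely as a convenience) looks like this:

```python
import os
from openai import OpenAI

# Fail fast if the key was not exported in the current shell.
if not os.environ.get("OPENAI_API_KEY"):
    raise SystemExit("OPENAI_API_KEY is not set; export it before running the benchmark")

client = OpenAI()  # picks up the exported key automatically
```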
# For Windows
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# When you're done, deactivate the environment
deactivate
# For macOS/Linux
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# When you're done, deactivate the environment
deactivate
# Using Conda (cross-platform)
# Create a conda environment
conda create -n browsecomp python=3.10
# Activate the conda environment
conda activate browsecomp
# Install dependencies
pip install -r requirements.txt
# When you're done, deactivate the environment
conda deactivate
# Evaluate a specific OpenAI model
python browsecomp.py --model-name gpt-4 --examples 5
# Using a custom Python script implementation
python browsecomp.py --python-script /path/to/your/script.py --model-name gpt-4-turbo --examples 5
# Using a custom CLI command
python browsecomp.py --command "your-cli-tool run" --model-name your-model-id --examples 5
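With `--command`, the benchmark shells out to your tool once per example and captures its stdout as the answer. The snippet below is a rough sketch of that call pattern; it is not the exact subprocess handling in `browsecomp.py`.

```python
import shlex
import subprocess

def run_cli_runner(command: str, model_name: str, prompt: str) -> str:
    """Invoke an external CLI runner and return its stdout as the model's response.

    `command` is the format string passed via --command, e.g. "agent-tars run".
    Illustrative sketch only; the real harness may build the call differently.
    """
    argv = shlex.split(command) + ["--model", model_name, "--input", prompt]
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```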
- `--python-script`: Path to the Python script runner (default: model_runner.py)
- `--command`: CLI command format string (e.g., "agent-tars run")
- `--model-name`: Model name to pass to the runner (optional; required for certain configurations)
- `--examples`: Number of examples to evaluate (default: 10)
- `--grader-model-name`: Model name to use for grading (default: gpt-4)
- `--grader-api-key`: Custom API key for the grader model
- `--grader-base-url`: Custom base URL for the grader API endpoint
Note: `--python-script` and `--command` are mutually exclusive execution modes (a simplified argparse sketch follows below):
- `--python-script`: Executes a Python script with your Python interpreter
- `--command`: Executes a shell command directly, which is useful for compiled programs or complex CLI tools
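A mutually exclusive argparse group is the usual way to express this constraint. The sketch below mirrors the flags and defaults listed above, but it is a simplified illustration, not the repository's actual parser.

```python
import argparse

parser = argparse.ArgumentParser(description="BrowseComp evaluation (simplified sketch)")
mode = parser.add_mutually_exclusive_group()
mode.add_argument("--python-script", default="model_runner.py",
                  help="Path to a Python script runner")
mode.add_argument("--command",
                  help='CLI command format string, e.g. "agent-tars run"')
parser.add_argument("--model-name", help="Model name to pass to the runner")
parser.add_argument("--examples", type=int, default=10,
                    help="Number of examples to evaluate")
parser.add_argument("--grader-model-name", default="gpt-4",
                    help="Model used for grading")
parser.add_argument("--grader-api-key", help="Custom API key for the grader model")
parser.add_argument("--grader-base-url", help="Custom base URL for the grader API")
args = parser.parse_args()  # passing both --python-script and --command is an error
```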
The evaluation will generate an HTML report with the results in the current directory.
To create a custom runner, you need to create either:
- A Python script that:
  - Accepts an `--input` argument with the prompt text
  - Outputs the model's response to stdout
- Or a CLI command that:
  - Accepts `--input "prompt"` and optionally `--model "model_name"` parameters
  - Outputs the model's response to stdout
See `model_runner.py` for a reference implementation of a Python script runner.
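If `model_runner.py` does not fit your setup, a custom Python runner only needs to honour that small contract: parse `--input` (and optionally `--model`), then print the answer to stdout. Below is a minimal sketch of such a runner, using the OpenAI client purely as an example backend; it is not the bundled `model_runner.py`, and the default model name is an assumption.

```python
#!/usr/bin/env python3
"""Minimal custom runner sketch; not the bundled model_runner.py."""
import argparse
from openai import OpenAI  # example backend; any model or agent works here

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Prompt text supplied by browsecomp.py")
    parser.add_argument("--model", default="gpt-4", help="Model name forwarded via --model-name")
    args = parser.parse_args()

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=args.model,
        messages=[{"role": "user", "content": args.input}],
    )
    # browsecomp.py captures stdout as the model's answer, so print it and nothing else.
    print(response.choices[0].message.content)

if __name__ == "__main__":
    main()
```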
Your CLI tool should accept arguments in this format:
your-cli-tool run --model "model-name" --input "prompt text here"
The evaluation will output:
- Accuracy score (percentage of correctly answered questions)
- Detailed metrics on correct and incorrect answers
- An HTML report of the results
The `browsecomp.py` implementation includes:
- Encryption/decryption of test data for security
- Template-based querying of language models
- Automatic grading of responses using a grader model
- Result aggregation and reporting (see the flow sketch below)
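Put together, each example roughly flows through a templated query, grading by the grader model, and aggregation. The outline below illustrates that flow with stand-in templates and callables; it is not the actual code in `browsecomp.py`.

```python
from typing import Callable, Dict, List

# Stand-in templates for illustration; the real prompts live in browsecomp.py.
QUERY_TEMPLATE = "{question}\n\nGive your final answer and a confidence score."
GRADER_TEMPLATE = ("Question: {question}\nCorrect answer: {correct}\n"
                   "Response: {response}\nIs the response correct? Answer yes or no.")

def evaluate(examples: List[Dict[str, str]],
             runner: Callable[[str], str],
             grader: Callable[[str], str]) -> float:
    """Return accuracy over decrypted examples (illustrative sketch only)."""
    correct = 0
    for row in examples:  # assumes question/answer fields were already decrypted
        answer = runner(QUERY_TEMPLATE.format(question=row["question"]))
        verdict = grader(GRADER_TEMPLATE.format(
            question=row["question"], correct=row["answer"], response=answer))
        correct += verdict.strip().lower().startswith("yes")
    return correct / len(examples)
```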
If you're experiencing rate limits or need to use a different OpenAI API endpoint for grading:
python browsecomp.py --command "your-cli-tool run" --model-name your-model-id \
--grader-model-name gpt-4 \
--grader-api-key "your-api-key" \
--grader-base-url "https://your-custom-endpoint/v1"
This allows you to use alternative API endpoints or different API keys for the grader model.
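Under the hood this corresponds to building the grader's client with an explicit key and base URL. A minimal sketch with the OpenAI Python client is shown below; the variable names are illustrative, not those used in `browsecomp.py`.

```python
from openai import OpenAI

# Values correspond to the --grader-api-key / --grader-base-url flags; when they
# are omitted the client falls back to OPENAI_API_KEY and the default endpoint.
grader_client = OpenAI(
    api_key="your-api-key",                      # --grader-api-key
    base_url="https://your-custom-endpoint/v1",  # --grader-base-url
)
grade = grader_client.chat.completions.create(
    model="gpt-4",                               # --grader-model-name
    messages=[{"role": "user", "content": "grading prompt goes here"}],
)
```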