DRBench
is the first benchmark of its kind, designed to evaluate deep research agents on complex, open-ended enterprise deep research tasks. It tests an agent’s ability to conduct multi-hop, insight-driven research across public and private data sources, just like a real enterprise analyst.
Explore the DR Questions: DR Questions CSV
Discover the Facts for each DR Question: Facts Directory
uv pip install -e .
python minimal_local.py
This loads task SANITY0, generates a basic report, and saves the results under results/minimal_local.
Install Docker (https://www.docker.com/get-started/)
cd services
make local-build
This takes around 30 minutes and only has to be done once.
python minimal.py
This loads task DR0001, generates a basic report, and saves the results under results/minimal.
Build and evaluate your own research agent in just 4 steps!
First, pick a task to work with:
from drbench import task_loader
task = task_loader.get_task_from_id("DR0001")
See what the task is about:
print(task.summary())
print(task.get_dr_question())
Your agent needs a generate_report method that takes a query and the enterprise environment and returns the report text along with its atomic insights:
class MyAgent:
    def generate_report(self, query, env):
        # Your research logic here:
        # - report_text is the raw report text
        # - insights is the list of atomic insights extracted from the report
        return {"report_insights": insights, "report_text": report_text}
Refer to BasicAgent in drbench/agents/basic_agent.py for a simple example, or use the full DrBenchAgent in drbench/agents/drbench_agent/drbench_agent.py.
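To run your agent end to end before scoring, here is a minimal sketch. Only the calls shown above are assumed; env is a placeholder for the enterprise environment handle provided by your setup (e.g., the local Docker services built earlier):

from drbench import task_loader

task = task_loader.get_task_from_id("DR0001")

# Placeholder: replace with the enterprise environment handle from your setup.
env = None

agent = MyAgent()
report = agent.generate_report(task.get_dr_question(), env)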
See how well your agent did:
from drbench.score_report import score_report
scores = score_report(
    predicted_report=report,
    task=task,
    savedir="my_results"
)
print(f"Insights Recall: {scores['insights_recall']:.3f}")
- 🔎 Real Deep Research Tasks
  Not simple fact lookups: tasks like "What changes should we make to our product roadmap to ensure compliance?" require multi-step reasoning, synthesis, and reporting.
- 🏢 Enterprise Context Grounding
  Each task is rooted in a realistic user persona (e.g., Product Developer) and organizational setting (e.g., ServiceNow), requiring deep understanding and contextual awareness.
- 🧩 Multi-Modal, Multi-Source Reasoning
  Agents must search, retrieve, and reason across:
  - Internal chat logs 💬
  - Cloud file systems 📂
  - Spreadsheets 📊
  - PDFs 📄
  - Websites 🌐
  - Emails 📧
- 🧠 Insight-Centric Evaluation
  Reports are scored on whether agents extract the most critical insights and properly cite their sources.
✅ The first benchmark for deep research across hybrid enterprise environments
✅ A suite of real-world tasks across enterprise use cases like CRM
✅ A realistic simulated enterprise stack (chat, docs, email, web, etc.)
✅ A task generation framework blending web-based facts and local context
✅ A lightweight, scalable evaluation mechanism for insightfulness and citation
Interested in early access, collaboration, or feedback?
- Reach out via email: [email protected]
- Join our Discord channel: https://discord.gg/9rQ6HgBbkd
- Amirhossein Abaskohi – [email protected]
- Tianyi Chen – [email protected]
- Miguel Muñoz – [email protected]
- Curtis Fox - [email protected]
- Alex Drioun – [email protected]
- Issam Laradji – [email protected]
@article{abaskohi2025drbench,
  title={DRBench: A Realistic Benchmark for Enterprise Deep Research},
  author={Abaskohi, Amirhossein and Chen, Tianyi and Mu{\~n}oz-M{\'a}rmol, Miguel and Fox, Curtis and Ramesh, Amrutha Varshini and Marcotte, {\'E}tienne and L{\`u}, Xing Han and Chapados, Nicolas and Gella, Spandana and Pal, Christopher and others},
  journal={arXiv preprint arXiv:2510.00172},
  year={2025}
}