DRBench
is the first benchmark of its kind, designed to evaluate deep research agents on complex, open-ended enterprise deep research tasks. It tests an agent’s ability to conduct multi-hop, insight-driven research across public and private data sources, just like a real enterprise analyst.
Explore the DR Questions: DR Questions CSV
Discover the Facts for each DR Question: Facts Directory
uv pip install -e .
python minimal_local.py
This loads task SANITY0, generates a basic report, and saves the results under results/minimal_local.
Install Docker (https://www.docker.com/get-started/)
cd services
make local-build
This takes around 30 minutes and only has to be done once.
python minimal.py
This loads task DR0001, generates a basic report, and saves the results under results/minimal.
Build and evaluate your own research agent in just 4 steps!
First, pick a task to work with:
from drbench import task_loader
task = task_loader.get_task_from_id("DR0001")
See what the task is about:
print(task.summary())
print(task.get_dr_question())
Your agent needs a generate_report method that takes a query and the enterprise environment and returns the report text along with its atomic insights:
class MyAgent:
    def generate_report(self, query, env):
        # Your research logic here:
        # - report_text is the raw report text
        # - insights is the list of atomic insights extracted from the report
        return {"report_insights": insights, "report_text": report_text}
Refer to BasicAgent in drbench/agents/basic_agent.py for a simple example, or use the full DrBenchAgent in drbench/agents/drbench_agent/drbench_agent.py.
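To run your agent end to end before scoring, here is a minimal sketch. Only the calls shown above are assumed; env is a placeholder for the enterprise environment handle provided by your setup (e.g., the local Docker services built earlier):

from drbench import task_loader

task = task_loader.get_task_from_id("DR0001")

# Placeholder: replace with the enterprise environment handle from your setup.
env = None

agent = MyAgent()
report = agent.generate_report(task.get_dr_question(), env)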
See how well your agent did:
from drbench.score_report import score_report
scores = score_report(
    predicted_report=report,
    task=task,
    savedir="my_results"
)
print(f"Insights Recall: {scores['insights_recall']:.3f}")
- 🔎 Real Deep Research Tasks
  Not simple fact lookups: tasks like "What changes should we make to our product roadmap to ensure compliance?" require multi-step reasoning, synthesis, and reporting.
- 🏢 Enterprise Context Grounding
  Each task is rooted in a realistic user persona (e.g., Product Developer) and organizational setting (e.g., ServiceNow), requiring deep understanding and contextual awareness.
- 🧩 Multi-Modal, Multi-Source Reasoning
  Agents must search, retrieve, and reason across:
  - Internal chat logs 💬
  - Cloud file systems 📂
  - Spreadsheets 📊
  - PDFs 📄
  - Websites 🌐
  - Emails 📧
- 🧠 Insight-Centric Evaluation
  Reports are scored on whether agents extract the most critical insights and properly cite their sources.
✅ The first benchmark for deep research across hybrid enterprise environments
✅ A suite of real-world tasks across enterprise use cases like CRM
✅ A realistic simulated enterprise stack (chat, docs, email, web, etc.)
✅ A task generation framework blending web-based facts and local context
✅ A lightweight, scalable evaluation mechanism for insightfulness and citation
Interested in early access, collaboration, or feedback?
- Reach out via email: [email protected]
- Join our Discord channel: https://discord.gg/9rQ6HgBbkd
- Amirhossein Abaskohi – [email protected]
- Tianyi Chen – [email protected]
- Miguel Muñoz – [email protected]
- Curtis Fox - [email protected]
- Alex Drioun – [email protected]
- Issam Laradji – [email protected]
@article{abaskohi2025drbench,
  title={DRBench: A Realistic Benchmark for Enterprise Deep Research},
  author={Abaskohi, Amirhossein and Chen, Tianyi and Mu{\~n}oz-M{\'a}rmol, Miguel and Fox, Curtis and Ramesh, Amrutha Varshini and Marcotte, {\'E}tienne and L{\`u}, Xing Han and Chapados, Nicolas and Gella, Spandana and Pal, Christopher and others},
  journal={arXiv preprint arXiv:2510.00172},
  year={2025}
}