DrBench Enterprise Research Benchmark


Read the Paper · Join our Discord

DRBench is a first-of-its-kind benchmark designed to evaluate deep research agents on complex, open-ended enterprise research tasks.

It tests an agent’s ability to conduct multi-hop, insight-driven research across public and private data sources, just like a real enterprise analyst.

Data Overview

Explore the DR Questions: DR Questions CSV

Discover the Facts for each DR Question: Facts Directory

Quick Start

Install Requirements

uv pip install -e .

(1) Quick Run (Without Docker)

python minimal_local.py 

This loads task SANITY0, generates a basic report, and saves the results under results/minimal_local.
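
If you prefer to drive the same step from Python instead of the script, here is a minimal sketch using the task loader documented below (assuming SANITY0 is a valid task ID, as the script's output suggests):

from drbench import task_loader

# Load the same sanity-check task that minimal_local.py uses and inspect it.
task = task_loader.get_task_from_id("SANITY0")
print(task.get_dr_question())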

(2) Quick Run (With Docker)

cd services
make local-build

This takes around 30 minutes and only has to be done once.

Run the agent in the Docker environment:

python minimal.py 

This loads task DR0001, generates a basic report, and saves the results under results/minimal.

(3) Test Your Own Agent

Build and evaluate your own research agent in just 4 steps!

(a) Load a Task

First, pick a task to work with:

from drbench import task_loader
task = task_loader.get_task_from_id("DR0001")

See what the task is about:

print(task.summary())
print(task.get_dr_question())

(b) Create Your Agent

Your agent needs a generate_report method that takes a question and an environment handle and returns the report text along with its insights:

class MyAgent:
    def generate_report(self, query, env):
        # Your research logic here: use `env` to search the enterprise
        # environment, then draft a report answering `query`.
        report_text = "..."  # the full report as raw text
        insights = ["..."]   # the list of atomic insights extracted from the report
        return {"report_insights": insights, "report_text": report_text}

Refer to BasicAgent in drbench/agents/basic_agent.py for a simple example.

Or use the full DrBenchAgent in drbench/agents/drbench_agent/drbench_agent.py.
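
Before evaluating, run your agent on the task's research question to produce the report used in the next step. A minimal sketch, assuming the env argument is whatever environment handle your setup provides (for example, the Docker-based environment from step (2)); None below is only a placeholder for a purely local run:

agent = MyAgent()

# env is an assumption here: pass the environment handle your setup expects
# (e.g., the Docker-based enterprise environment); None stands in for a local run.
report = agent.generate_report(task.get_dr_question(), env=None)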

(d) Evaluate Your Report

See how well your agent did:

from drbench.score_report import score_report
scores = score_report(
    predicted_report=report,
    task=task,
    savedir="my_results"
)

print(f"Insights Recall: {scores['insights_recall']:.3f}")

🧠 Why DRBench?

  • 🔎 Real Deep Research Tasks
    These are not simple fact lookups: tasks like "What changes should we make to our product roadmap to ensure compliance?" require multi-step reasoning, synthesis, and reporting.

  • 🏢 Enterprise Context Grounding
    Each task is rooted in a realistic user persona (e.g., Product Developer) and organizational setting (e.g., ServiceNow), requiring deep contextual awareness of the organization.

  • 🧩 Multi-Modal, Multi-Source Reasoning
    Agents must search, retrieve, and reason across:

    • Internal chat logs 💬
    • Cloud file systems 📂
    • Spreadsheets 📊
    • PDFs 📄
    • Websites 🌐
    • Emails 📧
  • 🧠 Insight-Centric Evaluation
    Reports are scored based on whether agents extract the most critical insights and properly cite their sources.


📦 What You Get

✅ The first benchmark for deep research across hybrid enterprise environments
✅ A suite of real-world tasks across enterprise use cases like CRM
✅ A realistic simulated enterprise stack (chat, docs, email, web, etc.)
✅ A task generation framework blending web-based facts and local context
✅ A lightweight, scalable evaluation mechanism for insightfulness and citation


🤝 Get Involved

Interested in early access, collaboration, or feedback?


🤝 Core Contributors

Citation

@article{abaskohi2025drbench,
  title={DRBench: A Realistic Benchmark for Enterprise Deep Research},
  author={Abaskohi, Amirhossein and Chen, Tianyi and Mu{\~n}oz-M{\'a}rmol, Miguel and Fox, Curtis and Ramesh, Amrutha Varshini and Marcotte, {\'E}tienne and L{\`u}, Xing Han and Chapados, Nicolas and Gella, Spandana and Pal, Christopher and others},
  journal={arXiv preprint arXiv:2510.00172},
  year={2025}
}