An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
- Evaluate real tool usage across multiple MCP services: Notion, GitHub, Filesystem, Postgres, Playwright.
- Use ready-to-run tasks covering practical workflows, each with strict automated verification.
- Reliable and reproducible: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
- Unified metrics and aggregation: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
- Flexible deployment: local or Docker; fully validated on macOS and Linux.
git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark
Only set what you need: add a service's credentials only when you plan to run that service's tasks.
# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."
# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium" # chromium | firefox
PLAYWRIGHT_HEADLESS="True"
# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2" # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"
# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
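If you keep these variables in a `.env` file, a quick way to confirm that the services you plan to run are fully configured is a small standalone check like the sketch below (it assumes python-dotenv is installed; MCPMark's own loading mechanism may differ):

```python
# Sketch: check that credentials for the services you plan to run are present.
# Assumes the variables live in a .env file in the current directory and that
# python-dotenv is installed; this is a convenience check, not part of MCPMark.
import os

from dotenv import load_dotenv

REQUIRED = {
    "openai": ["OPENAI_BASE_URL", "OPENAI_API_KEY"],
    "notion": ["SOURCE_NOTION_API_KEY", "EVAL_NOTION_API_KEY", "EVAL_PARENT_PAGE_TITLE"],
    "github": ["GITHUB_TOKENS", "GITHUB_EVAL_ORG"],
    "postgres": ["POSTGRES_HOST", "POSTGRES_PORT", "POSTGRES_USERNAME", "POSTGRES_PASSWORD"],
}

load_dotenv()  # reads .env from the current directory
for service, keys in REQUIRED.items():
    missing = [k for k in keys if not os.getenv(k)]
    print(f"{service}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```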
See docs/introduction.md and the service guides below for more details.
Local (Recommended)
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
Docker
./build-docker.sh
Run a filesystem task (no external accounts required):
# --k 1 runs each task once for a quick start; --models accepts any model you have configured
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification
Results are saved to ./results/{exp_name}/{model}__{mcp}/run-*/... (e.g., ./results/test-run/gpt-5__filesystem/run-1/...).
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1
# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1
# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1
# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4
# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all
# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
Please visit docs/introduction.md for the available choices of MODEL.
Tip: MCPMark supports auto-resume. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see RETRYABLE_PATTERNS) are retried automatically. Models may emit different error strings—if you encounter a new resumable error, please open a PR or issue.
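For illustration only, pattern-based retry detection boils down to matching the error text against a list of regular expressions; the patterns and function below are placeholders, not the actual RETRYABLE_PATTERNS shipped with MCPMark.

```python
# Illustrative sketch of pattern-based retry detection; the patterns and the
# function name are placeholders, not MCPMark's actual RETRYABLE_PATTERNS.
import re

RETRYABLE_PATTERNS = [
    r"rate limit",                   # provider throttling
    r"connection (reset|refused)",   # transient network failures
    r"timed? ?out",                  # timeouts in various phrasings
]


def is_retryable(error_message: str) -> bool:
    """Return True if the error message matches a known transient failure."""
    return any(re.search(p, error_message, re.IGNORECASE) for p in RETRYABLE_PATTERNS)
```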
| Service | Setup summary | Docs |
|---|---|---|
| Notion | Environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification. | Guide |
| GitHub | Multi-account token pooling recommended; import pre-exported repo state if needed. | Guide |
| Postgres | Start via Docker and import sample databases. | Setup |
| Playwright | Install browsers before first run; defaults to chromium. | Setup |
| Filesystem | Zero configuration; run directly. | Config |
You can also follow Quickstart for the shortest end-to-end path.
- Results are organized under ./results/{exp_name}/{model}__{mcp}/run-*/ (JSON + CSV per task).
- Generate a summary with:
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp
# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
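For intuition, the per-task multi-run metrics can be read as in the sketch below; this is a simplified interpretation of avg@k, pass@k, and pass^k, and the actual aggregator in src/aggregators/aggregate_results.py may define or estimate them differently.

```python
# Simplified per-task metrics over k runs; the real aggregator logic may differ.
def task_metrics(run_results: list[bool]) -> dict[str, float]:
    k = len(run_results)
    return {
        "avg@k": sum(run_results) / k,      # mean success rate across runs
        "pass@k": float(any(run_results)),  # at least one run succeeded
        "pass^k": float(all(run_results)),  # every run succeeded
    }


# Example: a task that succeeded in 3 of 4 runs
print(task_metrics([True, True, False, True]))
# {'avg@k': 0.75, 'pass@k': 1.0, 'pass^k': 0.0}
```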
- Model support: MCPMark calls models via LiteLLM (see the LiteLLM Doc); a hedged sketch of such a call follows this list. For Anthropic (Claude) extended thinking mode (enabled via `--reasoning-effort`), we use Anthropic's native API.
- See docs/introduction.md for details and configuration of the models supported in MCPMark.
- To add a new model, edit `src/model_config.py`. Before adding one, check the LiteLLM Doc for supported models/providers.
- Task design principles are described in docs/datasets/task.md. Each task ships with an automated `verify.py` for objective, reproducible evaluation; see docs/task.md for details.
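As a rough illustration of the LiteLLM path (not MCPMark's actual agent code, which adds MCP tool calling on top), a direct call looks like the sketch below; the model name, prompt, and use of OPENAI_BASE_URL are assumptions.

```python
# Minimal sketch of a LiteLLM call; illustrative only, not MCPMark's pipeline code.
import os

from litellm import completion

response = completion(
    model="gpt-4o",  # any LiteLLM-supported model identifier
    messages=[{"role": "user", "content": "List the files in /tmp."}],
    api_key=os.environ.get("OPENAI_API_KEY"),
    api_base=os.environ.get("OPENAI_BASE_URL"),  # optional custom endpoint
)
print(response.choices[0].message.content)
```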
Contributions are welcome:
- Add a new task under `tasks/<category_id>/<task_id>/` with `meta.json`, `description.md`, and `verify.py` (see the sketch after this list for the general shape of a verification script).
- Ensure local checks pass and open a PR.
- See docs/contributing/make-contribution.md.
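Purely as an illustration of the general shape (the authoritative interface is in docs/datasets/task.md), a verification script for a filesystem-style task might look like the following; the CLI argument, expected file, and exit-code convention are all hypothetical.

```python
# Hypothetical verify.py sketch: checks that the agent created an expected output
# file inside the task's working directory. The CLI argument and exit-code
# convention are assumptions for illustration, not MCPMark's documented interface.
import sys
from pathlib import Path


def verify(work_dir: Path) -> bool:
    """Return True if the agent produced the expected artifact."""
    report = work_dir / "size_report.md"  # hypothetical expected output
    return report.is_file() and "large" in report.read_text()


if __name__ == "__main__":
    work_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    sys.exit(0 if verify(work_dir) else 1)
```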
If you find our work useful for your research, please consider citing:
@misc{wu2025mcpmark,
title={MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use},
author={Zijian Wu and Xiangyan Liu and Xinyuan Zhang and Lingjun Chen and Fanqing Meng and Lingxiao Du and Yiran Zhao and Fanshi Zhang and Yaoqi Ye and Jiawei Wang and Zirui Wang and Jinjie Ni and Yufan Yang and Arvin Xu and Michael Qizhe Shieh},
year={2025},
eprint={2509.24002},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.24002},
}
This project is licensed under the Apache License 2.0; see LICENSE.