
MCPMark: Stress-Testing Comprehensive MCP Use

Website arXiv Discord Docs Hugging Face

An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).

MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.

What you can do with MCPMark

  • Evaluate real tool usage across multiple MCP services: Notion, GitHub, Filesystem, Postgres, Playwright.
  • Use ready-to-run tasks covering practical workflows, each with strict automated verification.
  • Reliable and reproducible: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
  • Unified metrics and aggregation: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
  • Flexible deployment: local or Docker; fully validated on macOS and Linux.

Quickstart (5 minutes)

1) Clone the repository

git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark

2) Set environment variables (create .mcp_env at repo root)

Only set what you need. Add service credentials when running tasks for that service.

# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."

# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium"   # chromium | firefox
PLAYWRIGHT_HEADLESS="True"

# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2"   # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"

# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"

See docs/introduction.md and the service guides below for more details.

3) Install and run a minimal example

Local (Recommended)

pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install

Docker

./build-docker.sh

Run a filesystem task (no external accounts required):

# Run once (--k 1) for a quick start; swap gpt-5 for any model you have configured
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification

Results are saved to ./results/{exp_name}/{model}__{mcp}/run-*/... (e.g., ./results/test-run/gpt-5__filesystem/run-1/...).


Run your evaluations

Single run (k=1)

# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1

# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1

# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1

# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1

Multiple runs (k>1) for pass@k

# Run k=4 to compute stability metrics (requires --exp-name to aggregate final results)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4

# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
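
For reference: pass@1/pass@k are usually estimated with the standard unbiased estimator, avg@k is the mean success rate over the k runs, and pass^k asks that all k runs succeed. Below is a minimal sketch of these metrics in Python; this is the common formulation, and the exact definitions used by MCPMark's aggregator (src.aggregators.aggregate_results) may differ in details.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n attempts of which c succeeded, is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def avg_at_k(successes: list[bool]) -> float:
    """avg@k: mean success rate over the runs of one task."""
    return sum(successes) / len(successes)

def pass_hat_k(successes: list[bool], k: int) -> float:
    """pass^k, one simple estimator: (c/n)**k, the chance that k
    independent runs all succeed."""
    n, c = len(successes), sum(successes)
    return (c / n) ** k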

Run with Docker

# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all

# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker

See docs/introduction.md for the available choices of MODEL.

Tip: MCPMark supports auto-resume. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see RETRYABLE_PATTERNS) are retried automatically. Models may emit different error strings—if you encounter a new resumable error, please open a PR or issue.
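
For intuition only, retryable-failure detection of this kind usually boils down to matching error strings against a list of patterns. The snippet below is a generic illustration with made-up patterns, not the actual contents of MCPMark's RETRYABLE_PATTERNS:

import re

# Illustrative patterns only; the real list is RETRYABLE_PATTERNS in the MCPMark codebase.
EXAMPLE_PATTERNS = [r"rate.?limit", r"timed? ?out", r"temporarily unavailable"]

def is_retryable(error_message: str) -> bool:
    """Return True if the error message matches any retryable pattern."""
    return any(re.search(p, error_message, re.IGNORECASE) for p in EXAMPLE_PATTERNS)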


Service setup and authentication

  • Notion: environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification. See the Notion Guide.
  • GitHub: multi-account token pooling recommended; import pre-exported repo state if needed. See the GitHub Guide.
  • Postgres: start via Docker and import sample databases (a generic Docker sketch follows below). See the Postgres Setup guide.
  • Playwright: install browsers before first run; defaults to chromium. See the Playwright Setup guide.
  • Filesystem: zero configuration; run directly. See the Filesystem Config guide.

You can also follow Quickstart for the shortest end-to-end path.
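
For the Postgres item above, if you only need a throwaway local server matching the default POSTGRES_* values in .mcp_env, a generic Docker command is enough. The image tag and credentials here are illustrative; the Postgres setup guide provides the full instructions, including the sample databases.

docker run -d --name mcpmark-postgres \
  -e POSTGRES_PASSWORD=password \
  -p 5432:5432 \
  postgres:16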


Results and metrics

  • Results are organized under ./results/{exp_name}/{model}__{mcp}/run-*/ (JSON + CSV per task).
  • Generate a summary with:
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp

# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
  • Only models with complete results across all tasks and runs are included in the final summary.
  • Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.

Models and Tasks

  • Model support: MCPMark calls models via LiteLLM (see the LiteLLM Doc). For Anthropic (Claude) extended thinking, enabled via --reasoning-effort, MCPMark uses Anthropic's native API. A minimal standalone LiteLLM call is sketched after this list.
  • See docs/introduction.md for details and configuration of the models supported in MCPMark.
  • To add a new model, edit src/model_config.py; first check that the model and provider are supported by LiteLLM (see the LiteLLM Doc).
  • Task design principles are described in docs/datasets/task.md. Each task ships with an automated verify.py for objective, reproducible evaluation; see docs/task.md for details.
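
As a reference point, a standalone LiteLLM call, separate from MCPMark's own wrapper, looks like the sketch below. The model name and prompt are only examples, and the matching provider key (e.g. OPENAI_API_KEY) must be set in the environment.

from litellm import completion

response = completion(
    model="gpt-5",  # any LiteLLM-supported model identifier works here
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)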

Contributing

Contributions are welcome:

  1. Add a new task under tasks/<category_id>/<task_id>/ with meta.json, description.md, and verify.py (a hypothetical skeleton is sketched below).
  2. Ensure local checks pass and open a PR.
  3. See docs/contributing/make-contribution.md.
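
A hypothetical verify.py skeleton is shown below purely for orientation; the actual interface and metadata schema are defined by docs/datasets/task.md and the existing tasks, so adapt accordingly.

# verify.py (hypothetical skeleton): check the task's end state and report
# success or failure via the exit code. The file checked here is made up.
import sys
from pathlib import Path

def verify() -> bool:
    # Example check: the task asked the agent to produce report.md (illustrative).
    return Path("report.md").exists()

if __name__ == "__main__":
    sys.exit(0 if verify() else 1)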

Citation

If you find our work useful for your research, please consider citing:

@misc{wu2025mcpmark,
      title={MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use}, 
      author={Zijian Wu and Xiangyan Liu and Xinyuan Zhang and Lingjun Chen and Fanqing Meng and Lingxiao Du and Yiran Zhao and Fanshi Zhang and Yaoqi Ye and Jiawei Wang and Zirui Wang and Jinjie Ni and Yufan Yang and Arvin Xu and Michael Qizhe Shieh},
      year={2025},
      eprint={2509.24002},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24002}, 
}

License

This project is licensed under the Apache License 2.0 — see LICENSE.
