Commit a971322

feat: MCP tool calling evaluations in CI/CD (#313)

1 parent ea429b3

16 files changed: +5741 −126 lines

.env.example

Lines changed: 7 additions & 2 deletions

```diff
@@ -1,3 +1,8 @@
 APIFY_TOKEN=
-# ANTHROPIC_API_KEY is only required when you want to run examples/clientStdioChat.js
-ANTHROPIC_API_KEY=
+
+# EVALS
+PHOENIX_API_KEY=
+PHOENIX_HOST=
+
+OPENROUTER_API_KEY=
+OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
```

.github/workflows/evaluations.yaml

Lines changed: 46 additions & 0 deletions

```yaml
# This workflow runs MCP tool calling evaluations on master branch merges.
# It evaluates AI models' ability to correctly identify and call MCP tools.

name: MCP tool calling evaluations

on:
  # Run evaluations on master branch merges
  push:
    branches:
      - 'master'
  # Also run on PRs with the 'validated' label for testing
  pull_request:
    types: [labeled, synchronize, reopened]

jobs:
  evaluations:
    name: MCP tool calling evaluations
    runs-on: ubuntu-latest
    # Run on master pushes or PRs with the 'validated' label
    if: github.event_name == 'push' || contains(github.event.pull_request.labels.*.name, 'validated')

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Use Node.js 22
        uses: actions/setup-node@v4
        with:
          node-version: 22
          cache: 'npm'
          cache-dependency-path: 'package-lock.json'

      - name: Install Node dependencies
        run: npm ci --include=dev

      - name: Build project
        run: npm run build

      - name: Run evaluations
        run: npm run evals:run
        env:
          GITHUB_PR_NUMBER: ${{ github.event_name == 'pull_request' && github.event.number || 'master' }}
          PHOENIX_API_KEY: ${{ secrets.PHOENIX_API_KEY }}
          PHOENIX_BASE_URL: ${{ secrets.PHOENIX_BASE_URL }}
          OPENROUTER_BASE_URL: ${{ secrets.OPENROUTER_BASE_URL }}
          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
```

.gitignore

Lines changed: 7 additions & 0 deletions

```diff
@@ -28,3 +28,10 @@ key.pem
 
 # Ignore MCP config for Opencode client
 opencode.json
+
+# Python cache files
+__pycache__/
+*.pyc
+*.pyo
+*.pyd
+.Python
```

eslint.config.mjs

Lines changed: 1 addition & 1 deletion

```diff
@@ -2,7 +2,7 @@ import apifyTypeScriptConfig from '@apify/eslint-config/ts.js';
 
 // eslint-disable-next-line import/no-default-export
 export default [
-    { ignores: ['**/dist'] }, // Ignores need to happen first
+    { ignores: ['**/dist', '**/.venv', 'evals/**'] }, // Ignores need to happen first
     ...apifyTypeScriptConfig,
     {
         languageOptions: {
```

evals/README.md

Lines changed: 122 additions & 0 deletions

# MCP tool selection evaluation

Evaluates MCP server tool selection. Phoenix is used only for storing and visualizing results.

## CI Workflow

The evaluation workflow runs automatically on:
- **Master branch pushes** - for production evaluations (saves CI cycles)
- **PRs with the `validated` label** - for testing evaluation changes before merging

To trigger evaluations on a PR, add the `validated` label to your pull request.
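The label can also be added programmatically via GitHub's REST API. A minimal sketch (OWNER, REPO, and the PR number 123 are placeholders; the commit itself includes no such helper):

```typescript
// Sketch only: adds the 'validated' label to a PR through GitHub's
// "add labels to an issue" endpoint. OWNER/REPO/123 are placeholder values.
await fetch('https://api.github.com/repos/OWNER/REPO/issues/123/labels', {
    method: 'POST',
    headers: {
        Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
        Accept: 'application/vnd.github+json',
    },
    body: JSON.stringify({ labels: ['validated'] }),
});
```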
## Two evaluation methods

1. **Exact match** (`tool-exact-match`) - binary tool-name validation (sketched below)
2. **LLM judge** (`tool-selection-llm`) - Phoenix classifier with a structured prompt
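The exact-match check reduces to a set comparison on tool names. A minimal sketch, assuming the runner exposes the expected and actually-called tool names (the real evaluator lives elsewhere in this commit):

```typescript
// Sketch of the binary exact-match idea: score 1.0 only when the set of
// called tool names equals the expected set, otherwise 0.0.
function toolExactMatch(expectedTools: string[], calledTools: string[]): number {
    const expected = new Set(expectedTools);
    const called = new Set(calledTools);
    if (expected.size !== called.size) return 0.0;
    for (const tool of expected) {
        if (!called.has(tool)) return 0.0;
    }
    return 1.0;
}
```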
## Why OpenRouter?

One unified API for Gemini, Claude, and GPT models; no separate per-provider integrations needed.
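OpenRouter exposes an OpenAI-compatible API, so a single client can drive all three model families. A sketch assuming the `openai` npm package (the eval runner's actual client code is not shown in this diff):

```typescript
// Sketch: one OpenAI-compatible client pointed at OpenRouter serves
// Gemini, Claude, and GPT models alike; only the model string changes.
import OpenAI from 'openai';

const client = new OpenAI({
    apiKey: process.env.OPENROUTER_API_KEY,
    baseURL: process.env.OPENROUTER_BASE_URL, // e.g. https://openrouter.ai/api/v1
});

const completion = await client.chat.completions.create({
    model: 'google/gemini-2.5-flash', // or 'anthropic/claude-3.5-haiku', 'openai/gpt-4o-mini'
    messages: [{ role: 'user', content: 'Search for weather MCP server' }],
});
```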
## Judge model

- model: `openai/gpt-4o-mini`
- prompt: structured eval with context + tool definitions
- output: "correct"/"incorrect" → 1.0/0.0 score (and explanation), mapped as sketched below
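The label-to-score conversion is deliberately simple; a hypothetical sketch of the mapping (the parsing code itself is not part of this README):

```typescript
// Sketch: map the judge's single-word verdict to the numeric score
// stored alongside the explanation.
function judgeLabelToScore(label: string): number {
    return label.trim().toLowerCase() === 'correct' ? 1.0 : 0.0;
}
```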
## Config (`config.ts`)

```typescript
MODELS_TO_EVALUATE = ['openai/gpt-4o-mini', 'anthropic/claude-3.5-haiku', 'google/gemini-2.5-flash']
PASS_THRESHOLD = 0.7
TOOL_SELECTION_EVAL_MODEL = 'openai/gpt-4o-mini'
```
## Setup

```bash
export PHOENIX_BASE_URL="your_url"
export PHOENIX_API_KEY="your_key"
export OPENROUTER_API_KEY="your_key"
export OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

npm ci
npm run evals:create-dataset # one-time
npm run evals:run
```
## Test cases

40+ cases across 7 tool categories: `fetch-actor-details`, `search-actors`, `apify-slash-rag-web-browser`, `search-apify-docs`, `call-actor`, `get-actor-output`, `fetch-apify-docs`
## Output

- Phoenix dashboard with detailed results
- Console: pass/fail per model + evaluator
- Exit code: 0 = success, 1 = failure (see the sketch after this list)
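The exit code is what gates CI. A hypothetical sketch of how per-model/evaluator scores could roll up into that pass/fail decision (not the commit's actual reporting code):

```typescript
// Sketch: average scores per model+evaluator, print pass/fail against the
// threshold, and return the exit code CI keys off.
import { PASS_THRESHOLD } from './config.js';

function summarize(scores: Record<string, number[]>): number {
    let failed = false;
    for (const [name, values] of Object.entries(scores)) {
        const avg = values.reduce((a, b) => a + b, 0) / values.length;
        console.log(`${name}: ${avg.toFixed(2)} (${avg >= PASS_THRESHOLD ? 'pass' : 'fail'})`);
        if (avg < PASS_THRESHOLD) failed = true;
    }
    return failed ? 1 : 0;
}

process.exitCode = summarize({
    'openai/gpt-4o-mini / tool-exact-match': [1.0, 1.0, 0.0],
});
```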
## Adding new test cases

### How to contribute?

1. **Create an issue or PR** with your new test cases
2. **Explain why it should pass** - add a `reference` field with clear reasoning
3. **Test locally** before submitting
4. **Submit** - we'll review and merge
### Test case structure

Each test case in `test-cases.json` has this structure:

```json
{
    "id": "unique-test-id",
    "category": "tool-category",
    "query": "user query text",
    "expectedTools": ["tool-name"],
    "reference": "explanation of why this should pass (optional)",
    "context": [/* conversation history (optional) */]
}
```
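Expressed as a TypeScript type, for readers who prefer one: a hypothetical interface inferred from the examples in this README, not code shipped in the commit:

```typescript
// Hypothetical shape of a test-cases.json entry, inferred from the
// JSON examples in this file; field names follow the structure above.
interface TestCaseMessage {
    role: 'user' | 'assistant' | 'tool_use' | 'tool_result';
    content?: string;
    tool?: string;
    input?: Record<string, unknown>;
    tool_use_id?: number;
}

interface TestCase {
    id: string;
    category: string;
    query: string;
    expectedTools: string[];
    reference?: string;
    context?: TestCaseMessage[];
}
```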
### Simple examples

**Basic tool selection:**
```json
{
    "id": "fetch-actor-details-1",
    "category": "fetch-actor-details",
    "query": "What are the details of apify/instagram-scraper?",
    "expectedTools": ["fetch-actor-details"]
}
```

**With reference explanation:**
```json
{
    "id": "fetch-actor-details-3",
    "category": "fetch-actor-details",
    "query": "Scrape details of apify/google-search-scraper",
    "expectedTools": ["fetch-actor-details"],
    "reference": "It should call the fetch-actor-details with the actor ID 'apify/google-search-scraper' and return the actor's documentation."
}
```
### Advanced examples with context

**Multi-step conversation flow:**
```json
{
    "id": "weather-mcp-search-then-call-1",
    "category": "flow",
    "query": "Now, use the mcp to check the weather in Prague, Czechia?",
    "expectedTools": ["call-actor"],
    "context": [
        { "role": "user", "content": "Search for weather MCP server" },
        { "role": "assistant", "content": "I'll help you to do that" },
        { "role": "tool_use", "tool": "search-actors", "input": {"search": "weather mcp", "limit": 5} },
        { "role": "tool_result", "tool_use_id": 12, "content": "Tool 'search-actors' successful, Actor found: jiri.spilka/weather-mcp-server" }
    ]
}
```

evals/config.ts

Lines changed: 114 additions & 0 deletions

```typescript
/**
 * Configuration for Apify MCP Server evaluations.
 */

import { readFileSync } from 'node:fs';
import { dirname, join } from 'node:path';
import { fileURLToPath } from 'node:url';

// Read version from test-cases.json
function getTestCasesVersion(): string {
    const currentFilename = fileURLToPath(import.meta.url);
    const currentDirname = dirname(currentFilename);
    const testCasesPath = join(currentDirname, 'test-cases.json');
    const testCasesContent = readFileSync(testCasesPath, 'utf-8');
    const testCases = JSON.parse(testCasesContent);
    return testCases.version;
}

// Evaluator names
export const EVALUATOR_NAMES = {
    TOOLS_EXACT_MATCH: 'tool-exact-match',
    TOOL_SELECTION_LLM: 'tool-selection-llm',
} as const;

export type EvaluatorName = typeof EVALUATOR_NAMES[keyof typeof EVALUATOR_NAMES];

// Models to evaluate
export const MODELS_TO_EVALUATE = [
    'openai/gpt-4o-mini',
    'anthropic/claude-3.5-haiku',
    'google/gemini-2.5-flash',
];

export const TOOL_SELECTION_EVAL_MODEL = 'openai/gpt-4o-mini';

export const PASS_THRESHOLD = 0.7;

// Dataset name is versioned so a new test-cases.json version yields a new dataset
export const DATASET_NAME = `mcp_server_dataset_v${getTestCasesVersion()}`;

// System prompt
export const SYSTEM_PROMPT = 'You are a helpful assistant';

export const TOOL_CALLING_BASE_TEMPLATE = `
You are an evaluation assistant evaluating user queries and tool calls to
determine whether a tool was chosen and if it was the right tool.

The tool calls have been generated by a separate agent, and chosen from the list of
tools provided below. It is your job to decide whether that agent chose
the right tool to call.

[BEGIN DATA]
************
[User's previous interaction with the assistant]: {{context}}
[User query]: {{query}}
************
[LLM decided to call these tools]: {{tool_calls}}
[LLM response]: {{llm_response}}
************
[END DATA]

DECISION: [correct or incorrect]
EXPLANATION: [Super short explanation of why the tool choice was correct or incorrect]

Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.

"correct" means the correct tool call was chosen, the correct parameters
were extracted from the query, the tool call generated is runnable and correct,
and that no outside information not present in the query was used
in the generated query.

"incorrect" means that the chosen tool was not correct
or that the tool signature includes parameter values that don't match
the formats specified in the tool signatures below.

You must not use any outside information or make assumptions.
Base your decision solely on the information provided in [BEGIN DATA] ... [END DATA],
the [Tool Definitions], and the [Reference instructions] (if provided).
Reference instructions are optional and are intended to help you understand the use case and make your decision.

[Reference instructions]: {{reference}}

[Tool definitions]: {{tool_definitions}}
`;

export function getRequiredEnvVars(): Record<string, string | undefined> {
    return {
        PHOENIX_BASE_URL: process.env.PHOENIX_BASE_URL,
        PHOENIX_API_KEY: process.env.PHOENIX_API_KEY,
        OPENROUTER_API_KEY: process.env.OPENROUTER_API_KEY,
        OPENROUTER_BASE_URL: process.env.OPENROUTER_BASE_URL,
    };
}

// Removes newlines and trims whitespace. Useful for Authorization header values
// because CI secrets sometimes include trailing newlines or quotes.
export function sanitizeHeaderValue(value?: string): string | undefined {
    if (value == null) return value;
    return value.replace(/[\r\n]/g, '').trim().replace(/^"|"$/g, '');
}

export function validateEnvVars(): boolean {
    const envVars = getRequiredEnvVars();
    const missing = Object.entries(envVars)
        .filter(([, value]) => !value)
        .map(([key]) => key);

    if (missing.length > 0) {
        // eslint-disable-next-line no-console
        console.error(`Missing required environment variables: ${missing.join(', ')}`);
        return false;
    }

    return true;
}
```
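A hypothetical usage sketch from the eval runner's side (the runner itself is a separate file in this commit; this only illustrates the intended call pattern for these exports):

```typescript
// Sketch: validate the environment up front, sanitize a CI secret before
// building an Authorization header, and reference the versioned dataset.
import { DATASET_NAME, PASS_THRESHOLD, sanitizeHeaderValue, validateEnvVars } from './config.js';

if (!validateEnvVars()) {
    process.exit(1); // fail fast in CI when a secret is missing
}

// CI secrets can carry stray quotes or trailing newlines, so sanitize first.
const headers = {
    Authorization: `Bearer ${sanitizeHeaderValue(process.env.PHOENIX_API_KEY)}`,
};

console.log(`Uploading results to Phoenix dataset "${DATASET_NAME}" (pass threshold ${PASS_THRESHOLD})`);
```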
