Ultra-fast semantic tool filtering for MCP (Model Context Protocol) servers using embedding similarity. Reduce your tool context from 1000+ tools down to the most relevant 10-20 tools in under 10ms.
- Lightning Fast: <10ms filtering latency for 1000+ tools with built-in optimizations
- Performance Optimized: 6-8x faster dot product, smart top-K selection, true LRU cache
- Semantic Understanding: uses embeddings for intelligent tool matching
- Zero Runtime Dependencies: only requires an embedding provider API
- Flexible Input: accepts chat completion messages or raw strings
- Smart Caching: caches embeddings and context for optimal performance
- Configurable: tune scoring thresholds, top-K, and always-include tools
- Performance Metrics: built-in timing for optimization
```bash
npm install @portkey-ai/mcp-tool-filter
```

```typescript
import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';
// 1. Initialize the filter (choose embedding provider)
// Option A: Local Embeddings (RECOMMENDED for low latency < 5ms)
const filter = new MCPToolFilter({
embedding: {
provider: 'local',
}
});
// Option B: API Embeddings (for highest accuracy)
const filter = new MCPToolFilter({
embedding: {
provider: 'openai',
apiKey: process.env.OPENAI_API_KEY,
}
});
// 2. Load your MCP servers (one-time setup)
await filter.initialize(mcpServers);
// 3. Filter tools based on context
const result = await filter.filter(
"Search my emails for the Q4 budget discussion"
);
// 4. Use the filtered tools in your LLM request
console.log(result.tools); // Top 20 most relevant tools
console.log(result.metrics.totalTime); // e.g. ~2ms for local, ~500ms for API
```

Local embeddings run entirely on your machine.

Pros:
- Ultra-fast: 1-5ms latency
- Private: No data sent to external APIs
- Free: No API costs
- Offline: Works without internet
Cons:
- Slightly lower accuracy than API models
- First initialization downloads model (~25MB)
```typescript
const filter = new MCPToolFilter({
embedding: {
provider: 'local',
model: 'Xenova/all-MiniLM-L6-v2', // Optional: default model
quantized: true, // Optional: use quantized model for speed (default: true)
}
});
```

Available Models:
- `Xenova/all-MiniLM-L6-v2` (default) - 384 dimensions, very fast
- `Xenova/all-MiniLM-L12-v2` - 384 dimensions, more accurate
- `Xenova/bge-small-en-v1.5` - 384 dimensions, good balance
- `Xenova/bge-base-en-v1.5` - 768 dimensions, higher quality
Performance:
- Initialization: 100ms-4s (one-time, downloads model)
- Filter request: 1-5ms
- Cached request: <1ms
For highest accuracy, use OpenAI or other API providers:
```typescript
const filter = new MCPToolFilter({
embedding: {
provider: 'openai',
apiKey: process.env.OPENAI_API_KEY,
model: 'text-embedding-3-small', // Optional
dimensions: 384, // Optional: match local model for fair comparison
}
});
```

Pros:
- Highest accuracy: 5-15% better than local
- Easy to switch models
- No local resources needed
Cons:
- Slow: 400-800ms per request
- Costs money: ~$0.02 per 1M tokens
- Data sent to an external API
- Requires an internet connection
Performance:
- Initialization: 200ms-60s (depends on tool count)
- Filter request: 400-800ms
- Cached request: 1-3ms
| Aspect | Local | API | Winner |
|---|---|---|---|
| Speed | 1-5ms | 400-800ms | Local (~200x faster) |
| Accuracy | Good (85-90%) | Best (100%) | API |
| Cost | Free | ~$0.02/1M tokens | Local |
| Privacy | Fully local | Data sent to API | Local |
| Offline | Works offline | Needs internet | Local |
| Setup | Zero config | Needs API key | Local |
See TRADEOFFS.md for a detailed analysis.
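If you want to choose the provider at deployment time, one simple approach is to branch on an environment variable. This is only a sketch: the `buildFilter` helper and the `EMBEDDING_PROVIDER` variable are illustrative, not part of the library.

```typescript
import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';

// Hypothetical helper: use the OpenAI provider only when explicitly requested
// and a key is available; otherwise fall back to fast, free local embeddings.
function buildFilter(): MCPToolFilter {
  const apiKey = process.env.OPENAI_API_KEY;
  if (process.env.EMBEDDING_PROVIDER === 'openai' && apiKey) {
    return new MCPToolFilter({
      embedding: { provider: 'openai', apiKey, model: 'text-embedding-3-small' },
    });
  }
  return new MCPToolFilter({ embedding: { provider: 'local' } });
}
```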
The library expects an array of MCP servers with the following structure:
```json
[
{
"id": "gmail",
"name": "Gmail MCP Server",
"description": "Email management tools",
"categories": ["email", "communication"],
"tools": [
{
"name": "search_gmail_messages",
"description": "Search and find email messages in Gmail inbox. Use when user wants to find, search, look up emails...",
"keywords": ["email", "search", "inbox", "messages"],
"category": "email-search",
"inputSchema": {
"type": "object",
"properties": {
"q": { "type": "string" }
}
}
}
]
}
]
```

Required Fields:
- `id`: Unique identifier for the server
- `name`: Human-readable server name
- `tools`: Array of tool definitions
- `name` (per tool): Unique tool name
- `description` (per tool): Rich description of what the tool does and when to use it
Optional but Recommended:
- `description`: Server-level description
- `categories`: Array of category tags for hierarchical filtering
- `keywords`: Array of synonym/related terms for better matching
- `category`: Tool-level category
- `inputSchema`: JSON schema for parameters (parameter names are used for matching)
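For reference, the expected input maps roughly to the following TypeScript shape. This is a sketch derived from the fields above; the library's actual exported type names may differ.

```typescript
// Sketch of the server/tool shape; field names follow the JSON example above.
interface MCPToolDefinition {
  name: string;                            // required: unique tool name
  description: string;                     // required: rich description with use cases
  keywords?: string[];                     // optional: synonyms for better matching
  category?: string;                       // optional: tool-level category
  inputSchema?: Record<string, unknown>;   // optional: JSON schema for parameters
}

interface MCPServerDefinition {
  id: string;                              // required: unique server identifier
  name: string;                            // required: human-readable server name
  description?: string;                    // optional: server-level description
  categories?: string[];                   // optional: category tags for hierarchical filtering
  tools: MCPToolDefinition[];              // required: tool definitions
}
```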
1. Rich Descriptions: Write detailed descriptions with use cases

   ```json
   "description": "Search emails in Gmail. Use when user wants to find, lookup, or retrieve messages, correspondence, or mail."
   ```

2. Add Keywords: Include synonyms and variations

   ```json
   "keywords": ["email", "mail", "inbox", "messages", "correspondence"]
   ```

3. Mention Use Cases: Explicitly state when to use the tool (a combined example follows below)

   ```json
   "description": "... Use when user wants to draft, compose, write, or prepare an email to send later."
   ```
`MCPToolFilter` is the main class for tool filtering.

```typescript
new MCPToolFilter(config: MCPToolFilterConfig)
```

Config Options:

```typescript
{
embedding: {
// Local embeddings (recommended)
provider: 'local',
model?: string, // Default: 'Xenova/all-MiniLM-L6-v2'
quantized?: boolean, // Default: true
// OR API embeddings
provider: 'openai' | 'voyage' | 'cohere',
apiKey: string,
model?: string, // Default: 'text-embedding-3-small'
dimensions?: number, // Default: 1536 (or 384 for local)
baseURL?: string, // For custom endpoints
},
defaultOptions?: {
topK?: number, // Default: 20
minScore?: number, // Default: 0.3
contextMessages?: number, // Default: 3
alwaysInclude?: string[], // Always include these tools
exclude?: string[], // Never include these tools
maxContextTokens?: number, // Default: 500
},
includeServerDescription?: boolean, // Default: false (see below)
debug?: boolean // Enable debug logging
}
```

About `includeServerDescription`:
When enabled, this option includes the MCP server description in the tool embeddings, providing additional context about the domain/category of tools.
```typescript
// Enable server descriptions in embeddings
const filter = new MCPToolFilter({
embedding: { provider: 'local' },
includeServerDescription: true // Default: false
});
```

Tradeoffs:
- Helps: general intent queries like "manage my local files" (+25% improvement)
- Hurts: specific tool queries like "Execute this SQL query" (-50% degradation)
- Neutral: the overall impact across query types is roughly neutral (0% change)
Recommendation: Keep this disabled (default: false) unless your use case primarily involves high-level intent queries. See examples/benchmark-server-description.ts for detailed benchmarks.
`initialize(servers)`: Initialize the filter with MCP servers. This precomputes and caches all tool embeddings.
Note: Call this once during startup. It's an async operation that may take a few seconds depending on the number of tools.
```typescript
await filter.initialize(servers);
```

`filter(input, options?)`: Filter tools based on the input context.
Input Types:
```typescript
// String input
await filter.filter("Search my emails about the project");
// Chat messages
await filter.filter([
{ role: 'user', content: 'What meetings do I have today?' },
{ role: 'assistant', content: 'Let me check your calendar.' }
]);
```

Options (all optional, override defaults):

```typescript
{
topK?: number, // Max tools to return
minScore?: number, // Minimum similarity score (0-1)
contextMessages?: number, // How many recent messages to use
alwaysInclude?: string[], // Tool names to always include
exclude?: string[], // Tool names to exclude
maxContextTokens?: number, // Max context size
}
```

Returns:

```typescript
{
tools: ScoredTool[], // Filtered and ranked tools
metrics: {
totalTime: number, // Total time in ms
embeddingTime: number, // Time to embed context
similarityTime: number, // Time to compute similarities
toolsEvaluated: number, // Total tools evaluated
}
}
```

`getStats()`: Get statistics about the filter state.

```typescript
const stats = filter.getStats();
// {
// initialized: true,
// toolCount: 25,
// cacheSize: 5,
// embeddingDimensions: 1536
// }
```

`clearCache()`: Clear the context embedding cache.
```typescript
filter.clearCache();
```

The library includes several performance optimizations out of the box:
- Loop-Unrolled Dot Product - vector similarity computation is 6-8x faster through CPU pipeline optimization (see the sketch below)
- Smart Top-K Selection - a hybrid algorithm uses the fast built-in sort for typical workloads and switches to heap-based selection for 500+ tools
- True LRU Cache - cache eviction based on access patterns, not just insertion order
- In-Place Operations - reduced memory allocations through in-place vector normalization
- Set-Based Lookups - O(1) exclusion checking instead of O(n) array scanning
These optimizations are automatic and transparent - no configuration needed!
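For the curious, the loop-unrolling idea behind the first optimization looks roughly like this. It is a simplified sketch of the general technique, not the library's actual code.

```typescript
// Accumulate four independent partial sums per iteration so the CPU can
// pipeline the multiply-adds; a scalar tail loop handles leftover elements.
function dotProductUnrolled(a: Float32Array, b: Float32Array): number {
  const len = Math.min(a.length, b.length);
  const limit = len - (len % 4);
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (let i = 0; i < limit; i += 4) {
    s0 += a[i] * b[i];
    s1 += a[i + 1] * b[i + 1];
    s2 += a[i + 2] * b[i + 2];
    s3 += a[i + 3] * b[i + 3];
  }
  let sum = s0 + s1 + s2 + s3;
  for (let i = limit; i < len; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```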
Typical performance for 1000 tools:
```
Building context:        <1ms
Embedding API call:      3-5ms  (cached: 0ms)
Similarity computation:  1-2ms  (6-8x faster with optimizations)
Sorting/filtering:       <1ms   (hybrid algorithm)
-----------------------------
Total:                   5-9ms
```
1. Use Smaller Embeddings: 512 or 1024 dimensions for faster computation

   ```typescript
   embedding: { provider: 'openai', model: 'text-embedding-3-small', dimensions: 512 /* faster than 1536 */ }
   ```

2. Reduce Context Size: fewer messages = faster embedding

   ```typescript
   defaultOptions: { contextMessages: 2 /* instead of 3-5 */, maxContextTokens: 300 }
   ```

3. Leverage Caching: identical contexts reuse cached embeddings (0ms)

4. Tune topK: request fewer tools if you don't need 20 (a combined example follows below)

   ```typescript
   await filter.filter(input, { topK: 10 });
   ```
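Combined, a latency-focused setup might look like the following. The values are illustrative starting points using the documented options, not recommendations from the library.

```typescript
const filter = new MCPToolFilter({
  embedding: {
    provider: 'openai',
    apiKey: process.env.OPENAI_API_KEY,
    model: 'text-embedding-3-small',
    dimensions: 512,        // smaller vectors, faster similarity math
  },
  defaultOptions: {
    topK: 10,               // return fewer tools
    contextMessages: 2,     // embed less context
    maxContextTokens: 300,
  },
});
```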
Micro-benchmarks showing optimization improvements:
- Dot Product (1536 dims): 0.001ms vs 0.006ms (6x faster)
- Vector Normalization: 0.003ms vs 0.006ms (2x faster)
- Top-K Selection (<500 tools): uses the optimized built-in sort
- Top-K Selection (500+ tools): O(n log k) heap-based selection
- LRU Cache Access: true access-order tracking (see the sketch below)
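"True" LRU means entries are re-ranked on every read, not only when inserted. A minimal Map-based sketch of the pattern (illustrative only, not the library's implementation):

```typescript
// JavaScript Maps iterate in insertion order, so deleting and re-inserting a
// key on every read keeps the least-recently-used entry first for eviction.
class LRUCache<K, V> {
  private map = new Map<K, V>();
  constructor(private maxSize: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key);
    if (value === undefined) return undefined;
    this.map.delete(key);       // move the key to the most-recently-used position
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // Evict the least-recently-used entry (first key in iteration order)
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }
}
```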
See the existing benchmark examples for end-to-end performance testing:
```bash
npx ts-node examples/benchmark.ts
```

Using with Portkey:

```typescript
import Portkey from 'portkey-ai';
import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';
const portkey = new Portkey({ apiKey: '...' });
const filter = new MCPToolFilter({ /* ... */ });
await filter.initialize(mcpServers);
// Filter tools based on conversation
const { tools } = await filter.filter(messages);
// Convert to OpenAI tool format
const openaiTools = tools.map(t => ({
type: 'function',
function: {
name: t.toolName,
description: t.tool.description,
parameters: t.tool.inputSchema,
}
}));
// Make LLM request with filtered tools
const completion = await portkey.chat.completions.create({
model: 'gpt-4',
messages: messages,
tools: openaiTools,
});
```

Using with LangChain:

```typescript
import { ChatOpenAI } from 'langchain/chat_models/openai';
import { MCPToolFilter } from '@portkey-ai/mcp-tool-filter';
const filter = new MCPToolFilter({ /* ... */ });
await filter.initialize(mcpServers);
// Create a custom tool selector
async function selectTools(messages) {
const { tools } = await filter.filter(messages);
return tools.map(t => convertToLangChainTool(t));
}
// Use in your agent
const model = new ChatOpenAI();
const tools = await selectTools(messages);
const response = await model.invoke(messages, { tools });
```

Using in an Express server:

```typescript
// Recommended: Initialize once at startup
let filterInstance: MCPToolFilter;
async function getFilter() {
if (!filterInstance) {
filterInstance = new MCPToolFilter({ /* ... */ });
await filterInstance.initialize(mcpServers);
}
return filterInstance;
}
// Use in request handlers
app.post('/chat', async (req, res) => {
const filter = await getFilter();
const result = await filter.filter(req.body.messages);
// ... use filtered tools
});
```

Performance on various tool counts (M1 Max):
Local Embeddings (Xenova/all-MiniLM-L6-v2):
| Tools | Initialization | Filter (Cold) | Filter (Cached) |
|---|---|---|---|
| 10 | ~100ms | 2ms | <1ms |
| 100 | ~500ms | 3ms | <1ms |
| 500 | ~2s | 4ms | 1ms |
| 1000 | ~4s | 5ms | 1ms |
| 5000 | ~20s | 8ms | 2ms |
API Embeddings (OpenAI text-embedding-3-small):
| Tools | Initialization | Filter (Cold) | Filter (Cached) |
|---|---|---|---|
| 10 | ~200ms | 500ms | 1ms |
| 100 | ~1.5s | 550ms | 2ms |
| 500 | ~6s | 600ms | 2ms |
| 1000 | ~12s | 650ms | 3ms |
| 5000 | ~60s | 800ms | 4ms |
Key Takeaways:
- Local embeddings are 200-300x faster for filter requests
- Local embeddings meet the <50ms target easily
- Local embeddings have no API costs
- API embeddings may have slightly higher accuracy
- Both benefit significantly from caching
Note: Initialization is a one-time cost. Choose local embeddings for low latency, API embeddings for maximum accuracy.
Use Local Embeddings when:
- You need ultra-low latency (<10ms)
- Privacy is important (no external API calls)
- You want zero API costs
- You need offline operation
- "Good enough" accuracy is acceptable
Use API Embeddings when:
- You need maximum accuracy
- You have good internet connectivity
- API costs are not a concern
- You're dealing with complex or nuanced queries
Recommendation: Start with local embeddings. Only switch to API if accuracy is insufficient.
Compare performance for your use case:
```bash
npx ts-node examples/test-local-embeddings.ts
```

This will benchmark both providers and show you:
- Initialization time
- Average filter time
- Cached filter time
- Speed comparison
To see detailed timing logs for each request, enable debug mode:
```typescript
const filter = new MCPToolFilter({
embedding: { /* ... */ },
debug: true // Enable detailed timing logs
});
```

This will output detailed logs for each filter request:

```
=== Starting filter request ===
[1/5] Options merged: 0.12ms
[2/5] Context built (156 chars): 0.34ms
[3/5] Cache MISS (lookup: 0.08ms)
    → Embedding generated: 1247.56ms
[4/5] Similarities computed: 1.23ms (25 tools, 0.049ms/tool)
[5/5] Tools selected & ranked: 0.15ms (5 tools returned)
=== Total filter time: 1249.48ms ===
Breakdown: merge=0.12ms, context=0.34ms, cache=0.08ms, embedding=1247.56ms, similarity=1.23ms, selection=0.15ms
```
Each filter request logs 5 steps:
1. Options Merging (`merge`): merge the provided options with the defaults
2. Context Building (`context`): build the context string from the input messages
3. Cache Lookup & Embedding (`cache` + `embedding`):
   - Cache HIT: 0ms embedding time (reuses the cached embedding)
   - Cache MISS: calls the embedding API (typically 200-2000ms depending on provider)
4. Similarity Computation (`similarity`): compute cosine similarity for all tools (also shows the per-tool average time)
5. Tool Selection (`selection`): filter by score and select the top-K tools
See examples/test-timings.ts for a complete example:
```bash
export OPENAI_API_KEY=your-key-here
npx ts-node examples/test-timings.ts
```

This will run multiple filter requests showing:
- Cache miss vs cache hit performance
- Different query types
- Chat message context handling
Every filter request returns detailed metrics:
```typescript
const result = await filter.filter(input);
console.log(result.metrics);
// {
// totalTime: 1249.48, // Total request time in ms
// embeddingTime: 1247.56, // Time spent on embedding API
// similarityTime: 1.23, // Time computing similarities
// toolsEvaluated: 25 // Number of tools evaluated
// }
```

You can log these metrics for monitoring and alerting:

```typescript
const result = await filter.filter(messages);
// Log metrics for monitoring
logger.info('Tool filter performance', {
totalTime: result.metrics.totalTime,
embeddingTime: result.metrics.embeddingTime,
cached: result.metrics.embeddingTime === 0,
toolsReturned: result.tools.length,
});
// Alert if too slow
if (result.metrics.totalTime > 5000) {
logger.warn('Slow filter request', result.metrics);
}
```

For very large tool sets, use hierarchical filtering:

```typescript
// Stage 1: Filter by server categories
const relevantServers = mcpServers.filter(server =>
server.categories?.some(cat => userIntent.includes(cat))
);
// Stage 2: Initialize the filter with only the relevant servers, then filter their tools
await filter.initialize(relevantServers);
const result = await filter.filter(messages);
```

Combine embedding similarity with keyword matching:

```typescript
const { tools } = await filter.filter(input);
// Boost tools with exact keyword matches
const boostedTools = tools.map(tool => {
const hasKeywordMatch = tool.tool.keywords?.some(kw =>
input.toLowerCase().includes(kw.toLowerCase())
);
return {
...tool,
score: hasKeywordMatch ? tool.score * 1.2 : tool.score
};
}).sort((a, b) => b.score - a.score);
```

Always include certain essential tools:

```typescript
const filter = new MCPToolFilter({
// ...
defaultOptions: {
alwaysInclude: [
'web_search', // Always useful
'conversation_search', // Access to context
],
}
});
```

Problem: First filter call is slow.
Solution: The first request has to generate a context embedding (a few milliseconds with local embeddings, a few hundred milliseconds with an API provider). Subsequent calls with the same context reuse the cached embedding and are much faster.
```typescript
// Warm up the cache
await filter.filter("hello"); // ~5ms
await filter.filter("hello"); // ~1ms (cached)Problem: Wrong tools are being selected.
Solutions:
- Improve tool descriptions with more keywords and use cases
- Lower the `minScore` threshold
- Increase `topK` to include more tools
- Add important tools to `alwaysInclude` (see the example below)
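For example, relaxing the defaults on a single request with the documented options (`search_gmail_messages` is the example tool from the structure above):

```typescript
const result = await filter.filter(messages, {
  minScore: 0.2,                              // admit more candidates than the 0.3 default
  topK: 30,                                   // return more tools than the 20 default
  alwaysInclude: ['search_gmail_messages'],   // pin tools you know are needed
});
```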
Problem: High memory usage with many tools.
Solution: Use smaller embedding dimensions:
```typescript
embedding: {
  dimensions: 512, // instead of 1536
}
```

This reduces memory by ~66% with minimal accuracy loss.
License: MIT
Contributions welcome! Please open an issue or PR.
- GitHub Issues: github.com/portkey-ai/mcp-tool-filter
- Email: [email protected]