diff --git a/docs/toolhive/concepts/tool-optimization.mdx b/docs/toolhive/concepts/tool-optimization.mdx index aa8d4e3a..35b5c85b 100644 --- a/docs/toolhive/concepts/tool-optimization.mdx +++ b/docs/toolhive/concepts/tool-optimization.mdx @@ -257,3 +257,4 @@ how to configure them: - [Customize tools (Kubernetes)](../guides-k8s/customize-tools.mdx) - [MCPToolConfig CRD reference](../reference/crd-spec.md) - [Virtual MCP Server tool aggregation](../guides-vmcp/tool-aggregation.mdx) +- [Optimize tool discovery in vMCP](../guides-vmcp/optimizer.mdx) diff --git a/docs/toolhive/concepts/vmcp.mdx b/docs/toolhive/concepts/vmcp.mdx index 62155767..4e53d216 100644 --- a/docs/toolhive/concepts/vmcp.mdx +++ b/docs/toolhive/concepts/vmcp.mdx @@ -31,6 +31,8 @@ vMCP delivers four key benefits: 3. **Improve security**: Centralized authentication and authorization with a two-boundary model 4. **Enable reusability**: Define workflows once, use them everywhere +5. **Optimize tool discovery**: Reduce token usage by replacing all tool + definitions with two lightweight search-and-call primitives ## Key capabilities @@ -165,4 +167,5 @@ teams managing multiple MCP servers. - [Configure authentication](../guides-vmcp/authentication.mdx) - [Tool aggregation and conflict resolution](../guides-vmcp/tool-aggregation.mdx) - [Composite tools and workflows](../guides-vmcp/composite-tools.mdx) +- [Optimize tool discovery](../guides-vmcp/optimizer.mdx) - [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx) diff --git a/docs/toolhive/guides-vmcp/configuration.mdx b/docs/toolhive/guides-vmcp/configuration.mdx index 7ff87537..3c2f6505 100644 --- a/docs/toolhive/guides-vmcp/configuration.mdx +++ b/docs/toolhive/guides-vmcp/configuration.mdx @@ -279,6 +279,8 @@ Backend discovery guide. 
## Next steps +- [Optimize tool discovery](./optimizer.mdx) by adding an `embeddingServerRef` + to reduce token usage across many backends - Review [scaling and performance guidance](./scaling-and-performance.mdx) for resource planning - Discover your deployed MCP servers automatically using the @@ -292,6 +294,7 @@ Backend discovery guide. - [Scaling and Performance](./scaling-and-performance.mdx) - [Backend discovery modes](./backend-discovery.mdx) - [Tool aggregation](./tool-aggregation.mdx) +- [Optimize tool discovery](./optimizer.mdx) - [Composite tools](./composite-tools.mdx) - [Authentication](./authentication.mdx) - [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx) diff --git a/docs/toolhive/guides-vmcp/intro.mdx b/docs/toolhive/guides-vmcp/intro.mdx index 236aba1a..87c4f630 100644 --- a/docs/toolhive/guides-vmcp/intro.mdx +++ b/docs/toolhive/guides-vmcp/intro.mdx @@ -36,6 +36,10 @@ for details on the current limitations. - **Centralized authentication**: Single sign-on with per-backend token exchange - **Composite workflows**: Multi-step operations across backend MCP servers with parallel execution, approval gates, and error handling +- **Tool optimization**: Replace all individual tool definitions with two + lightweight primitives (`find_tool` and `call_tool`) to reduce token usage and + improve tool selection. See [Optimize tool discovery](./optimizer.mdx) and the + underlying [concepts](../concepts/tool-optimization.mdx) ## When to use vMCP @@ -46,6 +50,8 @@ for details on the current limitations. - You have centralized authentication and authorization requirements - You need reusable workflow definitions - You want to aggregate external SaaS MCP servers with internal tools +- You want to reduce token usage and improve tool selection accuracy across many + backends with the [optimizer](./optimizer.mdx) ### Not needed @@ -87,9 +93,21 @@ flowchart TB 5. 
Clients connect to the VirtualMCPServer endpoint and see a unified view of all tools from both local and remote backends +## Optimize tool discovery + +As the number of aggregated backends grows, clients receive a large number of +tool definitions that consume tokens and can degrade tool selection accuracy. +The vMCP optimizer addresses this by replacing all individual tool definitions +with two lightweight primitives (`find_tool` and `call_tool`) and using hybrid +semantic and keyword search to surface only the most relevant tools per request. +To enable the optimizer, add an `embeddingServerRef` to your VirtualMCPServer +resource. See [Optimize tool discovery](./optimizer.mdx) for the full setup +guide. + ## Related information - [Quickstart: Virtual MCP Server](./quickstart.mdx) - [Understanding Virtual MCP Server](../concepts/vmcp.mdx) +- [Optimize tool discovery](./optimizer.mdx) - [Scaling and Performance](./scaling-and-performance.mdx) - [Proxy remote MCP servers](../guides-k8s/remote-mcp-proxy.mdx) diff --git a/docs/toolhive/guides-vmcp/optimizer.mdx b/docs/toolhive/guides-vmcp/optimizer.mdx new file mode 100644 index 00000000..fe159aaf --- /dev/null +++ b/docs/toolhive/guides-vmcp/optimizer.mdx @@ -0,0 +1,317 @@ +--- +title: Optimize tool discovery +description: + Enable the optimizer in vMCP to reduce token usage and improve tool selection + across aggregated backends. +--- + +When Virtual MCP Server (vMCP) aggregates many backend MCP servers, the total +number of tools exposed to clients can grow quickly. The optimizer addresses +this by filtering tools per request, reducing token usage and improving tool +selection accuracy. + +For the desktop/CLI approach using the MCP Optimizer container, see the +[MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx). This guide covers the +Kubernetes operator approach using VirtualMCPServer and EmbeddingServer CRDs. 
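+To make "filtering tools per request" concrete, here is a minimal,
+illustrative model of a hybrid search pass: each candidate tool gets a blended
+keyword-plus-semantic score, candidates beyond a distance threshold are
+dropped, and the result set is capped. This is a sketch only; the field names,
+scoring formula, and distance-to-score mapping are assumptions for
+illustration, not the optimizer's actual implementation (which uses a SQLite
+full-text index and an embedding server).

```python
# Illustrative model of per-request tool filtering. All names and formulas
# here are assumptions for the sketch, not the optimizer's real code.

def rank_tools(candidates, semantic_ratio=0.5, distance_threshold=1.0, max_tools=8):
    """Blend keyword and semantic scores, filter by distance, cap the results.

    Each candidate is a dict with:
      'name'              - tool name
      'keyword_score'     - 0..1, higher is a better keyword match
      'semantic_distance' - 0 (identical) .. 2 (completely unrelated)
    """
    # The distance threshold is applied before the max_tools cap.
    kept = [c for c in candidates if c["semantic_distance"] <= distance_threshold]

    def blended(c):
        # Map distance (0..2) onto a 0..1 similarity score, then blend:
        # ratio 0.0 = all keyword, 1.0 = all semantic.
        semantic_score = 1.0 - c["semantic_distance"] / 2.0
        return semantic_ratio * semantic_score + (1.0 - semantic_ratio) * c["keyword_score"]

    kept.sort(key=blended, reverse=True)
    return [c["name"] for c in kept[:max_tools]]

candidates = [
    {"name": "create_issue", "keyword_score": 0.9, "semantic_distance": 0.4},
    {"name": "list_repos", "keyword_score": 0.2, "semantic_distance": 0.7},
    {"name": "send_email", "keyword_score": 0.1, "semantic_distance": 1.6},
]
# 'send_email' exceeds the default distance threshold of 1.0 and is dropped.
print(rank_tools(candidates))
```

+Because the threshold filter runs before the cap, a strict distance threshold
+can return fewer tools than the maximum allows; the tuning guidance later in
+this guide covers how to adjust both.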
+ +## Benefits + +- **Reduced token usage**: Only relevant tools are included in context, not the + entire toolset +- **Improved tool selection**: The right tools surface for each query. With + fewer tools to reason over, agents are more likely to choose correctly + +## How it works + +1. You send a prompt that requires tool assistance +2. The AI calls `find_tool` with keywords extracted from the prompt +3. vMCP performs hybrid semantic and keyword search across all backend tools +4. Only the most relevant tools (up to 8 by default) are returned +5. The AI calls `call_tool` to execute the selected tool, and vMCP routes the + request to the appropriate backend + +```mermaid +flowchart TB + subgraph vmcpGroup["VirtualMCPServer"] + direction TB + vmcp["vMCP (optimizer enabled)"] + end + subgraph embedding["EmbeddingServer"] + direction TB + tei["Text Embeddings Inference"] + end + subgraph backends["MCPGroup backends"] + direction TB + mcp1["MCP server"] + mcp2["MCP server"] + mcp3["MCP server"] + end + + client(["Client"]) <-- "find_tool / call_tool" --> vmcpGroup + vmcp <-. "semantic search" .-> embedding + vmcp <-. "discovers / routes" .-> backends +``` + +:::info[How search works internally] + +The optimizer uses an internal SQLite database for both keyword search (using +full-text search) and storing semantic vectors. Keyword search runs locally +against this database; semantic search uses vectors generated by an embedding +server. You can control how results from these two sources are blended — see the +[parameter reference](#parameter-reference) for details. + +::: + +## Quick start + +### Step 1: Create an EmbeddingServer + +Create an EmbeddingServer with default settings. 
This deploys a text embeddings +inference (TEI) server using the `BAAI/bge-small-en-v1.5` model: + +```yaml title="embedding-server.yaml" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: EmbeddingServer +metadata: + name: my-embedding + namespace: toolhive-system +spec: {} +``` + +:::tip + +Wait for the EmbeddingServer to reach the `Running` phase before proceeding. The +first startup may take a few minutes while the model downloads. + +```bash +kubectl get embeddingserver my-embedding -n toolhive-system -w +``` + +::: + +### Step 2: Add the embedding reference to VirtualMCPServer + +Update your existing VirtualMCPServer to include `embeddingServerRef`. **This is +the only change needed to enable the optimizer.** When you set +`embeddingServerRef`, the operator automatically enables the optimizer with +sensible defaults. You only need to add an explicit `optimizer` block if you +want to [tune the parameters](#tune-the-optimizer). + +```yaml title="VirtualMCPServer resource" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: VirtualMCPServer +metadata: + name: my-vmcp + namespace: toolhive-system +spec: + # highlight-start + embeddingServerRef: + name: my-embedding + # highlight-end + config: + groupRef: my-group + incomingAuth: + type: anonymous +``` + +### Step 3: Verify + +Check that the VirtualMCPServer is ready: + +```bash +kubectl get virtualmcpserver my-vmcp -n toolhive-system +``` + +Look for `READY: True` in the output. Once ready, clients connecting to the vMCP +endpoint see only `find_tool` and `call_tool` instead of the full backend +toolset. + +## EmbeddingServer resource + +The EmbeddingServer CRD manages the lifecycle of a TEI server. An empty +`spec: {}` uses all defaults. The two most important fields you can customize +are: + +- **`model`**: The Hugging Face embedding model to use. The default + (`BAAI/bge-small-en-v1.5`) is the tested and recommended model. 
You can + substitute any embedding model available on Hugging Face — see the + [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) to compare + options. +- **`image`**: The container image for + [text-embeddings-inference](https://github.com/huggingface/text-embeddings-inference) + (TEI). The default is the CPU-only image + (`ghcr.io/huggingface/text-embeddings-inference:cpu-latest`). Swap this for a + CUDA-enabled image if you have GPU nodes available. + +For the complete field reference, see the +[EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver). + +:::warning[ARM64 compatibility] + +The default TEI CPU images depend on Intel MKL, which is x86_64-only. No +official ARM64 images exist yet. On ARM64 nodes (including Apple Silicon with +kind), you can run the amd64 image under emulation as a workaround. + +First, pull the amd64 image and load it into your cluster: + +```bash +docker pull --platform linux/amd64 \ + ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 +kind load docker-image \ + ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 +``` + +The `kind load` command is specific to kind. For other cluster distributions, +use the equivalent image-loading mechanism (for example, `ctr images import` for +containerd, or push the image to a registry your cluster can pull from). + +Then, pin the image in your EmbeddingServer so the operator uses the pre-pulled +tag instead of the default `cpu-latest`: + +```yaml title="embedding-server.yaml" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: EmbeddingServer +metadata: + name: my-embedding + namespace: toolhive-system +spec: + image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.7 +``` + +Native ARM64 support is in progress upstream. Track the +[TEI GitHub repository](https://github.com/huggingface/text-embeddings-inference) +for updates. 
+ +::: + +## Tune the optimizer + +To customize optimizer behavior, add the `optimizer` block under `spec.config` +in your VirtualMCPServer resource: + +```yaml title="VirtualMCPServer resource" +spec: + config: + groupRef: my-group + # highlight-start + optimizer: + embeddingServiceTimeout: 30s + maxToolsToReturn: 8 + hybridSearchSemanticRatio: '0.5' + semanticDistanceThreshold: '1.0' + # highlight-end +``` + +### Parameter reference + +| Parameter | Description | Default | +| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | +| `embeddingServiceTimeout` | HTTP request timeout for calls to the embedding service | `30s` | +| `maxToolsToReturn` | Maximum number of tools returned per search (1-50) | `8` | +| `hybridSearchSemanticRatio` | Balance between semantic and keyword search. `0.0` = all keyword, `1.0` = all semantic. Default gives equal weight to both. | `"0.5"` | +| `semanticDistanceThreshold` | Maximum distance from the search term for semantic results. `0` = identical, `2` = completely unrelated. Results beyond this threshold are filtered out. | `"1.0"` | + +:::note + +`hybridSearchSemanticRatio` and `semanticDistanceThreshold` are string-encoded +floats (for example, `"0.5"` not `0.5`). This is a Kubernetes CRD limitation, as +CRDs do not support float types portably. + +::: + +:::info[EmbeddingServer is always required] + +Even if you set `hybridSearchSemanticRatio` to `"0.0"` (all keyword search), the +optimizer still requires a configured EmbeddingServer. The EmbeddingServer won't +be used at runtime when the semantic ratio is `0.0`, but the configuration must +be present due to how the optimizer is wired internally. + +::: + +:::tip[Tuning guidance] + +The defaults are well-tested and work for most use cases. 
If you do need to
+adjust them:
+
+- **Lower `semanticDistanceThreshold`** (for example, `"0.6"`) for higher
+  precision: only very close matches are returned
+- **Raise `semanticDistanceThreshold`** (for example, `"1.4"`) for higher
+  recall: broader matches are included
+- **Increase `maxToolsToReturn`** if the AI frequently cannot find the right
+  tool; decrease it to save tokens
+- **Adjust `hybridSearchSemanticRatio`** toward `"1.0"` if tool names are not
+  descriptive, or toward `"0.0"` if exact keyword matching is more useful
+- `semanticDistanceThreshold` filtering is applied before the `maxToolsToReturn`
+  cap. A low threshold can filter out candidates before the cap takes effect, so
+  you may need to raise the threshold if too few results are returned
+
+:::
+
+## Complete example
+
+This example shows a full configuration with all available options, including
+high availability for the embedding server, persistent model caching, and tuned
+optimizer parameters.
+
+The EmbeddingServer runs two replicas with resource limits and a persistent
+volume for model caching, so restarts don't re-download the model:
+
+```yaml title="embedding-server-full.yaml"
+apiVersion: toolhive.stacklok.dev/v1alpha1
+kind: EmbeddingServer
+metadata:
+  name: full-embedding
+  namespace: toolhive-system
+spec:
+  replicas: 2
+  resources:
+    requests:
+      cpu: '500m'
+      memory: '512Mi'
+    limits:
+      cpu: '2'
+      memory: '1Gi'
+  modelCache:
+    enabled: true
+    storageSize: 5Gi
+```
+
+The VirtualMCPServer uses a shorter embedding timeout (15s) because the
+EmbeddingServer is co-located in the same cluster, so request latency is low.
Increase this value if +the embedding service is remote or under high load: + +```yaml title="vmcp-with-optimizer.yaml" +apiVersion: toolhive.stacklok.dev/v1alpha1 +kind: VirtualMCPServer +metadata: + name: full-vmcp + namespace: toolhive-system +spec: + embeddingServerRef: + name: full-embedding + config: + groupRef: my-tools + optimizer: + embeddingServiceTimeout: 15s + maxToolsToReturn: 10 + hybridSearchSemanticRatio: '0.6' + semanticDistanceThreshold: '0.8' + incomingAuth: + type: oidc + oidcConfig: + type: inline + inline: + issuer: https://auth.example.com + audience: vmcp-example +``` + +## Related information + +- [MCP Optimizer tutorial](../tutorials/mcp-optimizer.mdx) — desktop/CLI setup +- [Optimizing LLM context](../concepts/tool-optimization.mdx) — background on + tool filtering and context pollution +- [Configure vMCP servers](./configuration.mdx) +- [EmbeddingServer CRD specification](../reference/crd-spec.md#apiv1alpha1embeddingserver) +- [Virtual MCP Server overview](../concepts/vmcp.mdx) — conceptual overview of + vMCP +- [VirtualMCPServer CRD specification](../reference/crd-spec.md#apiv1alpha1virtualmcpserver) diff --git a/docs/toolhive/guides-vmcp/tool-aggregation.mdx b/docs/toolhive/guides-vmcp/tool-aggregation.mdx index c167257e..728d6915 100644 --- a/docs/toolhive/guides-vmcp/tool-aggregation.mdx +++ b/docs/toolhive/guides-vmcp/tool-aggregation.mdx @@ -17,6 +17,14 @@ tools with the same name (for example, both GitHub and Jira have a `create_issue` tool), a conflict resolution strategy determines how to handle the collision. +:::tip + +When aggregating many backends, the total number of exposed tools can grow +quickly. Consider enabling the [optimizer](./optimizer.mdx) to reduce token +usage and improve tool selection accuracy. 
+ +::: + ## Conflict resolution strategies ### Prefix strategy (default) @@ -232,4 +240,5 @@ With this configuration, tools from each backend are prefixed: ## Related information - [VirtualMCPServer configuration reference](./configuration.mdx) +- [Optimize tool discovery](./optimizer.mdx) - [Customize MCP server tools](../guides-k8s/customize-tools.mdx) diff --git a/docs/toolhive/tutorials/mcp-optimizer.mdx b/docs/toolhive/tutorials/mcp-optimizer.mdx index c4e35ba9..756ff0f7 100644 --- a/docs/toolhive/tutorials/mcp-optimizer.mdx +++ b/docs/toolhive/tutorials/mcp-optimizer.mdx @@ -328,4 +328,6 @@ Now that you've set up MCP Optimizer, consider exploring these next steps: ## Related information - [MCP Optimizer UI guide](../guides-ui/mcp-optimizer.mdx) +- [Optimize tool discovery in vMCP](../guides-vmcp/optimizer.mdx) — Kubernetes + operator approach - [Organize MCP servers into groups](../guides-ui/group-management.mdx) diff --git a/sidebars.ts b/sidebars.ts index 0594a4fe..e1f929d0 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -174,6 +174,7 @@ const sidebars: SidebarsConfig = { 'toolhive/guides-vmcp/authentication', 'toolhive/guides-vmcp/tool-aggregation', 'toolhive/guides-vmcp/composite-tools', + 'toolhive/guides-vmcp/optimizer', 'toolhive/guides-vmcp/failure-handling', 'toolhive/guides-vmcp/telemetry-and-metrics', 'toolhive/guides-vmcp/audit-logging',