UPSTREAM PR #18334: webui: add MCP (Model Context Protocol) support#679

Open
loci-dev wants to merge 18 commits into main from upstream-PR18334-branch_ochafik-web-ui-mcp

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18334

Summary

This PR adds MCP (Model Context Protocol) support to llama-server's web UI.
Only servers using the stdio transport are supported for now; llama-server spawns and manages these processes on behalf of the frontend, which connects to them through one WebSocket per conversation per server.


Features

  • WebSocket server on HTTP port + 1 for real-time MCP communication
  • MCP bridge that spawns and manages MCP server subprocesses (stdio or docker)
  • Frontend UI for MCP server management with tool exploration
  • Tool calling integration in chat completions with streaming support
  • Auto-reconnection with exponential backoff for resilience
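The reconnection behavior above can be sketched as exponential backoff with jitter. The constants here (base delay, cap) are hypothetical, not taken from the PR's frontend code:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter. `base` and `cap` are
    illustrative values, not the actual constants used by the web UI."""
    ceiling = min(cap, base * (2 ** attempt))  # grows geometrically, then saturates
    return random.uniform(0, ceiling)          # jitter avoids thundering-herd reconnects

# Deterministic ceilings for the first few attempts: 0.5, 1.0, 2.0, 4.0, ...
ceilings = [min(30.0, 0.5 * 2 ** a) for a in range(8)]
print(ceilings)
```

Jitter matters here because every open conversation holds its own WebSocket, so a server restart would otherwise trigger a synchronized burst of reconnects.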

New CLI Option

# Use default config location (~/.llama.cpp/mcp.json)
./llama-server -m model.gguf

# Or specify config path
./llama-server -m model.gguf --mcp-config /path/to/mcp.json

Configuration Example

{
  "mcpServers": {
    "brave-search": {
      "command": "npx",
      "args": ["-y", "@brave/brave-search-mcp-server", "--transport", "stdio"],
      "env": {
        "BRAVE_API_KEY": "... get your key at https://api.search.brave.com/app/keys ..."
      }
    },
    "python": {
      "command": "uvx",
      "args": ["mcp-run-python", "--deps", "numpy,pandas,pydantic,requests,httpx,sympy,aiohttp", "stdio"],
      "env": {}
    }
  }
}

Architecture

  • server-ws.cpp/h - WebSocket server implementation
  • server-mcp-bridge.cpp/h - Routes WebSocket connections to MCP subprocesses
  • server-mproc.cpp/h - Cross-platform subprocess management
  • server-mcp.h - MCP protocol type definitions
  • Frontend: MCP service, stores, and UI components

API Endpoints

  • GET /mcp/servers - List available MCP servers
  • GET /mcp/ws-port - Get WebSocket port number
  • WS /mcp?server=<name> - WebSocket connection (on HTTP port + 1)
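A client can derive the WebSocket endpoint from the "HTTP port + 1" convention above. This is a minimal sketch; in practice the port should be read from `GET /mcp/ws-port` rather than computed:

```python
from urllib.parse import urlsplit

def mcp_ws_url(http_base: str, server_name: str) -> str:
    """Build the MCP WebSocket URL from llama-server's HTTP base URL,
    assuming the PR's convention that the WebSocket listens on HTTP port + 1."""
    parts = urlsplit(http_base)
    host = parts.hostname
    port = parts.port or 80
    return f"ws://{host}:{port + 1}/mcp?server={server_name}"

print(mcp_ws_url("http://127.0.0.1:8080", "python"))
# ws://127.0.0.1:8081/mcp?server=python
```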

Test plan

  • Unit tests added (tools/server/tests/unit/test_mcp.py)
  • Manual testing with @modelcontextprotocol/server-filesystem
  • Test tool calling in chat UI
  • Test connect/disconnect in MCP picker
  • Verify WebSocket reconnection after server restart

🤖 Generated with Claude Code

ochafik and others added 15 commits December 24, 2025 00:38
Add JSON-RPC 2.0 type definitions and MCP server configuration
structures for the Model Context Protocol implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add mcp_process class for spawning and managing MCP server subprocesses
with bidirectional stdio communication. Handles process lifecycle,
environment variables for unbuffered output, and cross-platform support.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
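The bidirectional stdio communication this commit describes can be sketched in Python with a stand-in child process that merely echoes one line back (a real MCP server would answer with protocol responses). The `PYTHONUNBUFFERED` variable mirrors the commit's note about forcing unbuffered output:

```python
import json
import os
import subprocess
import sys

# Stand-in for an MCP server: echo exactly one line from stdin to stdout.
child = subprocess.Popen(
    [sys.executable, "-u", "-c", "import sys; sys.stdout.write(sys.stdin.readline())"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
    # Merge, don't replace, the environment; force unbuffered output in the child.
    env={**os.environ, "PYTHONUNBUFFERED": "1"},
)
request = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "ping"}) + "\n"
reply, _ = child.communicate(request)  # write request, close stdin, read reply
print(json.loads(reply)["method"])  # ping
```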
Add custom WebSocket server using raw sockets (no external library).
Implements RFC 6455 handshake, frame parsing, masking, and message
handling. Runs on HTTP port + 1 to avoid conflicts with httplib.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
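The RFC 6455 handshake mentioned above hinges on computing the `Sec-WebSocket-Accept` header: SHA-1 of the client's key concatenated with a fixed GUID, base64-encoded. A sketch of that step (not the PR's C++ code):

```python
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed by RFC 6455

def ws_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for an RFC 6455 handshake."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Test vector from RFC 6455 section 1.3:
print(ws_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```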
Add server_mcp_bridge class that routes WebSocket messages to MCP
server subprocesses. Manages per-connection state, configuration
loading with hot-reload, and JSON-RPC 2.0 message forwarding.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Integrate MCP bridge and WebSocket server into main server:
- Add --mcp-config CLI argument for configuration path
- Add /mcp/servers and /mcp/ws-port HTTP endpoints
- Register WebSocket event handlers for MCP
- Update server-http to properly join thread on stop

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add TypeScript types for MCP protocol (JSON-RPC 2.0) and WebSocket
service for communicating with MCP servers:
- MCP types: tool definitions, JSON-RPC request/response/notification
- McpService: WebSocket client with auto-reconnect and request timeout
- API types: tool call interfaces for chat completions
- Vite config: proxy WebSocket connections to MCP port
- ESLint: allow underscore-prefixed unused args (common convention)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add reactive Svelte 5 stores for managing MCP state:
- mcpStore: Global MCP connection state, tool discovery, tool calling
- conversationMcpStore: Per-conversation MCP server enable/disable

Uses SvelteMap/SvelteSet for proper Svelte 5 reactivity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add components for displaying MCP tool calls and results:
- ToolCallBlock: Collapsible display of tool call with arguments/results
- ToolResultDisplay: Format and render tool execution results
- tool-results.ts: Utility functions for parsing tool result messages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add UI components for managing MCP server connections:
- ChatFormActionMcp: Server selector dropdown in chat input
- McpPanel: Full panel for viewing connected servers and tools

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Integrate MCP tool calling into the chat flow:
- chat.ts: Add tool parameter injection and MCP tool execution
- chat.svelte.ts: Track tool calls, results, and processing state
- ChatMessageAssistant: Display tool calls with status and duration
- ChatMessages: Build tool result map, filter tool result messages
- ChatScreen: Wire up tool result event handlers
- Add duration guard for negative timestamp differences

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Python tests for MCP functionality:
- test_mcp_servers_endpoint: Test /mcp/servers HTTP endpoint
- test_mcp_ws_port_endpoint: Test /mcp/ws-port HTTP endpoint
- test_mcp_initialize_handshake: Test MCP JSON-RPC initialization
- test_mcp_tools_list: Test tools/list method
- test_mcp_tool_call: Test tools/call method

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
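The initialization handshake exercised by `test_mcp_initialize_handshake` is a JSON-RPC 2.0 request like the sketch below. The field values (`protocolVersion`, `clientInfo`) are illustrative, not copied from the tests; consult the MCP specification for the exact shape a given server expects:

```python
import json

init = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2025-06-18",  # illustrative MCP revision date
        "capabilities": {},
        "clientInfo": {"name": "llama-server-webui", "version": "0.0.1"},
    },
}
wire = json.dumps(init)  # one JSON object per message over the WebSocket
assert json.loads(wire)["method"] == "initialize"
```

After the server's `initialize` response, the client can issue `tools/list` and `tools/call` requests, matching the remaining tests above.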
Add documentation and example configuration for MCP:
- README: Document MCP configuration, usage, and WebSocket port
- mcp_config.example.json: Example config with filesystem and brave-search
- Rebuild webui bundle with MCP support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Force popover to open above (side="top") for consistent positioning
- Search input at bottom (flips based on popover position)
- Small solid dots for connection status (green/gray)
- Hover row to reveal connect/disconnect action icons
- Remove Connect All/Disconnect All footer buttons
- Fix double X button in search input (hide native WebKit clear)
- Add tooltips for status and actions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Don't show "Streaming..." status while arguments are being streamed.
Only show "Calling tool..." when actually waiting for MCP server response.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Reorder assistant message layout so tool call blocks appear
before the model badge and statistics.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ochafik and others added 3 commits December 24, 2025 01:15
- Remove unused parameter names from MCP HTTP lambda handlers
- Remove conditional websocket import (it's a required dependency)

Fixes unused-parameter warning and pyright type-check errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Adds optional "cwd" field to mcp.json server configurations to set the
working directory for stdio MCP servers.

- Add cwd field to mcp_server_config struct
- Unix: call chdir() before execvp() in child process
- Windows: pass lpCurrentDirectory to CreateProcessA()
- Update mcp_config.example.json with usage example

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
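A hedged sketch of what a server entry using the new `cwd` field might look like; the `filesystem` server and path are hypothetical, only the `cwd` key itself comes from this commit:

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "."],
      "cwd": "/path/to/project",
      "env": {}
    }
  }
}
```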

loci-review bot commented Dec 24, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #679 - MCP Support Integration

Overview

This PR introduces Model Context Protocol support to llama-server through new WebSocket infrastructure, subprocess management, and CLI argument extensions. The changes add approximately 1,200 ns to application startup through argument parsing modifications in common/arg.cpp, specifically affecting the common_params_parser_init function. The implementation adds a new --mcp-config argument and associated validation logic while maintaining backward compatibility.

Key Findings

Argument Parsing Impact:
The most significant changes occur in lambda functions within common_params_parser_init. Lambda 104 shows a 1,164 microsecond increase in response time across llama-tts and llama-cvector-generator binaries, though its self-execution time decreased by 7 ns. Lambda 106 exhibits a 215 ns increase in self-execution time, while Lambda 125 adds 59 ns. These modifications stem from new MCP configuration validation logic, string operations for path handling, and environment variable processing. The call depth analysis reveals 20,000+ level stacks, indicating template instantiation overhead in the argument parsing framework.

Inference Performance:
No functions in the inference pipeline (llama_decode, llama_encode, llama_tokenize) show measurable changes. The modifications are isolated to initialization code paths, leaving the token generation performance unaffected. Expected tokens per second remains unchanged as the core inference functions maintain their baseline performance characteristics.

Power Consumption:
Binary-level analysis shows minimal impact: llama-tts decreased by 730 nJ (0.28%), while llama-cvector-generator increased by 112 nJ (0.044%). All other binaries show zero measurable change. The power consumption variations align with the startup-only nature of the modifications, confirming that runtime energy efficiency remains stable.

Code Changes:
The implementation adds MCP-specific argument handling through lambda closures that validate configuration file paths, parse environment variables, and integrate with the existing common_params structure. The new --mcp-config argument follows established patterns for file-based configuration, using the same validation approach as other file arguments like --model and --lora. The WebSocket server components (server-ws.cpp/h, server-mcp-bridge.cpp/h) operate independently of the inference pipeline, maintaining separation between tool calling infrastructure and model execution.


loci-review bot commented Dec 24, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #679 MCP Support

Overview

This PR adds Model Context Protocol (MCP) support to llama-server, introducing WebSocket infrastructure, subprocess management, and frontend UI components. The changes span 42 files with 5,857 additions and 127 deletions.

Key Findings

Performance-Critical Areas Impact

Inference Pipeline Functions:
No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize, llama_model_load_from_file, llama_kv_cache operations). The inference pipeline remains unmodified, resulting in zero impact on tokens per second for model inference workloads.

Argument Parsing Degradation:
The analysis identified severe response time increases in common_params_parser_init lambda operators within build.bin.llama-tts and build.bin.llama-cvector-generator:

  • Lambda at arg.cpp:2760:2769 increased from 1,811 ns to 1,165,496 ns (absolute change: +1,164,000 ns)
  • Lambda at arg.cpp:3193:3195 increased from 24 ns to 13,464 ns (absolute change: +13,440 ns)
  • Lambda at arg.cpp:2788:2790 increased from 22 ns to 237 ns (absolute change: +215 ns)

These functions handle CLI argument parsing during server initialization. The degradation affects startup time only, not runtime inference performance. The PR adds one new argument (--mcp-config) which contributes minimally to the existing systemic complexity in the argument parser infrastructure.

Server Infrastructure Changes:
New components introduced:

  • server-ws.cpp (835 lines): WebSocket server with custom SHA-1 and frame parsing, adds 1-2 ms per connection handshake
  • server-mproc.cpp (606 lines): Cross-platform subprocess management with 1-5 ms fork/exec overhead per MCP server
  • server-mcp-bridge.cpp (278 lines): Connection routing with sub-millisecond message forwarding

The server-http.cpp stop() method now includes thread.join() for proper cleanup, adding synchronization wait during shutdown only.

Power Consumption Analysis

Power consumption changes are minimal across all binaries:

  • build.bin.llama-tts: 260,709 nJ → 259,979 nJ (-0.28%, -730 nJ decrease)
  • build.bin.llama-cvector-generator: 255,801 nJ → 255,913 nJ (+0.04%, +112 nJ increase)
  • build.bin.llama-run: 223,113 nJ → 223,112 nJ (-0.00%)
  • Core libraries (libggml-base.so, libggml-cpu.so, libggml.so, libmtmd.so): No change

The power consumption variations are within measurement noise and do not indicate meaningful efficiency changes.

Architecture Impact

The PR introduces parallel execution paths for MCP tool calling that operate independently of the inference engine. WebSocket connections and MCP subprocesses run in separate threads, ensuring tool invocations do not block model inference. Memory overhead is approximately 50 KB base plus 25 KB per active MCP server connection.

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 15838f1 to 006b713 on December 24, 2025 at 23:08
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 07aff19 to 1f52e52 on January 2, 2026 at 20:09