Merged
Commits
39 commits
f1bca73
temp commit for branch switching
yuneng-jiang Feb 2, 2026
899bafb
adding team mappings UI
yuneng-jiang Feb 2, 2026
65c62ff
Adding tests
yuneng-jiang Feb 2, 2026
edfe239
reset_spend endpoint
yuneng-jiang Feb 2, 2026
16f0b49
team setting disable global guardrail fix
yuneng-jiang Feb 3, 2026
2645d25
fixing tests
yuneng-jiang Feb 3, 2026
cd154c4
Merge pull request #20307 from BerriAI/litellm_ui_disable_global_guar…
yuneng-jiang Feb 3, 2026
62993b5
Merge pull request #20305 from BerriAI/litellm_reset_spend_endpoint
yuneng-jiang Feb 3, 2026
8eba641
Merge pull request #20299 from BerriAI/litellm_ui_allowed_routes_drop…
yuneng-jiang Feb 3, 2026
32b1ff7
option to hide community engagement buttons
yuneng-jiang Feb 3, 2026
1b9631d
Add blog post: Achieving Sub-Millisecond Proxy Overhead (#20309)
AlexsanderHamir Feb 3, 2026
cf734cb
Migrate Default Team settings to use reusable Model Select
yuneng-jiang Feb 3, 2026
079f49f
[Feat] - MCP Semantic Filtering Support (#20296)
ishaan-jaff Feb 3, 2026
0ef506a
Litellm docs mcp filtering semantic (#20316)
ishaan-jaff Feb 3, 2026
4e8c6d1
fix linting
ishaan-jaff Feb 3, 2026
c8f9af1
fix mypy lint
ishaan-jaff Feb 3, 2026
333419b
Add documentation correctly for nova sonic
Sameerlite Feb 3, 2026
5cfcf67
[Feat] /chat/completions - allow using OpenAI style tools for `web_se…
ishaan-jaff Feb 3, 2026
24a4979
Merge pull request #20320 from BerriAI/litellm_nova-sonic_doc
Sameerlite Feb 3, 2026
5aa8725
docs Tracing Tools
ishaan-jaff Feb 3, 2026
7ae9804
docs fix
ishaan-jaff Feb 3, 2026
2984832
Merge pull request #20310 from BerriAI/litellm_ui_def_team_settings
yuneng-jiang Feb 3, 2026
9202870
Merge pull request #20308 from BerriAI/litellm_ui_community_buttons
yuneng-jiang Feb 3, 2026
7dd0248
Revert "fix: models loadbalancing billing issue by filter (#18891) (#…
Sameerlite Feb 3, 2026
86ae627
Fix litellm/tests/test_litellm/proxy/_experimental/mcp_server/test_se…
Sameerlite Feb 3, 2026
b379fb6
Fix code quality tests
Sameerlite Feb 3, 2026
ecb6413
Revert "add missing indexes on VerificationToken table (#20040)"
Sameerlite Feb 3, 2026
23f662e
Merge pull request #20328 from BerriAI/litellm_tuesday_cicd_release
Sameerlite Feb 3, 2026
1b1854b
Revert "Litellm tuesday cicd release"
Sameerlite Feb 3, 2026
80acd4c
Merge pull request #20330 from BerriAI/revert-20328-litellm_tuesday_c…
Sameerlite Feb 3, 2026
eb8f4d3
Revert "fix: models loadbalancing billing issue by filter (#18891) (#…
Sameerlite Feb 3, 2026
9a6bafe
Fix litellm/tests/test_litellm/proxy/_experimental/mcp_server/test_se…
Sameerlite Feb 3, 2026
017b78d
Fix code quality tests
Sameerlite Feb 3, 2026
fae0554
Revert "add missing indexes on VerificationToken table (#20040)"
Sameerlite Feb 3, 2026
31cdffd
Revert "fix: prevent error when max_fallbacks exceeds available model…
Sameerlite Feb 3, 2026
21e95c7
Fix litellm_security_tests
Sameerlite Feb 3, 2026
793a7fd
Merge pull request #20333 from BerriAI/litellm_tuesday_cicd_release_f…
Sameerlite Feb 3, 2026
47c5366
bump litellm 1.81.7
Sameerlite Feb 3, 2026
070d501
Merge pull request #20336 from BerriAI/litellm_bump_version_1.81.7
Sameerlite Feb 3, 2026
1 change: 1 addition & 0 deletions ci_cd/security_scans.sh
@@ -154,6 +154,7 @@ run_grype_scans() {
"CVE-2025-15367" # No fix available yet
"CVE-2025-12781" # No fix available yet
"CVE-2025-11468" # No fix available yet
"CVE-2026-1299" # Python 3.13 email module header injection - not applicable, LiteLLM doesn't use BytesGenerator for email serialization
)

# Build JSON array of allowlisted CVE IDs for jq
92 changes: 92 additions & 0 deletions docs/my-website/blog/sub_millisecond_proxy_overhead/index.md
@@ -0,0 +1,92 @@
---
slug: sub-millisecond-proxy-overhead
title: "Achieving Sub-Millisecond Proxy Overhead"
date: 2026-02-02T10:00:00
authors:
- name: Alexsander Hamir
title: "Performance Engineer, LiteLLM"
url: https://www.linkedin.com/in/alexsander-baptista/
image_url: https://github.com/AlexsanderHamir.png
- name: Krrish Dholakia
title: "CEO, LiteLLM"
url: https://www.linkedin.com/in/krish-d/
image_url: https://pbs.twimg.com/profile_images/1298587542745358340/DZv3Oj-h_400x400.jpg
- name: Ishaan Jaff
title: "CTO, LiteLLM"
url: https://www.linkedin.com/in/reffajnaahsi/
image_url: https://pbs.twimg.com/profile_images/1613813310264340481/lz54oEiB_400x400.jpg
description: "Our Q1 performance target and architectural direction for achieving sub-millisecond proxy overhead on modest hardware."
tags: [performance, architecture]
hide_table_of_contents: false
---

![Sidecar architecture: Python control plane vs. sidecar hot path](https://raw.githubusercontent.com/AlexsanderHamir/assets/main/Screenshot%202026-02-02%20172554.png)

# Achieving Sub-Millisecond Proxy Overhead

## Introduction

Our Q1 performance target is to aggressively move toward sub-millisecond proxy overhead on a single instance with 4 CPUs and 8 GB of RAM, and to continue pushing that boundary over time. Our broader goal is to make LiteLLM inexpensive to deploy, lightweight, and fast. This post outlines the architectural direction behind that effort.

Proxy overhead refers to the latency introduced by LiteLLM itself, independent of the upstream provider.

To measure it, we run the same workload directly against the provider and through LiteLLM at identical QPS (for example, 1,000 QPS) and compare the latency delta. To reduce noise, the load generator, LiteLLM, and a mock LLM endpoint all run on the same machine, ensuring the difference reflects proxy overhead rather than network latency.
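The measurement above boils down to comparing two latency samples taken under identical load. A minimal sketch of that comparison (the helper and the sample numbers are illustrative, not our actual benchmark harness):

```python
import statistics

def proxy_overhead_ms(direct_ms, proxied_ms):
    """Estimate proxy overhead as the delta between median latencies.

    direct_ms:  latencies (ms) from hitting the mock provider directly
    proxied_ms: latencies (ms) for the same workload routed through the proxy
    """
    return statistics.median(proxied_ms) - statistics.median(direct_ms)

# Both runs execute at identical QPS on the same machine, so the delta
# isolates proxy overhead from network and provider latency.
direct = [4.8, 5.0, 5.1, 5.2, 4.9]
proxied = [5.7, 5.9, 6.1, 5.8, 6.0]
print(round(proxy_overhead_ms(direct, proxied), 1))  # prints 0.9
```

Medians (or percentiles) are preferred over means here because a handful of outliers at high QPS would otherwise dominate the delta.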

---

## Where We're Coming From

Under the same benchmark originally conducted by [TensorZero](https://www.tensorzero.com/docs/gateway/benchmarks), LiteLLM previously failed at around 1,000 QPS.

That is no longer the case. Today, LiteLLM can be stress-tested at 1,000 QPS with no failures and can scale to 5,000 QPS, again without failures, on a single 4-CPU, 8-GB RAM instance.

This establishes a more up-to-date baseline and provides useful context as we continue working on proxy overhead and overall performance.

---

## Design Choice

Achieving sub-millisecond proxy overhead with a Python-based system requires being deliberate about where work happens.

Python is a strong fit for flexibility and extensibility: provider abstraction, configuration-driven routing, and a rich callback ecosystem. These are areas where development velocity and correctness matter more than raw throughput.

At higher request rates, however, certain classes of work become expensive when executed inside the Python process on every request. Rather than rewriting LiteLLM or introducing complex deployment requirements, we adopt an optional **sidecar architecture**.

This architectural change is how we intend to make LiteLLM **permanently fast**. While it supports our near-term performance targets, it is a long-term investment.

Python continues to own:

- Request validation and normalization
- Model and provider selection
- Callbacks and integrations

The sidecar owns **performance-critical execution**, such as:

- Efficient request forwarding
- Connection reuse and pooling
- Enforcing timeouts and limits
- Aggregating high-frequency metrics

This separation allows each component to focus on what it does best: Python acts as the control plane, while the sidecar handles the hot path.

---

### Why the Sidecar Is Optional

The sidecar is intentionally **optional**.

This allows us to ship it incrementally, validate it under real-world workloads, and avoid making it a hard dependency before it is fully battle-tested across all LiteLLM features.

Just as importantly, this ensures that self-hosting LiteLLM remains simple. The sidecar is bundled and started automatically, requires no additional infrastructure, and can be disabled entirely. From a user's perspective, LiteLLM continues to behave like a single service.

As of today, the sidecar is an optimization, not a requirement.

---

## Conclusion

Sub-millisecond proxy overhead is not achieved through a single optimization, but through architectural changes.

By keeping Python focused on orchestration and extensibility, and offloading performance-critical execution to a sidecar, we establish a foundation for making LiteLLM **permanently fast over time**, even on modest hardware such as a 1-CPU, 2-GB RAM instance, while keeping deployment and self-hosting simple.

This work extends beyond Q1, and we will continue sharing benchmarks and updates as the architecture evolves.
158 changes: 158 additions & 0 deletions docs/my-website/docs/mcp_semantic_filter.md
@@ -0,0 +1,158 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# MCP Semantic Tool Filter

Automatically filter MCP tools by semantic relevance. When you have many MCP tools registered, LiteLLM semantically matches the user's query against tool descriptions and sends only the most relevant tools to the LLM.

## How It Works

Tool search shifts tool selection from a prompt-engineering problem to a retrieval problem. Instead of injecting a large static list of tools into every prompt, the semantic filter:

1. Builds a semantic index of all available MCP tools on startup
2. On each request, semantically matches the user's query against tool descriptions
3. Returns only the top-K most relevant tools to the LLM

This approach improves context efficiency, increases reliability by reducing tool confusion, and enables scalability to ecosystems with hundreds or thousands of MCP tools.
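The retrieval step described above can be sketched as cosine-similarity top-K selection. This is an illustrative sketch under stated assumptions, not LiteLLM's actual implementation (which uses a semantic router and real embedding models); the tool names and toy 2-D embeddings are hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filter_tools(query_emb, tool_embs, top_k=5, similarity_threshold=0.3):
    """Return up to top_k tool names whose description embedding scores
    at or above similarity_threshold against the query embedding."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in tool_embs.items()]
    kept = [(n, s) for n, s in scored if s >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # most relevant first
    return [name for name, _ in kept[:top_k]]

# Toy 2-D embeddings standing in for real embedding-model output.
tools = {
    "wikipedia-fetch": [0.9, 0.1],
    "github-search": [0.5, 0.5],
    "slack-post": [0.0, 1.0],
}
print(filter_tools([1.0, 0.0], tools))  # prints ['wikipedia-fetch', 'github-search']
```

In production the embeddings come from the configured `embedding_model`, and the tool index is built once at startup so that only the query embedding is computed per request.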

```mermaid
sequenceDiagram
participant Client
participant LiteLLM as LiteLLM Proxy
participant SemanticFilter as Semantic Filter
participant MCP as MCP Registry
participant LLM as LLM Provider

Note over LiteLLM,MCP: Startup: Build Semantic Index
LiteLLM->>MCP: Fetch all registered MCP tools
MCP->>LiteLLM: Return all tools (e.g., 50 tools)
LiteLLM->>SemanticFilter: Build semantic router with embeddings
SemanticFilter->>LLM: Generate embeddings for tool descriptions
LLM->>SemanticFilter: Return embeddings
Note over SemanticFilter: Index ready for fast lookup

Note over Client,LLM: Request: Semantic Tool Filtering
Client->>LiteLLM: POST /v1/responses with MCP tools
LiteLLM->>SemanticFilter: Expand MCP references (50 tools available)
SemanticFilter->>SemanticFilter: Extract user query from request
SemanticFilter->>LLM: Generate query embedding
LLM->>SemanticFilter: Return query embedding
SemanticFilter->>SemanticFilter: Match query against tool embeddings
SemanticFilter->>LiteLLM: Return top-K tools (e.g., 3 most relevant)
LiteLLM->>LLM: Forward request with filtered tools (3 tools)
LLM->>LiteLLM: Return response
LiteLLM->>Client: Response with headers<br/>x-litellm-semantic-filter: 50->3<br/>x-litellm-semantic-filter-tools: tool1,tool2,tool3
```

## Configuration

Enable semantic filtering in your LiteLLM config:

```yaml title="config.yaml" showLineNumbers
litellm_settings:
mcp_semantic_tool_filter:
enabled: true
embedding_model: "text-embedding-3-small" # Model for semantic matching
top_k: 5 # Max tools to return
similarity_threshold: 0.3 # Min similarity score
```

**Configuration Options:**
- `enabled` - Enable/disable semantic filtering (default: `false`)
- `embedding_model` - Model for generating embeddings (default: `"text-embedding-3-small"`)
- `top_k` - Maximum number of tools to return (default: `10`)
- `similarity_threshold` - Minimum similarity score for matches (default: `0.3`)

## Usage

Use MCP tools normally with the Responses API or Chat Completions. The semantic filter runs automatically:

<Tabs>
<TabItem value="responses" label="Responses API">

```bash title="Responses API with Semantic Filtering" showLineNumbers
curl --location 'http://localhost:4000/v1/responses' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer sk-1234" \
--data '{
"model": "gpt-4o",
"input": [
{
"role": "user",
"content": "give me TLDR of what BerriAI/litellm repo is about",
"type": "message"
}
],
"tools": [
{
"type": "mcp",
"server_url": "litellm_proxy",
"require_approval": "never"
}
],
"tool_choice": "required"
}'
```

</TabItem>
<TabItem value="chat" label="Chat Completions">

```bash title="Chat Completions with Semantic Filtering" showLineNumbers
curl --location 'http://localhost:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer sk-1234" \
--data '{
"model": "gpt-4o",
"messages": [
{"role": "user", "content": "Search Wikipedia for LiteLLM"}
],
"tools": [
{
"type": "mcp",
"server_url": "litellm_proxy"
}
]
}'
```

</TabItem>
</Tabs>

## Response Headers

The semantic filter adds diagnostic headers to every response:

```
x-litellm-semantic-filter: 10->3
x-litellm-semantic-filter-tools: wikipedia-fetch,github-search,slack-post
```

- **`x-litellm-semantic-filter`** - Shows before→after tool count (e.g., `10->3` means 10 tools were filtered down to 3)
- **`x-litellm-semantic-filter-tools`** - CSV list of the filtered tool names (max 150 chars, clipped with `...` if longer)

These headers help you understand which tools were selected for each request and verify the filter is working correctly.
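A minimal client-side sketch of consuming these headers; the parsing helpers are hypothetical, not part of LiteLLM, and note that the tools header may arrive clipped with a trailing `...`:

```python
def parse_filter_header(value):
    """Parse an 'x-litellm-semantic-filter' value like '50->3'
    into (before, after) tool counts."""
    before, after = value.split("->")
    return int(before), int(after)

def parse_tools_header(value):
    """Split the 'x-litellm-semantic-filter-tools' CSV into names.

    The header is clipped at 150 chars, so a trailing fragment ending
    in '...' is dropped rather than treated as a real tool name.
    """
    return [n for n in value.split(",") if n and not n.endswith("...")]

before, after = parse_filter_header("50->3")
print(before, after)  # prints 50 3
print(parse_tools_header("wikipedia-fetch,github-search,slack-post"))
```

This is handy in integration tests: assert that `after` is at most your configured `top_k`, and that the expected tool names survived filtering.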

## Example

If you have 50 MCP tools registered and make a request asking about Wikipedia, the semantic filter will:

1. Semantically match your query `"Search Wikipedia for LiteLLM"` against all 50 tool descriptions
2. Select the top 5 most relevant tools (e.g., `wikipedia-fetch`, `wikipedia-search`, etc.)
3. Pass only those 5 tools to the LLM
4. Add headers showing `x-litellm-semantic-filter: 50->5`

This dramatically reduces prompt size while ensuring the LLM has access to the right tools for the task.

## Performance

The semantic filter is optimized for production:
- Router builds once on startup (no per-request overhead)
- Semantic matching typically takes under 50ms
- Fails gracefully - returns all tools if filtering fails
- No impact on latency for requests without MCP tools

## Related

- [MCP Overview](./mcp.md) - Learn about MCP in LiteLLM
- [MCP Permission Management](./mcp_control.md) - Control tool access by key/team
- [Using MCP](./mcp_usage.md) - Complete MCP usage guide
2 changes: 1 addition & 1 deletion docs/my-website/docs/providers/bedrock.md
@@ -9,7 +9,7 @@ ALL Bedrock models (Anthropic, Meta, Deepseek, Mistral, Amazon, etc.) are Suppor
| Description | Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs). |
| Provider Route on LiteLLM | `bedrock/`, [`bedrock/converse/`](#set-converse--invoke-route), [`bedrock/invoke/`](#set-invoke-route), [`bedrock/converse_like/`](#calling-via-internal-proxy), [`bedrock/llama/`](#deepseek-not-r1), [`bedrock/deepseek_r1/`](#deepseek-r1), [`bedrock/qwen3/`](#qwen3-imported-models), [`bedrock/qwen2/`](./bedrock_imported.md#qwen2-imported-models), [`bedrock/openai/`](./bedrock_imported.md#openai-compatible-imported-models-qwen-25-vl-etc), [`bedrock/moonshot`](./bedrock_imported.md#moonshot-kimi-k2-thinking) |
| Provider Doc | [Amazon Bedrock ↗](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html) |
| Supported OpenAI Endpoints | `/chat/completions`, `/completions`, `/embeddings`, `/images/generations` |
| Supported OpenAI Endpoints | `/chat/completions`, `/completions`, `/embeddings`, `/images/generations`, `/v1/realtime`|
| Rerank Endpoint | `/rerank` |
| Pass-through Endpoint | [Supported](../pass_through/bedrock.md) |

@@ -1,8 +1,4 @@
# Call Bedrock Nova Sonic Realtime API with Audio Input/Output

:::info
Requires LiteLLM Proxy v1.70.1+
:::
# Bedrock Realtime API

## Overview

4 changes: 4 additions & 0 deletions docs/my-website/docs/proxy/config_settings.md
@@ -545,6 +545,9 @@ router_settings:
| DEFAULT_MAX_TOKENS | Default maximum tokens for LLM calls. Default is 4096
| DEFAULT_MAX_TOKENS_FOR_TRITON | Default maximum tokens for Triton models. Default is 2000
| DEFAULT_MAX_REDIS_BATCH_CACHE_SIZE | Default maximum size for redis batch cache. Default is 1000
| DEFAULT_MCP_SEMANTIC_FILTER_EMBEDDING_MODEL | Default embedding model for MCP semantic tool filtering. Default is "text-embedding-3-small"
| DEFAULT_MCP_SEMANTIC_FILTER_SIMILARITY_THRESHOLD | Default similarity threshold for MCP semantic tool filtering. Default is 0.3
| DEFAULT_MCP_SEMANTIC_FILTER_TOP_K | Default number of top results to return for MCP semantic tool filtering. Default is 10
| DEFAULT_MOCK_RESPONSE_COMPLETION_TOKEN_COUNT | Default token count for mock response completions. Default is 20
| DEFAULT_MOCK_RESPONSE_PROMPT_TOKEN_COUNT | Default token count for mock response prompts. Default is 10
| DEFAULT_MODEL_CREATED_AT_TIME | Default creation timestamp for models. Default is 1677610602
@@ -802,6 +805,7 @@ router_settings:
| MAXIMUM_TRACEBACK_LINES_TO_LOG | Maximum number of lines to log in traceback in LiteLLM Logs UI. Default is 100
| MAX_RETRY_DELAY | Maximum delay in seconds for retrying requests. Default is 8.0
| MAX_LANGFUSE_INITIALIZED_CLIENTS | Maximum number of Langfuse clients to initialize on proxy. Default is 50. This is set since Langfuse initializes 1 thread every time a client is initialized. We've had an incident in the past where we reached 100% CPU utilization because Langfuse was initialized several times.
| MAX_MCP_SEMANTIC_FILTER_TOOLS_HEADER_LENGTH | Maximum header length for MCP semantic filter tools. Default is 150
| MIN_NON_ZERO_TEMPERATURE | Minimum non-zero temperature value. Default is 0.0001
| MINIMUM_PROMPT_CACHE_TOKEN_COUNT | Minimum token count for caching a prompt. Default is 1024
| MISTRAL_API_BASE | Base URL for Mistral API. Default is https://api.mistral.ai
34 changes: 34 additions & 0 deletions docs/my-website/docs/proxy/ui_logs.md
@@ -37,6 +37,40 @@ general_settings:

<Image img={require('../../img/ui_request_logs_content.png')}/>

## Tracing Tools

View which tools were provided and called in your completion requests.

<Image img={require('../../img/ui_tools.png')}/>

**Example:** Make a completion request with tools:

```bash
curl -X POST 'http://localhost:4000/chat/completions' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "What is the weather?"}],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
}
}
]
}'
```

Check the Logs page to see all tools provided and which ones were called.

## Stop storing Error Logs in DB

Binary file added docs/my-website/img/ui_tools.png
2 changes: 2 additions & 0 deletions docs/my-website/sidebars.js
@@ -538,6 +538,7 @@ const sidebars = {
items: [
"mcp",
"mcp_usage",
"mcp_semantic_filter",
"mcp_control",
"mcp_cost",
"mcp_guardrail",
@@ -716,6 +717,7 @@ const sidebars = {
"providers/bedrock_agents",
"providers/bedrock_writer",
"providers/bedrock_batches",
"providers/bedrock_realtime_with_audio",
"providers/aws_polly",
"providers/bedrock_vector_store",
]

This file was deleted.

10 changes: 0 additions & 10 deletions litellm-proxy-extras/litellm_proxy_extras/schema.prisma
@@ -305,16 +305,6 @@ model LiteLLM_VerificationToken {
litellm_budget_table LiteLLM_BudgetTable? @relation(fields: [budget_id], references: [budget_id])
litellm_organization_table LiteLLM_OrganizationTable? @relation(fields: [organization_id], references: [organization_id])
object_permission LiteLLM_ObjectPermissionTable? @relation(fields: [object_permission_id], references: [object_permission_id])

// SELECT COUNT(*) FROM (SELECT "public"."LiteLLM_VerificationToken"."token" FROM "public"."LiteLLM_VerificationToken" WHERE ("public"."LiteLLM_VerificationToken"."user_id" = $1 AND ("public"."LiteLLM_VerificationToken"."team_id" IS NULL OR "public"."LiteLLM_VerificationToken"."team_id" <> $2)) OFFSET $3 ) AS "sub"
// SELECT ... FROM "public"."LiteLLM_VerificationToken" WHERE "public"."LiteLLM_VerificationToken"."user_id" = $1 OFFSET $2
@@index([user_id, team_id])

// SELECT ... FROM "public"."LiteLLM_VerificationToken" WHERE "public"."LiteLLM_VerificationToken"."team_id" = $1 OFFSET $2
@@index([team_id])

// SELECT ... FROM "public"."LiteLLM_VerificationToken" WHERE (("public"."LiteLLM_VerificationToken"."expires" IS NULL OR "public"."LiteLLM_VerificationToken"."expires" > $1) AND "public"."LiteLLM_VerificationToken"."budget_reset_at" < $2) OFFSET $3
@@index([budget_reset_at, expires])
}

// Audit table for deleted keys - preserves spend and key information for historical tracking