Merged
Commits
39 commits
f1bca73
temp commit for branch switching
yuneng-jiang Feb 2, 2026
899bafb
adding team mappings UI
yuneng-jiang Feb 2, 2026
65c62ff
Adding tests
yuneng-jiang Feb 2, 2026
edfe239
reset_spend endpoint
yuneng-jiang Feb 2, 2026
16f0b49
team setting disable global guardrail fix
yuneng-jiang Feb 3, 2026
2645d25
fixing tests
yuneng-jiang Feb 3, 2026
cd154c4
Merge pull request #20307 from BerriAI/litellm_ui_disable_global_guar…
yuneng-jiang Feb 3, 2026
62993b5
Merge pull request #20305 from BerriAI/litellm_reset_spend_endpoint
yuneng-jiang Feb 3, 2026
8eba641
Merge pull request #20299 from BerriAI/litellm_ui_allowed_routes_drop…
yuneng-jiang Feb 3, 2026
32b1ff7
option to hide community engagement buttons
yuneng-jiang Feb 3, 2026
1b9631d
Add blog post: Achieving Sub-Millisecond Proxy Overhead (#20309)
AlexsanderHamir Feb 3, 2026
cf734cb
Migrate Default Team settings to use reusable Model Select
yuneng-jiang Feb 3, 2026
079f49f
[Feat] - MCP Semantic Filtering Support (#20296)
ishaan-jaff Feb 3, 2026
0ef506a
Litellm docs mcp filtering semantic (#20316)
ishaan-jaff Feb 3, 2026
4e8c6d1
fix linting
ishaan-jaff Feb 3, 2026
c8f9af1
fix mypy lint
ishaan-jaff Feb 3, 2026
333419b
Add documentation correctly for nova sonic
Sameerlite Feb 3, 2026
5cfcf67
[Feat] /chat/completions - allow using OpenAI style tools for `web_se…
ishaan-jaff Feb 3, 2026
24a4979
Merge pull request #20320 from BerriAI/litellm_nova-sonic_doc
Sameerlite Feb 3, 2026
5aa8725
docs Tracing Tools
ishaan-jaff Feb 3, 2026
7ae9804
docs fix
ishaan-jaff Feb 3, 2026
2984832
Merge pull request #20310 from BerriAI/litellm_ui_def_team_settings
yuneng-jiang Feb 3, 2026
9202870
Merge pull request #20308 from BerriAI/litellm_ui_community_buttons
yuneng-jiang Feb 3, 2026
7dd0248
Revert "fix: models loadbalancing billing issue by filter (#18891) (#…
Sameerlite Feb 3, 2026
86ae627
Fix litellm/tests/test_litellm/proxy/_experimental/mcp_server/test_se…
Sameerlite Feb 3, 2026
b379fb6
Fix code quality tests
Sameerlite Feb 3, 2026
ecb6413
Revert "add missing indexes on VerificationToken table (#20040)"
Sameerlite Feb 3, 2026
23f662e
Merge pull request #20328 from BerriAI/litellm_tuesday_cicd_release
Sameerlite Feb 3, 2026
1b1854b
Revert "Litellm tuesday cicd release"
Sameerlite Feb 3, 2026
80acd4c
Merge pull request #20330 from BerriAI/revert-20328-litellm_tuesday_c…
Sameerlite Feb 3, 2026
eb8f4d3
Revert "fix: models loadbalancing billing issue by filter (#18891) (#…
Sameerlite Feb 3, 2026
9a6bafe
Fix litellm/tests/test_litellm/proxy/_experimental/mcp_server/test_se…
Sameerlite Feb 3, 2026
017b78d
Fix code quality tests
Sameerlite Feb 3, 2026
fae0554
Revert "add missing indexes on VerificationToken table (#20040)"
Sameerlite Feb 3, 2026
31cdffd
Revert "fix: prevent error when max_fallbacks exceeds available model…
Sameerlite Feb 3, 2026
21e95c7
Fix litellm_security_tests
Sameerlite Feb 3, 2026
793a7fd
Merge pull request #20333 from BerriAI/litellm_tuesday_cicd_release_f…
Sameerlite Feb 3, 2026
47c5366
bump litellm 1.81.7
Sameerlite Feb 3, 2026
070d501
Merge pull request #20336 from BerriAI/litellm_bump_version_1.81.7
Sameerlite Feb 3, 2026
1 change: 1 addition & 0 deletions ci_cd/security_scans.sh
@@ -154,6 +154,7 @@ run_grype_scans() {
"CVE-2025-15367" # No fix available yet
"CVE-2025-12781" # No fix available yet
"CVE-2025-11468" # No fix available yet
"CVE-2026-1299" # Python 3.13 email module header injection - not applicable, LiteLLM doesn't use BytesGenerator for email serialization
)

# Build JSON array of allowlisted CVE IDs for jq
92 changes: 92 additions & 0 deletions docs/my-website/blog/sub_millisecond_proxy_overhead/index.md
@@ -0,0 +1,92 @@
---
slug: sub-millisecond-proxy-overhead
title: "Achieving Sub-Millisecond Proxy Overhead"
date: 2026-02-02T10:00:00
authors:
- name: Alexsander Hamir
title: "Performance Engineer, LiteLLM"
url: https://www.linkedin.com/in/alexsander-baptista/
image_url: https://github.com/AlexsanderHamir.png
- name: Krrish Dholakia
title: "CEO, LiteLLM"
url: https://www.linkedin.com/in/krish-d/
image_url: https://pbs.twimg.com/profile_images/1298587542745358340/DZv3Oj-h_400x400.jpg
- name: Ishaan Jaff
title: "CTO, LiteLLM"
url: https://www.linkedin.com/in/reffajnaahsi/
image_url: https://pbs.twimg.com/profile_images/1613813310264340481/lz54oEiB_400x400.jpg
description: "Our Q1 performance target and architectural direction for achieving sub-millisecond proxy overhead on modest hardware."
tags: [performance, architecture]
hide_table_of_contents: false
---

![Sidecar architecture: Python control plane vs. sidecar hot path](https://raw.githubusercontent.com/AlexsanderHamir/assets/main/Screenshot%202026-02-02%20172554.png)

# Achieving Sub-Millisecond Proxy Overhead

## Introduction

Our Q1 performance target is to aggressively move toward sub-millisecond proxy overhead on a single instance with 4 CPUs and 8 GB of RAM, and to continue pushing that boundary over time. Our broader goal is to make LiteLLM inexpensive to deploy, lightweight, and fast. This post outlines the architectural direction behind that effort.

Proxy overhead refers to the latency introduced by LiteLLM itself, independent of the upstream provider.

To measure it, we run the same workload directly against the provider and through LiteLLM at identical QPS (for example, 1,000 QPS) and compare the latency delta. To reduce noise, the load generator, LiteLLM, and a mock LLM endpoint all run on the same machine, ensuring the difference reflects proxy overhead rather than network latency.
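The measurement above boils down to comparing two latency samples taken under identical load. A minimal sketch of that comparison (the helper and the sample numbers are illustrative, not our actual benchmark harness):

```python
import statistics

def proxy_overhead_ms(direct_ms, proxied_ms):
    """Estimate proxy overhead as the delta between median latencies.

    direct_ms:  latencies (ms) from hitting the mock provider directly
    proxied_ms: latencies (ms) for the same workload routed through the proxy
    """
    return statistics.median(proxied_ms) - statistics.median(direct_ms)

# Both runs execute at identical QPS on the same machine, so the delta
# isolates proxy overhead from network and provider latency.
direct = [4.8, 5.0, 5.1, 5.2, 4.9]
proxied = [5.7, 5.9, 6.1, 5.8, 6.0]
print(round(proxy_overhead_ms(direct, proxied), 1))  # prints 0.9
```

Medians (or percentiles) are preferred over means here because a handful of outliers at high QPS would otherwise dominate the delta.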

---

## Where We're Coming From

Under the same benchmark originally conducted by [TensorZero](https://www.tensorzero.com/docs/gateway/benchmarks), LiteLLM previously failed at around 1,000 QPS.

That is no longer the case. Today, LiteLLM can be stress-tested at 1,000 QPS with no failures and can scale to 5,000 QPS, again without failures, on a single 4-CPU, 8-GB RAM instance.

This establishes a more up-to-date baseline and provides useful context as we continue working on proxy overhead and overall performance.

---

## Design Choice

Achieving sub-millisecond proxy overhead with a Python-based system requires being deliberate about where work happens.

Python is a strong fit for flexibility and extensibility: provider abstraction, configuration-driven routing, and a rich callback ecosystem. These are areas where development velocity and correctness matter more than raw throughput.

At higher request rates, however, certain classes of work become expensive when executed inside the Python process on every request. Rather than rewriting LiteLLM or introducing complex deployment requirements, we adopt an optional **sidecar architecture**.

This architectural change is how we intend to make LiteLLM **permanently fast**. While it supports our near-term performance targets, it is a long-term investment.

Python continues to own:

- Request validation and normalization
- Model and provider selection
- Callbacks and integrations

The sidecar owns **performance-critical execution**, such as:

- Efficient request forwarding
- Connection reuse and pooling
- Enforcing timeouts and limits
- Aggregating high-frequency metrics

This separation allows each component to focus on what it does best: Python acts as the control plane, while the sidecar handles the hot path.

---

### Why the Sidecar Is Optional

The sidecar is intentionally **optional**.

This allows us to ship it incrementally, validate it under real-world workloads, and avoid making it a hard dependency before it is fully battle-tested across all LiteLLM features.

Just as importantly, this ensures that self-hosting LiteLLM remains simple. The sidecar is bundled and started automatically, requires no additional infrastructure, and can be disabled entirely. From a user's perspective, LiteLLM continues to behave like a single service.

As of today, the sidecar is an optimization, not a requirement.

---

## Conclusion

Sub-millisecond proxy overhead is not achieved through a single optimization, but through architectural changes.

By keeping Python focused on orchestration and extensibility, and offloading performance-critical execution to a sidecar, we establish a foundation for making LiteLLM **permanently fast over time**, even on modest hardware such as a 1-CPU, 2-GB RAM instance, while keeping deployment and self-hosting simple.

This work extends beyond Q1, and we will continue sharing benchmarks and updates as the architecture evolves.
158 changes: 158 additions & 0 deletions docs/my-website/docs/mcp_semantic_filter.md
@@ -0,0 +1,158 @@
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# MCP Semantic Tool Filter

Automatically filter MCP tools by semantic relevance. When you have many MCP tools registered, LiteLLM semantically matches the user's query against tool descriptions and sends only the most relevant tools to the LLM.

## How It Works

Tool search shifts tool selection from a prompt-engineering problem to a retrieval problem. Instead of injecting a large static list of tools into every prompt, the semantic filter:

1. Builds a semantic index of all available MCP tools on startup
2. On each request, semantically matches the user's query against tool descriptions
3. Returns only the top-K most relevant tools to the LLM

This approach improves context efficiency, increases reliability by reducing tool confusion, and enables scalability to ecosystems with hundreds or thousands of MCP tools.
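The retrieval step described above can be sketched as cosine-similarity top-K selection. This is an illustrative sketch under stated assumptions, not LiteLLM's actual implementation (which uses a semantic router and real embedding models); the tool names and toy 2-D embeddings are hypothetical:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filter_tools(query_emb, tool_embs, top_k=5, similarity_threshold=0.3):
    """Return up to top_k tool names whose description embedding scores
    at or above similarity_threshold against the query embedding."""
    scored = [(name, cosine(query_emb, emb)) for name, emb in tool_embs.items()]
    kept = [(n, s) for n, s in scored if s >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # most relevant first
    return [name for name, _ in kept[:top_k]]

# Toy 2-D embeddings standing in for real embedding-model output.
tools = {
    "wikipedia-fetch": [0.9, 0.1],
    "github-search": [0.5, 0.5],
    "slack-post": [0.0, 1.0],
}
print(filter_tools([1.0, 0.0], tools))  # prints ['wikipedia-fetch', 'github-search']
```

In production the embeddings come from the configured `embedding_model`, and the tool index is built once at startup so that only the query embedding is computed per request.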

```mermaid
sequenceDiagram
participant Client
participant LiteLLM as LiteLLM Proxy
participant SemanticFilter as Semantic Filter
participant MCP as MCP Registry
participant LLM as LLM Provider

Note over LiteLLM,MCP: Startup: Build Semantic Index
LiteLLM->>MCP: Fetch all registered MCP tools
MCP->>LiteLLM: Return all tools (e.g., 50 tools)
LiteLLM->>SemanticFilter: Build semantic router with embeddings
SemanticFilter->>LLM: Generate embeddings for tool descriptions
LLM->>SemanticFilter: Return embeddings
Note over SemanticFilter: Index ready for fast lookup

Note over Client,LLM: Request: Semantic Tool Filtering
Client->>LiteLLM: POST /v1/responses with MCP tools
LiteLLM->>SemanticFilter: Expand MCP references (50 tools available)
SemanticFilter->>SemanticFilter: Extract user query from request
SemanticFilter->>LLM: Generate query embedding
LLM->>SemanticFilter: Return query embedding
SemanticFilter->>SemanticFilter: Match query against tool embeddings
SemanticFilter->>LiteLLM: Return top-K tools (e.g., 3 most relevant)
LiteLLM->>LLM: Forward request with filtered tools (3 tools)
LLM->>LiteLLM: Return response
LiteLLM->>Client: Response with headers<br/>x-litellm-semantic-filter: 50->3<br/>x-litellm-semantic-filter-tools: tool1,tool2,tool3
```

## Configuration

Enable semantic filtering in your LiteLLM config:

```yaml title="config.yaml" showLineNumbers
litellm_settings:
mcp_semantic_tool_filter:
enabled: true
embedding_model: "text-embedding-3-small" # Model for semantic matching
top_k: 5 # Max tools to return
similarity_threshold: 0.3 # Min similarity score
```

**Configuration Options:**
- `enabled` - Enable/disable semantic filtering (default: `false`)
- `embedding_model` - Model for generating embeddings (default: `"text-embedding-3-small"`)
- `top_k` - Maximum number of tools to return (default: `10`)
- `similarity_threshold` - Minimum similarity score for matches (default: `0.3`)

## Usage

Use MCP tools normally with the Responses API or Chat Completions. The semantic filter runs automatically:

<Tabs>
<TabItem value="responses" label="Responses API">

```bash title="Responses API with Semantic Filtering" showLineNumbers
curl --location 'http://localhost:4000/v1/responses' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer sk-1234" \
--data '{
"model": "gpt-4o",
"input": [
{
"role": "user",
"content": "give me TLDR of what BerriAI/litellm repo is about",
"type": "message"
}
],
"tools": [
{
"type": "mcp",
"server_url": "litellm_proxy",
"require_approval": "never"
}
],
"tool_choice": "required"
}'
```

</TabItem>
<TabItem value="chat" label="Chat Completions">

```bash title="Chat Completions with Semantic Filtering" showLineNumbers
curl --location 'http://localhost:4000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--header "Authorization: Bearer sk-1234" \
--data '{
"model": "gpt-4o",
"messages": [
{"role": "user", "content": "Search Wikipedia for LiteLLM"}
],
"tools": [
{
"type": "mcp",
"server_url": "litellm_proxy"
}
]
}'
```

</TabItem>
</Tabs>

## Response Headers

The semantic filter adds diagnostic headers to every response:

```
x-litellm-semantic-filter: 10->3
x-litellm-semantic-filter-tools: wikipedia-fetch,github-search,slack-post
```

- **`x-litellm-semantic-filter`** - Shows before→after tool count (e.g., `10->3` means 10 tools were filtered down to 3)
- **`x-litellm-semantic-filter-tools`** - CSV list of the filtered tool names (max 150 chars, clipped with `...` if longer)

These headers help you understand which tools were selected for each request and verify the filter is working correctly.
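A minimal client-side sketch of consuming these headers; the parsing helpers are hypothetical, not part of LiteLLM, and note that the tools header may arrive clipped with a trailing `...`:

```python
def parse_filter_header(value):
    """Parse an 'x-litellm-semantic-filter' value like '50->3'
    into (before, after) tool counts."""
    before, after = value.split("->")
    return int(before), int(after)

def parse_tools_header(value):
    """Split the 'x-litellm-semantic-filter-tools' CSV into names.

    The header is clipped at 150 chars, so a trailing fragment ending
    in '...' is dropped rather than treated as a real tool name.
    """
    return [n for n in value.split(",") if n and not n.endswith("...")]

before, after = parse_filter_header("50->3")
print(before, after)  # prints 50 3
print(parse_tools_header("wikipedia-fetch,github-search,slack-post"))
```

This is handy in integration tests: assert that `after` is at most your configured `top_k`, and that the expected tool names survived filtering.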

## Example

If you have 50 MCP tools registered and make a request asking about Wikipedia, the semantic filter will:

1. Semantically match your query `"Search Wikipedia for LiteLLM"` against all 50 tool descriptions
2. Select the top 5 most relevant tools (e.g., `wikipedia-fetch`, `wikipedia-search`, etc.)
3. Pass only those 5 tools to the LLM
4. Add headers showing `x-litellm-semantic-filter: 50->5`

This dramatically reduces prompt size while ensuring the LLM has access to the right tools for the task.

## Performance

The semantic filter is optimized for production:
- Router builds once on startup (no per-request overhead)
- Semantic matching typically takes under 50ms
- Fails gracefully - returns all tools if filtering fails
- No impact on latency for requests without MCP tools

## Related

- [MCP Overview](./mcp.md) - Learn about MCP in LiteLLM
- [MCP Permission Management](./mcp_control.md) - Control tool access by key/team
- [Using MCP](./mcp_usage.md) - Complete MCP usage guide
2 changes: 1 addition & 1 deletion docs/my-website/docs/providers/bedrock.md
@@ -9,7 +9,7 @@ ALL Bedrock models (Anthropic, Meta, Deepseek, Mistral, Amazon, etc.) are Suppor
| Description | Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs). |
| Provider Route on LiteLLM | `bedrock/`, [`bedrock/converse/`](#set-converse--invoke-route), [`bedrock/invoke/`](#set-invoke-route), [`bedrock/converse_like/`](#calling-via-internal-proxy), [`bedrock/llama/`](#deepseek-not-r1), [`bedrock/deepseek_r1/`](#deepseek-r1), [`bedrock/qwen3/`](#qwen3-imported-models), [`bedrock/qwen2/`](./bedrock_imported.md#qwen2-imported-models), [`bedrock/openai/`](./bedrock_imported.md#openai-compatible-imported-models-qwen-25-vl-etc), [`bedrock/moonshot`](./bedrock_imported.md#moonshot-kimi-k2-thinking) |
| Provider Doc | [Amazon Bedrock ↗](https://docs.aws.amazon.com/bedrock/latest/userguide/what-is-bedrock.html) |
| Supported OpenAI Endpoints | `/chat/completions`, `/completions`, `/embeddings`, `/images/generations` |
| Supported OpenAI Endpoints | `/chat/completions`, `/completions`, `/embeddings`, `/images/generations`, `/v1/realtime`|
| Rerank Endpoint | `/rerank` |
| Pass-through Endpoint | [Supported](../pass_through/bedrock.md) |

@@ -1,8 +1,4 @@
# Call Bedrock Nova Sonic Realtime API with Audio Input/Output

:::info
Requires LiteLLM Proxy v1.70.1+
:::
# Bedrock Realtime API

## Overview

4 changes: 4 additions & 0 deletions docs/my-website/docs/proxy/config_settings.md
@@ -545,6 +545,9 @@ router_settings:
| DEFAULT_MAX_TOKENS | Default maximum tokens for LLM calls. Default is 4096
| DEFAULT_MAX_TOKENS_FOR_TRITON | Default maximum tokens for Triton models. Default is 2000
| DEFAULT_MAX_REDIS_BATCH_CACHE_SIZE | Default maximum size for redis batch cache. Default is 1000
| DEFAULT_MCP_SEMANTIC_FILTER_EMBEDDING_MODEL | Default embedding model for MCP semantic tool filtering. Default is "text-embedding-3-small"
| DEFAULT_MCP_SEMANTIC_FILTER_SIMILARITY_THRESHOLD | Default similarity threshold for MCP semantic tool filtering. Default is 0.3
| DEFAULT_MCP_SEMANTIC_FILTER_TOP_K | Default number of top results to return for MCP semantic tool filtering. Default is 10
| DEFAULT_MOCK_RESPONSE_COMPLETION_TOKEN_COUNT | Default token count for mock response completions. Default is 20
| DEFAULT_MOCK_RESPONSE_PROMPT_TOKEN_COUNT | Default token count for mock response prompts. Default is 10
| DEFAULT_MODEL_CREATED_AT_TIME | Default creation timestamp for models. Default is 1677610602
@@ -802,6 +805,7 @@ router_settings:
| MAXIMUM_TRACEBACK_LINES_TO_LOG | Maximum number of lines to log in traceback in LiteLLM Logs UI. Default is 100
| MAX_RETRY_DELAY | Maximum delay in seconds for retrying requests. Default is 8.0
| MAX_LANGFUSE_INITIALIZED_CLIENTS | Maximum number of Langfuse clients to initialize on proxy. Default is 50. This is set since Langfuse initializes 1 thread every time a client is initialized. We've had an incident in the past where we reached 100% CPU utilization because Langfuse was initialized several times.
| MAX_MCP_SEMANTIC_FILTER_TOOLS_HEADER_LENGTH | Maximum header length for MCP semantic filter tools. Default is 150
| MIN_NON_ZERO_TEMPERATURE | Minimum non-zero temperature value. Default is 0.0001
| MINIMUM_PROMPT_CACHE_TOKEN_COUNT | Minimum token count for caching a prompt. Default is 1024
| MISTRAL_API_BASE | Base URL for Mistral API. Default is https://api.mistral.ai
34 changes: 34 additions & 0 deletions docs/my-website/docs/proxy/ui_logs.md
@@ -37,6 +37,40 @@ general_settings:

<Image img={require('../../img/ui_request_logs_content.png')}/>

## Tracing Tools

View which tools were provided and called in your completion requests.

<Image img={require('../../img/ui_tools.png')}/>

**Example:** Make a completion request with tools:

```bash
curl -X POST 'http://localhost:4000/chat/completions' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "What is the weather?"}],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
}
}
]
}'
```

Check the Logs page to see all tools provided and which ones were called.

## Stop storing Error Logs in DB

Binary file added docs/my-website/img/ui_tools.png
2 changes: 2 additions & 0 deletions docs/my-website/sidebars.js
@@ -538,6 +538,7 @@ const sidebars = {
items: [
"mcp",
"mcp_usage",
"mcp_semantic_filter",
"mcp_control",
"mcp_cost",
"mcp_guardrail",
@@ -716,6 +717,7 @@ const sidebars = {
"providers/bedrock_agents",
"providers/bedrock_writer",
"providers/bedrock_batches",
"providers/bedrock_realtime_with_audio",
"providers/aws_polly",
"providers/bedrock_vector_store",
]

This file was deleted.

10 changes: 0 additions & 10 deletions litellm-proxy-extras/litellm_proxy_extras/schema.prisma
@@ -305,16 +305,6 @@ model LiteLLM_VerificationToken {
litellm_budget_table LiteLLM_BudgetTable? @relation(fields: [budget_id], references: [budget_id])
litellm_organization_table LiteLLM_OrganizationTable? @relation(fields: [organization_id], references: [organization_id])
object_permission LiteLLM_ObjectPermissionTable? @relation(fields: [object_permission_id], references: [object_permission_id])

// SELECT COUNT(*) FROM (SELECT "public"."LiteLLM_VerificationToken"."token" FROM "public"."LiteLLM_VerificationToken" WHERE ("public"."LiteLLM_VerificationToken"."user_id" = $1 AND ("public"."LiteLLM_VerificationToken"."team_id" IS NULL OR "public"."LiteLLM_VerificationToken"."team_id" <> $2)) OFFSET $3 ) AS "sub"
// SELECT ... FROM "public"."LiteLLM_VerificationToken" WHERE "public"."LiteLLM_VerificationToken"."user_id" = $1 OFFSET $2
@@index([user_id, team_id])

// SELECT ... FROM "public"."LiteLLM_VerificationToken" WHERE "public"."LiteLLM_VerificationToken"."team_id" = $1 OFFSET $2
@@index([team_id])

// SELECT ... FROM "public"."LiteLLM_VerificationToken" WHERE (("public"."LiteLLM_VerificationToken"."expires" IS NULL OR "public"."LiteLLM_VerificationToken"."expires" > $1) AND "public"."LiteLLM_VerificationToken"."budget_reset_at" < $2) OFFSET $3
@@index([budget_reset_at, expires])
}

// Audit table for deleted keys - preserves spend and key information for historical tracking