From e5e20954a50c7c858d7e81264da4bea68630c6f6 Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Fri, 30 Jan 2026 17:50:09 +0000
Subject: [PATCH 1/2] Add /realtime API benchmarks to Benchmarks documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Added new section showing performance improvements for /realtime endpoint
- Included before/after metrics showing ~182× faster p99 latency (18,000 ms → 99 ms)
- Added test setup specifications and key optimizations
- Referenced from v1.80.5-stable release notes

Co-authored-by: ishaan <ishaan@berri.ai>
---
 docs/my-website/docs/benchmarks.md | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/docs/my-website/docs/benchmarks.md b/docs/my-website/docs/benchmarks.md
index 640212808bd..00f45c5d4b3 100644
--- a/docs/my-website/docs/benchmarks.md
+++ b/docs/my-website/docs/benchmarks.md
@@ -48,6 +48,34 @@ In these tests the baseline latency characteristics are measured against a fake-
 - High-percentile latencies drop significantly: P95 630 ms → 150 ms, P99 1,200 ms → 240 ms.
 - Setting workers equal to CPU count gives optimal performance.
 
+## `/realtime` API Benchmarks
+
+LiteLLM's `/realtime` endpoint has been optimized for low-latency WebSocket connections, achieving significant performance improvements through removal of redundant encodings, SSL context reuse, and caching of formatting strings.
+
+### Performance Metrics
+
+| Metric          | Before    | After     | Improvement                |
+| --------------- | --------- | --------- | -------------------------- |
+| Median latency  | 2,200 ms  | **59 ms** | **−97% (~37× faster)**     |
+| p95 latency     | 8,500 ms  | **67 ms** | **−99% (~127× faster)**    |
+| p99 latency     | 18,000 ms | **99 ms** | **−99% (~182× faster)**    |
+| Average latency | 3,214 ms  | **63 ms** | **−98% (~51× faster)**     |
+| RPS             | 165       | **1,207** | **+631% (~7.3× increase)** |
+
+### Test Setup
+
+| Category | Specification |
+|----------|---------------|
+| **Load Testing** | Locust: 1,000 concurrent users, 500 ramp-up |
+| **System** | 4 vCPUs, 8 GB RAM, 4 workers, 4 instances |
+| **Database** | PostgreSQL (Redis unused) |
+
+### Key Optimizations
+
+- Removed redundant encodings on the hot path
+- Reused shared SSL contexts to prevent excessive memory allocation
+- Cached formatting strings that were being regenerated twice per request
+
 ## Machine Spec used for testing
 
 Each machine deploying LiteLLM had the following specs:

From 5b7458234b6d2683655077d2cabd293c98186dae Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Fri, 30 Jan 2026 18:14:51 +0000
Subject: [PATCH 2/2] Update /realtime benchmarks to show current performance
 only

- Removed before/after comparison, showing only current metrics
- Clarified that the benchmarks measure end-to-end latency against a fake realtime endpoint
- Simplified table format for better readability

Co-authored-by: ishaan <ishaan@berri.ai>
---
 docs/my-website/docs/benchmarks.md | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/docs/my-website/docs/benchmarks.md b/docs/my-website/docs/benchmarks.md
index 00f45c5d4b3..a1489081b4c 100644
--- a/docs/my-website/docs/benchmarks.md
+++ b/docs/my-website/docs/benchmarks.md
@@ -50,17 +50,17 @@ In these tests the baseline latency characteristics are measured against a fake-
 
 ## `/realtime` API Benchmarks
 
-LiteLLM's `/realtime` endpoint has been optimized for low-latency WebSocket connections, achieving significant performance improvements through removal of redundant encodings, SSL context reuse, and caching of formatting strings.
+These benchmarks measure end-to-end latency for the `/realtime` endpoint, tested against a fake realtime endpoint.
 
 ### Performance Metrics
 
-| Metric          | Before    | After     | Improvement                |
-| --------------- | --------- | --------- | -------------------------- |
-| Median latency  | 2,200 ms  | **59 ms** | **−97% (~37× faster)**     |
-| p95 latency     | 8,500 ms  | **67 ms** | **−99% (~127× faster)**    |
-| p99 latency     | 18,000 ms | **99 ms** | **−99% (~182× faster)**    |
-| Average latency | 3,214 ms  | **63 ms** | **−98% (~51× faster)**     |
-| RPS             | 165       | **1,207** | **+631% (~7.3× increase)** |
+| Metric          | Value      |
+| --------------- | ---------- |
+| Median latency  | 59 ms      |
+| p95 latency     | 67 ms      |
+| p99 latency     | 99 ms      |
+| Average latency | 63 ms      |
+| RPS             | 1,207      |
 
 ### Test Setup
 
@@ -70,12 +70,6 @@ LiteLLM's `/realtime` endpoint has been optimized for low-latency WebSocket conn
 | **System** | 4 vCPUs, 8 GB RAM, 4 workers, 4 instances |
 | **Database** | PostgreSQL (Redis unused) |
 
-### Key Optimizations
-
-- Removed redundant encodings on the hot path
-- Reused shared SSL contexts to prevent excessive memory allocation
-- Cached formatting strings that were being regenerated twice per request
-
 ## Machine Spec used for testing
 
 Each machine deploying LiteLLM had the following specs: