feat: add control plane for multi-proxy worker management #24217
ryan-crabbe merged 5 commits into main
Conversation
Adds a control plane capability that enables a central admin instance to manage multiple regional worker proxies from a single UI.

Backend:
- Worker registry loaded from YAML config (worker_id, name, url)
- /.well-known/litellm-ui-config exposes is_control_plane and workers list
- /v3/login + /v3/login/exchange: opaque code exchange for cross-origin username/password auth (JWT never in URL/logs, single-use 60s TTL)
- SSO cookie handoff with return_to → opaque code → exchange
- _validate_return_to: full origin validation (scheme + hostname + port)
- Startup warning when control_plane_url is set without Redis
- Both /v3 endpoints gated behind the control_plane_url config

Frontend:
- Worker selector dropdown on the login page (gated behind is_control_plane)
- Cross-origin SSO code-exchange handling on callback
- switchToWorkerUrl: localStorage-persisted worker URL for API calls
- useWorker hook: shared worker state management
- WorkerDropdown in the navbar for switching workers
- Logout/switch clears worker state from localStorage

Tests:
- 7 tests for /v3/login + /v3/login/exchange
- 10 tests for _validate_return_to
- 2 tests for the control plane discovery endpoint
Greptile Summary

This PR introduces a control-plane architecture for LiteLLM that lets a central admin instance manage multiple regional worker proxies from a single UI. The backend adds a YAML-configured worker registry, exposes it through the /.well-known/litellm-ui-config discovery endpoint, and introduces the /v3/login and /v3/login/exchange opaque-code endpoints for cross-origin auth. Key concerns identified (some new, several previously discussed) are detailed in the file-by-file table below.
Confidence Score: 2/5
| Filename | Overview |
|---|---|
| litellm/proxy/proxy_server.py | Adds /v3/login and /v3/login/exchange endpoints for control-plane cross-origin auth. Multiple issues flagged (including in previous threads): TOCTOU on code deletion, missing cookie security attributes, str(None) credential coercion, non-JSON body returning 500 instead of 400, and unguarded dict key access on cached data that can produce a KeyError → 500. |
| litellm/proxy/management_endpoints/ui_sso.py | Extends SSO flow with return_to support for cross-origin worker redirects. _validate_return_to is well-implemented (origin-exact, case-insensitive, default-port normalization). Main concern (flagged in previous thread): litellm_cp_return_to cookie is missing the Secure attribute for HTTPS deployments. |
| litellm/proxy/discovery_endpoints/ui_discovery_endpoints.py | Adds is_control_plane and workers fields to /.well-known/litellm-ui-config. Logic is correct — is_control_plane derived from a non-empty worker registry. Worker URLs are exposed unauthenticated (discussed in previous thread). |
| litellm/types/proxy/control_plane_endpoints.py | New WorkerRegistryEntry Pydantic model with URL scheme validation. Clean and correct. |
| ui/litellm-dashboard/src/app/login/LoginPage.tsx | Most complex frontend change. Adds worker selector, SSO code-exchange flow, and worker-switch token clearing. Multiple issues flagged across previous thread and this review: missing .catch() on exchangeLoginCode, SSO code exchange not gated on is_control_plane, hardcoded localStorage key, hardcoded /ui/login path, and uiConfig load-failure leaving user stuck. |
| ui/litellm-dashboard/src/components/networking.tsx | Adds switchToWorkerUrl, exchangeLoginCode, and localStorage-backed proxyBaseUrl initialization. Worker URL is validated for HTTP/HTTPS scheme before storing. Known issues (previous thread): exchangeLoginCode missing credentials: "include" for cross-origin cookie delivery, and WORKER_URL_KEY constant not exported for use in other files. |
| ui/litellm-dashboard/src/hooks/useWorker.ts | New hook for worker state management. Clean implementation — initializes from localStorage, syncs proxyBaseUrl on mount via useEffect, and provides selectWorker/disconnectFromWorker callbacks. Note that each component instance gets its own state; synchronization happens via localStorage rather than shared React state. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant CP as Control Plane UI
    participant Worker as Worker Proxy
    participant SSO as SSO Provider
    rect rgb(230, 240, 255)
    CP->>CP: switchToWorkerUrl worker.url
    CP->>Worker: POST /v3/login username+password
    Worker->>Worker: authenticate and generate JWT
    Worker->>Worker: cache JWT at login_code:CODE TTL 60s
    Worker-->>CP: response with code and expires_in
    CP->>Worker: POST /v3/login/exchange with code
    Worker->>Worker: atomic GET and DELETE cache entry
    Worker-->>CP: response with token and redirect_url
    CP->>CP: set token cookie and navigate to dashboard
    end
    rect rgb(255, 240, 230)
    CP->>CP: store worker URL in localStorage
    CP->>Worker: GET /sso/key/generate?return_to=CP_URL
    Worker->>Worker: validate return_to origin
    Worker->>Worker: set litellm_cp_return_to cookie
    Worker-->>CP: redirect to SSO Provider
    CP->>SSO: OAuth2 authorization request
    SSO-->>Worker: callback with auth code
    Worker->>Worker: exchange auth code for user info
    Worker->>Worker: cache JWT at login_code:CODE
    Worker->>Worker: delete return_to cookie
    Worker-->>CP: redirect to CP_URL with login code
    CP->>Worker: POST /v3/login/exchange with code
    Worker-->>CP: response with token
    CP->>CP: set token cookie and navigate to dashboard
    end
```
Comments Outside Diff (2)
- litellm/proxy/proxy_server.py, line 266-268 (link): Non-JSON body returns 500 instead of 400. `await request.json()` raises a `json.JSONDecodeError` when the body is not valid JSON, which is caught by the outer `except Exception as e` handler and re-raised as `HTTP_500_INTERNAL_SERVER_ERROR`. The equivalent `/v2/login` path already has a test (`test_login_v2_returns_json_on_invalid_json_body`) covering the 400-on-bad-body case, but the v3 endpoint has no such guard. Consider catching the JSON parse failure explicitly and raising a `ProxyException` with `HTTP_400_BAD_REQUEST`. The same applies to the `body = await request.json()` line inside `login_v3_exchange`.
- litellm/proxy/proxy_server.py, line 383-391 (link): Unguarded dict key access on cached data can raise `KeyError`. `cached_data["token"]` and `cached_data["redirect_url"]` are accessed directly after only checking `not cached_data or not isinstance(cached_data, dict)`. If `cached_data` is a non-empty dict but lacks one of those keys (for example, if a different part of the codebase wrote something to a `login_code:`-prefixed Redis key), a `KeyError` is raised, caught by the outer `except Exception`, and surfaced as a confusing 500 Internal Server Error rather than the more appropriate 401 Unauthorized. Prefer using `.get()` on both keys followed by an explicit `None` check, raising a `ProxyException` with `HTTP_401_UNAUTHORIZED` if either value is missing. This keeps the error response semantically consistent with the "invalid or expired code" case just above.
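A minimal sketch of the suggested guard. The field names (`token`, `redirect_url`) come from the PR; the `ProxyException` class and the helper function here are illustrative stand-ins, not the proxy's real implementation:

```python
# Illustrative sketch only: ProxyException and the 401 constant are
# stand-ins for the proxy's real exception type and status constant.
class ProxyException(Exception):
    def __init__(self, message, code):
        super().__init__(message)
        self.code = code

HTTP_401_UNAUTHORIZED = 401

def extract_login_payload(cached_data):
    """Return (token, redirect_url), or raise 401 if either is missing."""
    if not cached_data or not isinstance(cached_data, dict):
        raise ProxyException("Invalid or expired login code", HTTP_401_UNAUTHORIZED)
    token = cached_data.get("token")
    redirect_url = cached_data.get("redirect_url")
    if token is None or redirect_url is None:
        # Same 401 as the invalid/expired-code case, instead of KeyError -> 500
        raise ProxyException("Invalid or expired login code", HTTP_401_UNAUTHORIZED)
    return token, redirect_url
```

A malformed cache entry then surfaces as the same 401 the client already handles, rather than an opaque 500.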
Last reviewed commit: "Merge branch 'litell..."
```typescript
if (ssoCode) {
  const workerUrl = localStorage.getItem("litellm_worker_url");
  exchangeLoginCode(ssoCode, workerUrl).then(() => {
    params.delete("code");
    const cleanSearch = params.toString();
    window.history.replaceState(null, "", window.location.pathname + (cleanSearch ? `?${cleanSearch}` : ""));
    router.replace("/ui/?login=success");
  });
  return;
}
```
Unhandled rejection leaves user stuck on loading screen
exchangeLoginCode(...).then(...) has no .catch() / rejection handler. If the exchange fails (e.g., the 60 s TTL expired, Redis missed the code, or the network request errored), the promise rejects silently. isLoading stays true (the setIsLoading(false) is never reached), and the user sees an infinite <LoadingScreen /> with no error message and no way to recover.
Suggested change:

```typescript
if (ssoCode) {
  const workerUrl = localStorage.getItem("litellm_worker_url");
  exchangeLoginCode(ssoCode, workerUrl)
    .then(() => {
      params.delete("code");
      const cleanSearch = params.toString();
      window.history.replaceState(null, "", window.location.pathname + (cleanSearch ? `?${cleanSearch}` : ""));
      router.replace("/ui/?login=success");
    })
    .catch(() => {
      setIsLoading(false);
    });
  return;
}
```
```python
if not cached_data or not isinstance(cached_data, dict):
    raise ProxyException(
        message="Invalid or expired login code",
        type=ProxyErrorTypes.auth_error,
        param="code",
        code=status.HTTP_401_UNAUTHORIZED,
    )

# Single-use: delete immediately
if redis_usage_cache is not None:
    await redis_usage_cache.async_delete_cache(key=cache_key)
else:
    await user_api_key_cache.async_delete_cache(key=cache_key)
```
TOCTOU race: single-use code is not atomically consumed
The get → delete sequence is not atomic. Two concurrent POST /v3/login/exchange requests with the same code can both pass the if not cached_data check before either one executes the delete — meaning both callers receive a valid token and the single-use guarantee is defeated.
On the Redis-backed path (the critical multi-pod case), the fix is to use an atomic GETDEL (or a Lua transaction) so the read and removal are a single round-trip. On the in-memory user_api_key_cache path the risk is lower (single-threaded asyncio event loop) but still theoretically exploitable if the cache implementation yields control between the two awaits.
Until an atomic helper is available, ensure at minimum that no additional await expressions appear between the async_get_cache and async_delete_cache calls, to keep the window as small as possible.
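The atomic pattern can be sketched as below. This assumes a `redis.asyncio`-style client exposing `getdel()` (Redis >= 6.2); the function name and the in-memory fallback are illustrative, not the PR's code:

```python
import asyncio
import json
from typing import Any, Optional

async def consume_login_code(
    cache_key: str,
    redis_client: Any = None,
    memory_cache: Optional[dict] = None,
) -> Optional[dict]:
    """Read and remove a single-use login code in one atomic step."""
    if redis_client is not None:
        # GETDEL performs the read and the delete in a single round-trip,
        # so two concurrent exchanges cannot both see the code.
        raw = await redis_client.getdel(cache_key)
        return json.loads(raw) if raw else None
    # Single-process fallback: dict.pop() reads and removes without ever
    # yielding to the event loop, so it is race-free under asyncio.
    return (memory_cache or {}).pop(cache_key, None)
```

With this shape, a second exchange attempt with the same code always observes `None` and can return 401.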
```python
if return_to is not None and sso_redirect is not None:
    SSOAuthenticationHandler._validate_return_to(return_to)
    sso_redirect.set_cookie(
        key="litellm_cp_return_to",
        value=return_to,
        max_age=600,
        httponly=True,
        samesite="lax",
    )
```
litellm_cp_return_to cookie is missing the Secure attribute
The cookie is created with httponly=True and samesite="lax" but without secure=True. Control-plane deployments run over HTTPS; without the Secure flag the browser will also transmit the cookie over plain HTTP connections, which could expose the return_to value to network interception.
Adding secure=True (or conditionally, only when control_plane_url starts with https://) brings this in line with standard security hardening for cookies used in a security-sensitive redirect flow.
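The conditional variant can be sketched with a small helper that derives the flag from the configured URL's scheme (the helper name is hypothetical; the cookie call itself would simply gain `secure=should_mark_cookie_secure(control_plane_url)`):

```python
from typing import Optional
from urllib.parse import urlparse

def should_mark_cookie_secure(control_plane_url: Optional[str]) -> bool:
    """Hypothetical helper: Secure for https:// deployments, and by default."""
    if not control_plane_url:
        return True  # fail closed: harden the cookie when the URL is unknown
    return urlparse(control_plane_url).scheme == "https"
```

Plain-HTTP local development (`http://localhost:4000`) keeps working, while any https:// control plane gets the Secure attribute.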
```typescript
const params = new URLSearchParams(window.location.search);
const ssoCode = params.get("code");
if (ssoCode) {
  const workerUrl = localStorage.getItem("litellm_worker_url");
```
Hardcoded localStorage key duplicates a constant defined in `networking.tsx`
"litellm_worker_url" is already defined as WORKER_URL_KEY in networking.tsx (line 89). Reading it directly here as a magic string means that if the key is ever renamed in networking.tsx, this lookup will silently break the SSO code-exchange flow.
Consider exporting WORKER_URL_KEY from networking.tsx and importing it here, or exposing a dedicated getStoredWorkerUrl() helper that centralises the localStorage access.
```python
    },
    status_code=status.HTTP_200_OK,
)
json_response.set_cookie(key="token", value=cached_data["token"])
```
Token cookie set without `httponly`, `secure`, or `samesite` attributes
json_response.set_cookie(...) uses all defaults here, which means the cookie is readable by JavaScript, has no SameSite policy, and is sent over HTTP. For a JWT that grants admin-UI access, the cookie should include httponly=True, samesite="lax", and secure=True for HTTPS deployments.
Also note that in the cross-origin case (control-plane frontend fetching the worker's exchange endpoint) this server-set cookie will be blocked by modern browsers anyway because a cross-origin Set-Cookie without SameSite=None; Secure is rejected. The primary token delivery path (setting document.cookie from the response body in networking.tsx) still works — but hardening the server-side cookie is good defence-in-depth for same-origin calls.
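A sketch of the hardened attribute set, expressed as the kwargs a Starlette `Response.set_cookie()` call would receive (the helper is illustrative; whether `secure` is True would depend on the deployment's scheme):

```python
def token_cookie_kwargs(token: str, https: bool = True) -> dict:
    """Illustrative: hardened attributes for the admin-UI token cookie."""
    return {
        "key": "token",
        "value": token,
        "httponly": True,   # not readable by JavaScript (limits XSS token theft)
        "samesite": "lax",  # not sent on cross-site POSTs
        "secure": https,    # never transmitted over plain HTTP
    }
```

Usage would be `json_response.set_cookie(**token_cookie_kwargs(cached_data["token"]))` in place of the bare call above.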
```typescript
    throw new Error(deriveErrorMessage(errorData));
  }

  const exchangeData: LoginResponse = await exchangeResponse.json();
  if (exchangeData.token) {
    document.cookie = `token=${exchangeData.token}; path=/; SameSite=Lax`;
  }
  return exchangeData;
}
```
Cross-origin `exchangeLoginCode` missing `credentials: "include"`
The loginCall exchange path (used in the direct username/password flow) passes credentials: "include" when calling /v3/login/exchange. The exchangeLoginCode function (used in the SSO callback flow) does not, which creates an inconsistency. While the JWT is extracted from the response body in both paths — making the omission non-blocking in most cases — the browser won't send or store cookies for the worker domain without credentials: "include". This diverges from the same endpoint's behaviour elsewhere and may cause subtle issues for consumers that rely on the server-set token cookie on the worker origin.
Suggested change:

```typescript
const response = await fetch(`${base}/v3/login/exchange`, {
  method: "POST",
  body: JSON.stringify({ code }),
  credentials: "include",
  headers: { "Content-Type": "application/json" },
});
```
```python
verbose_proxy_logger.exception(
    "litellm.proxy.proxy_server.login_v3_exchange(): Exception occurred - {}".format(
        str(e)
```
CORS headers required for cross-origin worker `/v3/login` and `/v3/login/exchange` calls
The control-plane UI (e.g. https://cp.example.com) calls these endpoints on a worker (e.g. https://worker1.example.com). That is a cross-origin request. Modern browsers send a CORS preflight (OPTIONS) before the actual POST; without a matching Access-Control-Allow-Origin header in the worker's response, the browser silently blocks both the preflight and the request — making the entire cross-origin login flow fail with no visible error beyond a CORS console message.
If LiteLLM's existing global CORS middleware is already permissive enough this may "just work", but the PR doesn't document or verify this. Consider:
- Adding an explicit `control_plane_url` → `Access-Control-Allow-Origin` mapping for these two endpoints, OR
- Documenting that the deployer must configure the CORS middleware's `allowed_origins` to include the control plane URL when `control_plane_url` is set.
Without this, cross-origin username/password login and the SSO code-exchange callback will be blocked by browsers in all production deployments that separate control plane and worker origins.
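As a rough illustration of the deployer-side option, a FastAPI/Starlette `CORSMiddleware` configuration might look like the following. This is a sketch under the assumption that the control plane runs at `https://cp.example.com`; the PR itself does not ship this wiring:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Worker-side CORS so the control-plane origin may call /v3/login and
# /v3/login/exchange. allow_credentials is needed if the server-set
# token cookie should be stored by the browser.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://cp.example.com"],  # the control-plane origin
    allow_credentials=True,
    allow_methods=["POST", "OPTIONS"],         # actual request + preflight
    allow_headers=["Content-Type"],
)
```

Note that `allow_origins=["*"]` cannot be combined with `allow_credentials=True`; the origin must be listed explicitly.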
```python
is_control_plane: bool = False
workers: List[WorkerRegistryEntry] = []
```
Worker registry (including internal URLs) exposed to unauthenticated callers
/.well-known/litellm-ui-config is publicly accessible without authentication (by design, since the login page fetches it before the user is authenticated). As a result, the full workers list — including each worker's name, ID, and internal/external URL — is visible to any unauthenticated user who can reach the control plane.
For deployments where worker URLs are internal hostnames or carry access tokens embedded in the URL, this is a meaningful information-disclosure surface. Consider:
- Omitting `url` from the `workers` field in `UiDiscoveryEndpoints` and having the UI resolve the URL via a worker ID after authentication, OR
- Keeping the current design but calling out in documentation that worker URLs should not contain credentials and should be treated as semi-public.
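The first option amounts to stripping `url` before the unauthenticated response. A sketch, where `WorkerRegistryEntry` mirrors the PR's model shape but `sanitize_workers` is a hypothetical helper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WorkerRegistryEntry:
    # Mirrors the PR's registry fields (worker_id, name, url)
    worker_id: str
    name: str
    url: str

def sanitize_workers(workers: List[WorkerRegistryEntry]) -> List[dict]:
    """Hypothetical: expose only id + display name pre-auth; the UI
    resolves the URL by worker_id after login."""
    return [{"worker_id": w.worker_id, "name": w.name} for w in workers]
```

The login page still gets enough to render the selector dropdown, while internal hostnames stay behind authentication.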
```typescript
// If switching workers on a control plane, clear the old token and show login
const switchingWorker = params.has("worker");
if (switchingWorker && uiConfig?.is_control_plane) {
  clearTokenCookies();
  setIsLoading(false);
  return;
}
```
Worker switch clears cookies only when `is_control_plane` is true, but `uiConfig` may be stale
The guard uiConfig?.is_control_plane is read from the already-loaded config, which is correct. However, the effect dependency array is [isConfigLoading, router, uiConfig]. If a user navigates to /ui/login?worker=team-b on a non-control-plane instance and uiConfig resolves to undefined (e.g., a network error), uiConfig?.is_control_plane is undefined (falsy) — the tokens are not cleared and isLoading is never set to false, leaving the user stuck on the loading screen.
If uiConfig fails to load and the URL has ?worker=, the effect should still set isLoading(false) to avoid the spinner hanging indefinitely.
```python
master_key=master_key,
prisma_client=prisma_client,
```
str(body.get(...)) silently coerces missing fields to "None"
When username or password is absent from the request body, body.get("username") returns None, and str(None) produces the literal string "None". authenticate_user then receives "None" as the credential rather than raising a clear validation error, making missing-parameter failures harder to diagnose and potentially interacting unexpectedly with username-based lookups.
Consider explicitly checking for the presence of both fields before coercing to str, and raising a ProxyException with HTTP_400_BAD_REQUEST if either is absent — consistent with how other endpoints validate required body fields.
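A sketch of that explicit presence check (as before, `ProxyException` is a stand-in for the proxy's real exception type, and the function name is illustrative):

```python
# Illustrative sketch: validate required body fields before coercion.
class ProxyException(Exception):
    def __init__(self, message, code):
        super().__init__(message)
        self.code = code

HTTP_400_BAD_REQUEST = 400

def parse_credentials(body: dict) -> tuple:
    username = body.get("username")
    password = body.get("password")
    if not isinstance(username, str) or not isinstance(password, str):
        # Reject up front instead of forwarding the literal string "None"
        # to authenticate_user.
        raise ProxyException(
            "username and password are required", HTTP_400_BAD_REQUEST
        )
    return username, password
```

A missing field now yields a clear 400 with an actionable message rather than an opaque authentication failure against the username "None".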
```typescript
  localStorage.removeItem("litellm_selected_worker_id");
  localStorage.removeItem("litellm_worker_url");
  window.location.href = logoutUrl;
};

const handleWorkerSwitch = (workerId: string) => {
  clearTokenCookies();
  clearStoredReturnUrl();
  localStorage.removeItem("litellm_selected_worker_id");
  localStorage.removeItem("litellm_worker_url");
```
Magic-string localStorage keys duplicated from constants
Both handleLogout (lines 82–83) and handleWorkerSwitch (lines 90–91) manually remove "litellm_selected_worker_id" and "litellm_worker_url" from localStorage using hardcoded strings. These strings are already defined as constants in useWorker.ts (SELECTED_WORKER_KEY) and networking.tsx (WORKER_URL_KEY), and the useWorker hook already exposes a disconnectFromWorker() function that atomically clears both keys and resets proxyBaseUrl.
If either constant is renamed, these calls will silently stop working. Consider importing disconnectFromWorker from useWorker and calling it in both handlers instead of duplicating the removal logic:
```typescript
// Before redirecting in handleLogout / handleWorkerSwitch:
disconnectFromWorker(); // clears SELECTED_WORKER_KEY, WORKER_URL_KEY, and proxyBaseUrl
```
Or at minimum, export SELECTED_WORKER_KEY from useWorker.ts and WORKER_URL_KEY from networking.tsx and import them here.
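The export-the-constants option can be sketched as follows. The string values match the PR; `clearWorkerState` and the `KeyRemover` interface are hypothetical names used here so the sketch stays self-contained (the PR's own equivalent is `disconnectFromWorker` in `useWorker.ts`):

```typescript
// Centralise the worker-state keys behind exported constants so renames
// cannot silently desynchronise callers.
export const SELECTED_WORKER_KEY = "litellm_selected_worker_id";
export const WORKER_URL_KEY = "litellm_worker_url";

// Minimal structural interface so the helper is testable without a DOM.
interface KeyRemover {
  removeItem(key: string): void;
}

// Hypothetical helper: clear all worker-related state in one place.
export function clearWorkerState(storage: KeyRemover): void {
  storage.removeItem(SELECTED_WORKER_KEY);
  storage.removeItem(WORKER_URL_KEY);
}
```

`handleLogout` and `handleWorkerSwitch` would then call `clearWorkerState(localStorage)` instead of repeating the magic strings.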
```typescript
// SSO on the worker (or this instance if no worker), always
// include return_to so the callback redirects back here
const ssoBase = selectedWorker?.url ?? getProxyBaseUrl();
const returnTo = encodeURIComponent(window.location.origin + "/ui/login");
```
Hardcoded `/ui/login` path breaks non-root server path deployments
window.location.origin + "/ui/login" constructs the SSO return_to URL by always appending the hardcoded path /ui/login. For deployments where the control plane is served at a non-root server root path (e.g. https://cp.example.com/litellm/ui/login), window.location.origin gives https://cp.example.com — so the return_to value becomes https://cp.example.com/ui/login, which does not match the actual login page URL.
After the SSO redirect, the browser lands on the wrong path and the ?code= parameter is never consumed by LoginPage.tsx, causing the login flow to silently fail.
Use window.location.pathname (which already contains the full path including any server root path prefix) instead of the hardcoded string:
Suggested change:

```typescript
const returnTo = encodeURIComponent(window.location.origin + window.location.pathname);
```
```typescript
if (ssoCode) {
  const workerUrl = localStorage.getItem("litellm_worker_url");
  exchangeLoginCode(ssoCode, workerUrl).then(() => {
    params.delete("code");
    const cleanSearch = params.toString();
    window.history.replaceState(null, "", window.location.pathname + (cleanSearch ? `?${cleanSearch}` : ""));
    router.replace("/ui/?login=success");
  });
  return;
```
SSO code exchange not gated on `is_control_plane`
The ?code= exchange path fires on every instance — including workers and non-control-plane deployments — regardless of whether this is actually a control-plane UI. On any instance where control_plane_url is not configured, calling /v3/login/exchange returns a 404. Because there's no .catch() handler, the error is swallowed silently, isLoading stays true, and the user sees an infinite loading screen whenever a ?code= query param appears in the URL on a non-control-plane instance.
Additionally, the exchange is attempted even when params.get("login") is not "success", meaning any ?code= query param (from any source) triggers it.
A minimal guard prevents both failure modes:
```typescript
const ssoCode = params.get("code");
const loginSuccess = params.get("login") === "success";
if (ssoCode && loginSuccess && uiConfig?.is_control_plane) {
  const workerUrl = localStorage.getItem("litellm_worker_url");
  exchangeLoginCode(ssoCode, workerUrl)
    .then(() => { /* ... */ })
    .catch(() => { setIsLoading(false); });
  return;
}
```
Type
🆕 New Feature
Circle CI
https://app.circleci.com/pipelines/github/BerriAI/litellm?branch=litellm_ryan_march_18