fix: SSO PKCE support fails in multi-pod Kubernetes deployments #20314
Harshit28j merged 10 commits into BerriAI:main from fix/sso_PKCE_deployments
Conversation
Greptile Overview

Greptile Summary: Fixes PKCE SSO login failures in multi-pod Kubernetes deployments by using Redis instead of the in-memory cache for the PKCE code_verifier.
Confidence Score: 5/5
| Filename | Overview |
|---|---|
| litellm/proxy/management_endpoints/ui_sso.py | Adds Redis-backed PKCE verifier storage for multi-pod SSO. When redis_usage_cache is available, uses it instead of in-memory cache, fixing ~50% login failures in multi-pod deployments. Fallback to in-memory cache preserved for single-pod setups. |
| tests/test_litellm/proxy/management_endpoints/test_ui_sso.py | Adds comprehensive PKCE tests: Redis multi-pod roundtrip, in-memory fallback when Redis unavailable, and no-state edge case. Tests verify verifiers are stored/retrieved correctly and consumed after use (single-use token). |
Sequence Diagram
sequenceDiagram
participant User
participant Pod_A as Pod A (Login)
participant Redis
participant InMemory as In-Memory Cache
participant Pod_B as Pod B (Callback)
participant IdP as Identity Provider
Note over User,IdP: Multi-Pod Deployment (With Redis)
User->>Pod_A: GET /sso/generic/login
Pod_A->>Pod_A: Generate code_verifier (PKCE)
Pod_A->>Pod_A: Generate code_challenge
alt redis_usage_cache is not None
Pod_A->>Redis: set_cache("pkce_verifier:state", json.dumps(code_verifier), ttl=600)
Note over Redis: Verifier stored in Redis<br/>(accessible by all pods)
else redis_usage_cache is None
Pod_A->>InMemory: set_cache("pkce_verifier:state", code_verifier, ttl=600)
Note over InMemory: Verifier stored in pod-local memory<br/>(single-pod only)
end
Pod_A->>User: 302 Redirect to IdP with code_challenge
User->>IdP: Authorization request with code_challenge
IdP->>User: 302 Redirect to callback with code
User->>Pod_B: GET /sso/callback?code=...&state=...
Pod_B->>Pod_B: Extract state from query params
alt redis_usage_cache is not None
Pod_B->>Redis: get_cache("pkce_verifier:state")
Redis->>Pod_B: json.loads(stored_value) → code_verifier
Pod_B->>Redis: delete_cache("pkce_verifier:state")
Note over Redis: ✅ Multi-pod: Pod B retrieves<br/>verifier stored by Pod A
else redis_usage_cache is None
Pod_B->>InMemory: get_cache("pkce_verifier:state")
InMemory->>Pod_B: code_verifier (if same pod)
Pod_B->>InMemory: delete_cache("pkce_verifier:state")
Note over InMemory: ⚠️ Multi-pod: ~50% failure<br/>if callback hits different pod
end
Pod_B->>IdP: Token exchange with code + code_verifier
IdP->>Pod_B: Access token + user info
Pod_B->>User: SSO login success
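For reference, the verifier/challenge pair generated in the login step of the diagram follows RFC 7636's S256 method. A minimal self-contained sketch (the function name is illustrative, not LiteLLM's actual helper):

```python
import base64
import hashlib
import secrets


def generate_pkce_pair():
    """Return an (code_verifier, code_challenge) pair per RFC 7636 S256."""
    # 32 random bytes -> 43-character base64url string once padding is stripped
    code_verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    # Challenge is the base64url-encoded SHA-256 digest of the verifier
    digest = hashlib.sha256(code_verifier.encode()).digest()
    code_challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return code_verifier, code_challenge


verifier, challenge = generate_pkce_pair()
print(len(verifier))   # 43
print(len(challenge))  # 43
```

Only the challenge leaves the pod in the redirect; the verifier must survive until the callback, which is exactly the cross-pod storage problem this PR addresses.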
@greptile please review this
Greptile Overview

Greptile Summary: Fixes PKCE SSO flow in multi-pod Kubernetes deployments by storing the PKCE code_verifier in Redis.
Confidence Score: 2/5
| Filename | Overview |
|---|---|
| litellm/proxy/management_endpoints/ui_sso.py | Core PKCE fix: uses Redis for cross-pod code_verifier storage. Logic is correct but has unnecessary json.dumps() causing double serialization (works accidentally). Also includes formatting cleanups and async conversion of prepare_token_exchange_parameters. |
| tests/test_litellm/proxy/management_endpoints/test_ui_sso.py | Three new PKCE tests have critical issues: missing await on now-async prepare_token_exchange_parameters, and mock classes define sync methods while production code calls async variants. Tests will either error or silently pass without validating the intended behavior. |
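The sync-vs-async mock mismatch flagged here is a common pytest pitfall: a plain MagicMock silently absorbs calls to auto-created attributes, while awaiting one raises at runtime. A minimal sketch with unittest.mock (the cache method name mirrors the diagrams; the helper function is hypothetical):

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def read_verifier(cache, state: str):
    # Production-style path: awaits the async cache variant.
    return await cache.async_get_cache(key=f"pkce_verifier:{state}")


# A bare MagicMock auto-creates async_get_cache as a *sync* mock; awaiting
# its return value raises TypeError, so the test errors out (or, if the test
# only asserts on a sync get_cache, it passes without exercising the real
# code path). Declaring the attribute as AsyncMock makes it awaitable:
cache = MagicMock()
cache.async_get_cache = AsyncMock(return_value="verifier-123")

print(asyncio.run(read_verifier(cache, "abc")))  # verifier-123
cache.async_get_cache.assert_awaited_once_with(key="pkce_verifier:abc")
```

This is the class of bug the later review rounds report as fixed in the tests.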
Sequence Diagram
sequenceDiagram
participant User
participant PodA as Pod A (Login)
participant Redis as Redis Cache
participant PodB as Pod B (Callback)
participant IdP as SSO Provider
User->>PodA: Initiate SSO login
PodA->>PodA: Generate PKCE verifier and challenge
PodA->>Redis: Store verifier with state as cache key
PodA->>User: Redirect to IdP with challenge
User->>IdP: Authenticate with provider
IdP->>User: Redirect back with authorization code
User->>PodB: Callback lands on different pod
PodB->>Redis: Retrieve verifier using state
Redis-->>PodB: Return verifier
PodB->>IdP: Token exchange with verifier
IdP-->>PodB: Return access token
PodB->>Redis: Delete consumed verifier
PodB->>User: SSO login complete
Last reviewed commit: 1792b3c
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Greptile Overview

Greptile Summary: Fixes PKCE SSO flow in multi-pod Kubernetes deployments by storing the PKCE code_verifier in Redis.
Confidence Score: 2/5
| Filename | Overview |
|---|---|
| litellm/proxy/management_endpoints/ui_sso.py | Core fix is sound: PKCE code_verifier now uses Redis when available for multi-pod SSO. Correctly uses async cache methods. Formatting/whitespace cleanup included. |
| tests/test_litellm/proxy/management_endpoints/test_ui_sso.py | Three new tests have critical bugs: missing await on async prepare_token_exchange_parameters, and mock objects define sync methods instead of the async methods the production code actually calls. Tests will fail or silently pass without exercising real code paths. |
Sequence Diagram
sequenceDiagram
participant User
participant PodA as Pod A (Login)
participant Redis as Redis Cache
participant PodB as Pod B (Callback)
participant IdP as SSO IdP
User->>PodA: GET /sso/login
PodA->>PodA: Generate PKCE code_verifier + code_challenge
PodA->>Redis: async_set_cache(pkce_verifier:{state}, verifier, ttl=600)
PodA->>User: Redirect to IdP (with code_challenge, state)
User->>IdP: Authenticate
IdP->>PodB: Callback with auth code + state
PodB->>Redis: async_get_cache(pkce_verifier:{state})
Redis-->>PodB: code_verifier
PodB->>IdP: Token exchange (code + code_verifier)
IdP-->>PodB: Access token
PodB->>Redis: async_delete_cache(pkce_verifier:{state})
PodB->>User: SSO login complete
Last reviewed commit: bc5543c
…28j/litellm into fix/sso_PKCE_deployments
Greptile Overview

Greptile Summary: Fixes PKCE SSO flow in multi-pod Kubernetes deployments by storing the PKCE code_verifier in Redis.
Confidence Score: 3/5
| Filename | Overview |
|---|---|
| litellm/proxy/management_endpoints/ui_sso.py | Core PKCE fix: stores code_verifier in Redis when available for multi-pod SSO. Changed sync cache ops to async, made prepare_token_exchange_parameters async. Logic is correct. Includes formatting cleanups. |
| tests/test_litellm/proxy/management_endpoints/test_ui_sso.py | Three new PKCE tests added. Tests now correctly use async/await and async mock methods (addressing earlier review feedback). However, MockRedisCache doesn't faithfully replicate real RedisCache serialization, and one assertion will fail at runtime. |
Sequence Diagram
sequenceDiagram
participant User
participant PodA as Pod A (Login)
participant Redis as Redis Cache
participant PodB as Pod B (Callback)
participant IdP as SSO Identity Provider
User->>PodA: GET /sso/login
PodA->>PodA: Generate PKCE code_verifier + code_challenge
PodA->>Redis: async_set_cache(pkce_verifier:{state}, code_verifier, ttl=600)
PodA->>User: Redirect to IdP with code_challenge + state
User->>IdP: Authenticate
IdP->>User: Redirect to callback with auth code + state
User->>PodB: GET /sso/callback?code=...&state=...
PodB->>Redis: async_get_cache(pkce_verifier:{state})
Redis-->>PodB: code_verifier
PodB->>Redis: async_delete_cache(pkce_verifier:{state})
PodB->>IdP: Exchange code + code_verifier for token
IdP-->>PodB: Access token
PodB->>User: Set session cookie, redirect to dashboard
Last reviewed commit: e084638
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Greptile Overview

Greptile Summary: This PR fixes PKCE SSO flow failures in multi-pod Kubernetes deployments by storing the PKCE code_verifier in Redis.
Confidence Score: 3/5
| Filename | Overview |
|---|---|
| litellm/proxy/management_endpoints/ui_sso.py | Core fix: PKCE code_verifier storage switched from in-memory-only to Redis-backed (when available) for multi-pod support, with proper async method usage. Also includes formatting cleanup and the Literal import addition. |
| tests/test_litellm/proxy/management_endpoints/test_ui_sso.py | Three new PKCE tests added with proper async/await usage. MockRedisCache now uses async methods with JSON serialization. One test assertion (len(stored_value) == 43) will fail because it checks the JSON-serialized length instead of the deserialized value length. |
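The failing length check called out here trips on serialization: a standard PKCE verifier (base64url of 32 random bytes) is 43 characters, but json.dumps wraps it in quotes, giving 45. A quick illustration (the fixed 43-character value is a stand-in for a real verifier):

```python
import json

code_verifier = "a" * 43                  # stand-in for a 43-char PKCE verifier
stored_value = json.dumps(code_verifier)  # what a JSON-serializing cache holds

print(len(code_verifier))             # 43
print(len(stored_value))              # 45 -- the two quote characters count
print(len(json.loads(stored_value)))  # 43 -- assert on the deserialized value
```

Asserting on the deserialized value, not the serialized string, avoids the off-by-two failure.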
Sequence Diagram
sequenceDiagram
participant User
participant PodA as Pod A (Login)
participant Redis as Redis Cache
participant PodB as Pod B (Callback)
participant IdP as SSO Identity Provider
User->>PodA: GET /sso/login
PodA->>PodA: Generate PKCE code_verifier + code_challenge
PodA->>Redis: async_set_cache(pkce_verifier:{state}, code_verifier, ttl=600)
PodA->>User: Redirect to IdP with code_challenge & state
User->>IdP: Authenticate
IdP->>User: Redirect to callback with auth code & state
User->>PodB: GET /sso/callback?code=...&state=...
PodB->>Redis: async_get_cache(pkce_verifier:{state})
Redis-->>PodB: code_verifier
PodB->>Redis: async_delete_cache(pkce_verifier:{state})
PodB->>IdP: Token exchange with code + code_verifier
IdP-->>PodB: Access token
PodB->>User: Set JWT cookie, redirect to dashboard
Last reviewed commit: c56bbb9
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
@greptile please review this?
Greptile Overview

Greptile Summary: This PR fixes PKCE SSO failures in multi-pod Kubernetes deployments by switching PKCE code_verifier storage to Redis when available, with an in-memory fallback.
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| litellm/proxy/management_endpoints/ui_sso.py | Core PKCE fix: stores/retrieves code_verifier via Redis when available, falls back to in-memory cache. Async conversion of prepare_token_exchange_parameters() is correct. Serialization roundtrip is sound (RedisCache handles json.dumps/loads internally). Minor formatting changes. |
| tests/test_litellm/proxy/management_endpoints/test_ui_sso.py | Three new PKCE tests (Redis roundtrip, in-memory fallback, no-state). Existing tests updated for async. MockRedisCache correctly replicates json.dumps/loads behavior. Large amount of formatting-only changes. |
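The earlier round's "double serialization (works accidentally)" finding can be made concrete. Assuming, per the review comments, that RedisCache calls json.dumps on write and json.loads on read internally, a caller that pre-serializes still roundtrips, just one layer deeper than intended. This toy cache is purely illustrative:

```python
import json


def cache_set(store: dict, key: str, value) -> None:
    store[key] = json.dumps(value)   # cache layer serializes internally


def cache_get(store: dict, key: str):
    return json.loads(store[key])    # ...and deserializes on read


verifier = "example-code-verifier"
store = {}

# Double serialization: the caller dumps, the cache dumps again, so the
# caller must call loads once more after the cache's own loads. It works,
# but only because both layers happen to be symmetric.
cache_set(store, "pkce_verifier:state", json.dumps(verifier))
assert json.loads(cache_get(store, "pkce_verifier:state")) == verifier

# Cleaner: pass the raw value and let the cache layer own serialization.
cache_set(store, "pkce_verifier:state", verifier)
assert cache_get(store, "pkce_verifier:state") == verifier
```

Dropping the caller-side json.dumps/loads, as the final revision does, leaves a single well-defined serialization boundary.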
Sequence Diagram
sequenceDiagram
participant User
participant PodA as Pod A (Login)
participant PodB as Pod B (Callback)
participant Redis as Redis Cache
participant IdP as SSO Identity Provider
User->>PodA: GET /sso/key/generate
PodA->>PodA: Generate PKCE code_verifier + code_challenge
PodA->>Redis: async_set_cache(pkce_verifier:{state}, code_verifier, ttl=600)
PodA->>User: Redirect to IdP with code_challenge + state
User->>IdP: Authenticate
IdP->>User: Redirect to /sso/callback?state=...&code=...
User->>PodB: GET /sso/callback?state=...&code=...
PodB->>PodB: prepare_token_exchange_parameters()
PodB->>Redis: async_get_cache(pkce_verifier:{state})
Redis-->>PodB: code_verifier
PodB->>Redis: async_delete_cache(pkce_verifier:{state})
PodB->>IdP: Token exchange with code + code_verifier
IdP-->>PodB: Access token
PodB->>User: JWT cookie + redirect to dashboard
Last reviewed commit: 37fc4a3
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Relevant issues

Fixes PKCE SSO flow in multi-pod setups: when `GENERIC_CLIENT_USE_PKCE=true`, the code_verifier was stored only in the pod's in-memory cache, so callbacks landing on another pod could not retrieve it (~50% login failures). This change uses Redis when available so any replica can complete the token exchange.

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

- Added testing in the `tests/litellm/` directory (adding at least 1 test is a hard requirement - see details)
- `make test-unit` passes

CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Type
🐛 Bug Fix
✅ Test
Changes
Problem: With `GENERIC_CLIENT_USE_PKCE=true`, the PKCE `code_verifier` was stored only in the pod's in-memory `user_api_key_cache`. In multi-pod deployments the callback often hit a different pod, which had no verifier, causing intermittent SSO failures (~50% of logins).

Fix: Use the existing Redis-backed `redis_usage_cache` for the PKCE verifier when it is configured, so any replica can complete the token exchange.

`litellm/proxy/management_endpoints/ui_sso.py`
- `get_generic_sso_redirect_response()`: When storing the code_verifier, use `redis_usage_cache.set_cache(...)` if `redis_usage_cache is not None`, otherwise keep using `user_api_key_cache.set_cache(...)`.
- `prepare_token_exchange_parameters()`: When retrieving/deleting the code_verifier, use `redis_usage_cache.get_cache` / `delete_cache` if Redis is available, otherwise keep using `user_api_key_cache`.
- Typed `token_params` as `Dict[str, Any]` for mypy.

`tests/test_litellm/proxy/management_endpoints/test_ui_sso.py`
- `test_pkce_redis_multi_pod_verifier_roundtrip`: Fixed assertion (assert on `mock_redis._store` instead of `.assert_called_once()` on real methods). Ensures the Redis path is used when available and in-memory is not.
- `test_pkce_fallback_in_memory_roundtrip_when_redis_none`: New test for the fallback when Redis is not configured (in-memory roundtrip on the same pod).
- `test_pkce_prepare_token_exchange_returns_nothing_when_no_state`: New test for a request with no state (no cache access, no code_verifier).
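The store/retrieve logic described above can be sketched roughly as follows. The cache objects and async method names are taken from the PR's sequence diagrams, so treat this as an illustration rather than the exact LiteLLM code:

```python
from typing import Any, Optional

PKCE_TTL_SECONDS = 600  # 10-minute window for the login roundtrip


def _pkce_key(state: str) -> str:
    return f"pkce_verifier:{state}"


async def store_pkce_verifier(
    state: str, code_verifier: str, redis_usage_cache: Any, user_api_key_cache: Any
) -> None:
    # Prefer Redis so any replica can read the verifier back on callback;
    # fall back to the pod-local in-memory cache for single-pod setups.
    cache = redis_usage_cache if redis_usage_cache is not None else user_api_key_cache
    await cache.async_set_cache(_pkce_key(state), code_verifier, ttl=PKCE_TTL_SECONDS)


async def pop_pkce_verifier(
    state: str, redis_usage_cache: Any, user_api_key_cache: Any
) -> Optional[str]:
    cache = redis_usage_cache if redis_usage_cache is not None else user_api_key_cache
    code_verifier = await cache.async_get_cache(_pkce_key(state))
    if code_verifier is not None:
        # Single-use token: delete immediately so it cannot be replayed.
        await cache.async_delete_cache(_pkce_key(state))
    return code_verifier
```

Keying by the OAuth `state` parameter is what lets a callback on Pod B find the verifier written by Pod A, and the delete-after-read keeps the verifier single-use.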