docs/my-website/docs/proxy/config_settings.md (1 change: 1 addition & 0 deletions)
@@ -363,6 +363,7 @@ router_settings:
| router_general_settings | RouterGeneralSettings | [SDK-Only] Router general settings - contains optimizations like 'async_only_mode'. [Docs](../routing.md#router-general-settings) |
| optional_pre_call_checks | List[str] | List of pre-call checks to add to the router. Supported: `router_budget_limiting`, `prompt_caching`, `responses_api_deployment_check`, `encrypted_content_affinity`, `deployment_affinity`, `session_affinity`, `forward_client_headers_by_model_group` |
| deployment_affinity_ttl_seconds | int | TTL (seconds) for user-key → deployment affinity mapping when `deployment_affinity` is enabled (configured at Router init / proxy startup). Defaults to `3600` (1 hour). |
| model_group_affinity_config | Dict[str, List[str]] | Per-model-group affinity flags. Keys are model group names; values are lists of checks to enable (`deployment_affinity`, `responses_api_deployment_check`, `session_affinity`). Groups not listed fall back to the global `optional_pre_call_checks`. [Docs](../response_api.md#per-model-group-affinity-configuration) |
| ignore_invalid_deployments | boolean | If true, ignores invalid deployments. Defaults to `true` on the proxy, so invalid models don't block other models from loading. |
| search_tools | List[SearchToolTypedDict] | List of search tool configurations for Search API integration. Each tool specifies a search_tool_name and litellm_params with search_provider, api_key, api_base, etc. [Further Docs](../search/index.md) |
| guardrail_list | List[GuardrailTypedDict] | List of guardrail configurations for guardrail load balancing. Enables load balancing across multiple guardrail deployments with the same guardrail_name. [Further Docs](./guardrails/guardrail_load_balancing.md) |
docs/my-website/docs/response_api.md (79 changes: 79 additions & 0 deletions)
@@ -1364,6 +1364,85 @@ litellm --config config.yaml
| `deployment_affinity` | Simple sticky sessions | All requests from same API key | ❌ Reduces quota by # of users |


## Per-Model-Group Affinity Configuration

By default, `optional_pre_call_checks` applies globally to all model groups. Use `model_group_affinity_config` when you want different affinity behavior per model group — for example, enabling stickiness only for models spread across providers (Azure + Bedrock) while leaving single-provider groups free to load-balance.

Groups not listed fall back to the global `optional_pre_call_checks` settings.

<Tabs>
<TabItem value="python-sdk" label="Python SDK">

```python
import litellm

router = litellm.Router(
model_list=[
{
"model_name": "gpt-4",
"litellm_params": {"model": "azure/gpt-4", "api_key": "...", "api_base": "https://endpoint1.openai.azure.com"},
},
{
"model_name": "gpt-4",
"litellm_params": {"model": "bedrock/anthropic.claude-v2", "aws_region_name": "us-east-1"},
},
{
"model_name": "text-embedding-ada-002",
"litellm_params": {"model": "azure/text-embedding-ada-002", "api_key": "...", "api_base": "https://endpoint1.openai.azure.com"},
},
{
"model_name": "text-embedding-ada-002",
"litellm_params": {"model": "azure/text-embedding-ada-002", "api_key": "...", "api_base": "https://endpoint2.openai.azure.com"},
},
],
# gpt-4: cross-provider (Azure + Bedrock) — enable deployment affinity
# text-embedding-ada-002: same provider — no affinity, let it load balance freely
model_group_affinity_config={
"gpt-4": ["deployment_affinity", "responses_api_deployment_check"],
},
)
```
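
With this config, repeated `gpt-4` calls from the same caller should be pinned to one deployment, while `text-embedding-ada-002` keeps load-balancing. A minimal sketch of what that looks like, assuming the caller's key identity travels in request metadata (the exact `user_api_key_hash` field name is an assumption, borrowed from what the proxy normally sets):

```python
import asyncio

async def main():
    # The first call picks a deployment and caches a key -> deployment mapping;
    # later calls carrying the same key metadata should hit the same deployment.
    for _ in range(2):
        resp = await router.acompletion(
            model="gpt-4",
            messages=[{"role": "user", "content": "hello"}],
            metadata={"user_api_key_hash": "key-123"},  # assumed field name
        )
        # model_id identifies the chosen deployment; expect it to repeat.
        print(resp._hidden_params.get("model_id"))

asyncio.run(main())
```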

</TabItem>
<TabItem value="proxy-server" label="Proxy Server">

```yaml title="config.yaml"
model_list:
- model_name: gpt-4
litellm_params:
model: azure/gpt-4
api_key: os.environ/AZURE_API_KEY_1
api_base: https://endpoint1.openai.azure.com

- model_name: gpt-4
litellm_params:
model: bedrock/anthropic.claude-v2
aws_region_name: us-east-1

- model_name: text-embedding-ada-002
litellm_params:
model: azure/text-embedding-ada-002
api_key: os.environ/AZURE_API_KEY_1
api_base: https://endpoint1.openai.azure.com

- model_name: text-embedding-ada-002
litellm_params:
model: azure/text-embedding-ada-002
api_key: os.environ/AZURE_API_KEY_2
api_base: https://endpoint2.openai.azure.com

router_settings:
# gpt-4: cross-provider — enable stickiness
# text-embedding-ada-002: not listed — load balances freely
model_group_affinity_config:
"gpt-4":
- deployment_affinity
- responses_api_deployment_check
```

</TabItem>
</Tabs>

**Supported values:** `deployment_affinity`, `responses_api_deployment_check`, `session_affinity`
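
To see the `gpt-4` group's Responses API stickiness end to end, here is a hedged sketch against the proxy config above, using the OpenAI SDK (the base URL and API key are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")  # LiteLLM proxy

first = client.responses.create(model="gpt-4", input="Start a story about a lighthouse.")

# `responses_api_deployment_check` is enabled for the gpt-4 group, so this
# follow-up should be routed to the deployment holding the original response state.
follow_up = client.responses.create(
    model="gpt-4",
    previous_response_id=first.id,
    input="Continue the story.",
)
print(follow_up.output_text)
```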

## Calling non-Responses API endpoints (`/responses` to `/chat/completions` Bridge)

LiteLLM allows you to call non-Responses API models via a bridge to LiteLLM's `/chat/completions` endpoint. This is useful for calling Anthropic, Gemini, and even non-Responses API OpenAI models.
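
For example, the following sketch calls an Anthropic model through the proxy's `/responses` endpoint even though Anthropic has no native Responses API; LiteLLM translates the call to `/chat/completions` internally (the model alias and proxy address are illustrative):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="sk-1234")  # LiteLLM proxy

# Bridged call: /responses in, /chat/completions out.
resp = client.responses.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    input="Summarize Moby-Dick in two sentences.",
)
print(resp.output_text)
```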
litellm/router.py (29 changes: 29 additions & 0 deletions)
@@ -301,6 +301,7 @@ def __init__( # noqa: PLR0915
RouterGeneralSettings
] = RouterGeneralSettings(),
deployment_affinity_ttl_seconds: int = 3600,
model_group_affinity_config: Optional[Dict[str, List[str]]] = None,
ignore_invalid_deployments: bool = False,
) -> None:
"""
@@ -641,6 +642,9 @@ def __init__( # noqa: PLR0915
self.model_group_retry_policy: Optional[
Dict[str, RetryPolicy]
] = model_group_retry_policy
self.model_group_affinity_config: Optional[
Dict[str, List[str]]
] = model_group_affinity_config

self.allowed_fails_policy: Optional[AllowedFailsPolicy] = None
if allowed_fails_policy is not None:
@@ -661,6 +665,26 @@ def __init__( # noqa: PLR0915
if optional_pre_call_checks is not None:
self.add_optional_pre_call_checks(optional_pre_call_checks)

# If model_group_affinity_config is set but no global affinity checks were
# enabled, we still need the DeploymentAffinityCheck callback (with global
# flags all False) so per-group config can activate affinity per model group.
if self.model_group_affinity_config and not any(
isinstance(cb, DeploymentAffinityCheck)
for cb in (self.optional_callbacks or [])
):
if self.optional_callbacks is None:
self.optional_callbacks = []
affinity_callback = DeploymentAffinityCheck(
cache=self.cache,
ttl_seconds=self.deployment_affinity_ttl_seconds,
enable_user_key_affinity=False,
enable_responses_api_affinity=False,
enable_session_id_affinity=False,
model_group_affinity_config=self.model_group_affinity_config,
)
self.optional_callbacks.append(affinity_callback)
litellm.logging_callback_manager.add_litellm_callback(affinity_callback)

if self.alerting_config is not None:
self._initialize_alerting()

@@ -1311,13 +1335,18 @@ def add_optional_pre_call_checks(
existing_affinity_callback.ttl_seconds = (
self.deployment_affinity_ttl_seconds
)
if self.model_group_affinity_config:
existing_affinity_callback.model_group_affinity_config = (
self.model_group_affinity_config
)
else:
affinity_callback = DeploymentAffinityCheck(
cache=self.cache,
ttl_seconds=self.deployment_affinity_ttl_seconds,
enable_user_key_affinity=enable_user_key_affinity,
enable_responses_api_affinity=enable_responses_api_affinity,
enable_session_id_affinity=enable_session_id_affinity,
model_group_affinity_config=self.model_group_affinity_config,
)
self.optional_callbacks.append(affinity_callback)
litellm.logging_callback_manager.add_litellm_callback(affinity_callback)
litellm/router_utils/pre_call_checks/deployment_affinity_check.py (103 changes: 75 additions & 28 deletions)
@@ -13,7 +13,7 @@
"""

import hashlib
from typing import Any, Dict, List, Optional, cast
from typing import Any, Dict, List, Optional, Tuple, cast

from typing_extensions import TypedDict

@@ -38,6 +38,9 @@ class DeploymentAffinityCheck(CustomLogger):
"""

CACHE_KEY_PREFIX = "deployment_affinity:v1"
VALID_FLAGS = frozenset(
{"deployment_affinity", "responses_api_deployment_check", "session_affinity"}
)

def __init__(
self,
@@ -46,13 +49,49 @@ def __init__(
enable_user_key_affinity: bool,
enable_responses_api_affinity: bool,
enable_session_id_affinity: bool = False,
model_group_affinity_config: Optional[Dict[str, List[str]]] = None,
):
super().__init__()
self.cache = cache
self.ttl_seconds = ttl_seconds
self.enable_user_key_affinity = enable_user_key_affinity
self.enable_responses_api_affinity = enable_responses_api_affinity
self.enable_session_id_affinity = enable_session_id_affinity
self.model_group_affinity_config: Dict[str, List[str]] = (
model_group_affinity_config or {}
)
for group, flags in self.model_group_affinity_config.items():
unknown = set(flags) - self.VALID_FLAGS
if unknown:
verbose_router_logger.warning(
"DeploymentAffinityCheck: unknown flag(s) %s for model group '%s'; will be ignored. Valid flags: %s",
unknown,
group,
self.VALID_FLAGS,
)

def _get_effective_flags(
self, model_group: str
) -> Tuple[bool, bool, bool]:
"""
Return (enable_user_key_affinity, enable_responses_api_affinity, enable_session_id_affinity)
for the given model group.

If the model group has an explicit entry in model_group_affinity_config, use it.
Otherwise fall back to the global instance flags.
"""
group_checks = self.model_group_affinity_config.get(model_group)
if group_checks is not None:
return (
"deployment_affinity" in group_checks,
"responses_api_deployment_check" in group_checks,
"session_affinity" in group_checks,
)
return (
self.enable_user_key_affinity,
self.enable_responses_api_affinity,
self.enable_session_id_affinity,
)
Comment on lines +73 to +94 (Contributor):

P2 No validation of per-group flag strings

_get_effective_flags silently ignores unknown flag strings in a group's list. A user typo like "deployment_affinityy" or "responses_api_check" would result in all flags being False for that group with no warning, making debugging very difficult.

Consider adding a validation step (either at init time or here) to warn about unrecognised flag names:

```python
VALID_FLAGS = {"deployment_affinity", "responses_api_deployment_check", "session_affinity"}

def _get_effective_flags(self, model_group: str) -> Tuple[bool, bool, bool]:
    group_checks = self.model_group_affinity_config.get(model_group)
    if group_checks is not None:
        unknown = set(group_checks) - VALID_FLAGS
        if unknown:
            verbose_router_logger.warning(
                "DeploymentAffinityCheck: unknown flag(s) %s for model group '%s'; will be ignored.",
                unknown,
                model_group,
            )
        return (
            "deployment_affinity" in group_checks,
            "responses_api_deployment_check" in group_checks,
            "session_affinity" in group_checks,
        )
    ...
```


@staticmethod
def _looks_like_sha256_hex(value: str) -> bool:
@@ -277,8 +316,12 @@ async def async_filter_deployments(
request_kwargs = request_kwargs or {}
typed_healthy_deployments = cast(List[dict], healthy_deployments)

enable_user_key, enable_responses_api, enable_session_id = (
self._get_effective_flags(model)
)

# 1) Responses API continuity (high priority)
if self.enable_responses_api_affinity:
if enable_responses_api:
previous_response_id = request_kwargs.get("previous_response_id")
if previous_response_id is not None:
responses_model_id = (
Expand All @@ -305,7 +348,7 @@ async def async_filter_deployments(
return typed_healthy_deployments

# 2) Session-id -> deployment affinity
if self.enable_session_id_affinity:
if enable_session_id:
session_id = self._get_session_id_from_request_kwargs(
request_kwargs=request_kwargs
)
@@ -344,7 +387,7 @@
)

# 3) User key -> deployment affinity
if not self.enable_user_key_affinity:
if not enable_user_key:
return typed_healthy_deployments

user_key = self._get_user_key_from_request_kwargs(request_kwargs=request_kwargs)
@@ -394,22 +437,45 @@ async def async_pre_call_deployment_hook(
- LiteLLM runs async success callbacks via a background logging worker for performance.
- We want affinity to be immediately available for subsequent requests.
"""
if not self.enable_user_key_affinity and not self.enable_session_id_affinity:
metadata_dicts = self._iter_metadata_dicts(kwargs)

# Extract deployment_model_name first — needed for both per-group flag resolution
# and cache key scoping.
deployment_model_name: Optional[str] = None
for metadata in metadata_dicts:
maybe_deployment_model_name = metadata.get("deployment_model_name")
if (
isinstance(maybe_deployment_model_name, str)
and maybe_deployment_model_name
):
deployment_model_name = maybe_deployment_model_name
break

if not deployment_model_name:
verbose_router_logger.debug(
"DeploymentAffinityCheck: deployment_model_name missing in metadata; skipping affinity cache update."
)
return None

# Resolve effective flags for this model group
enable_user_key, _enable_responses_api, enable_session_id = (
self._get_effective_flags(deployment_model_name)
)

if not enable_user_key and not enable_session_id:
return None

user_key = None
if self.enable_user_key_affinity:
if enable_user_key:
user_key = self._get_user_key_from_request_kwargs(request_kwargs=kwargs)

session_id = None
if self.enable_session_id_affinity:
if enable_session_id:
session_id = self._get_session_id_from_request_kwargs(request_kwargs=kwargs)

if user_key is None and session_id is None:
return None

metadata_dicts = self._iter_metadata_dicts(kwargs)

model_info = kwargs.get("model_info")
if not isinstance(model_info, dict):
model_info = None
@@ -433,25 +499,6 @@ async def async_pre_call_deployment_hook(
)
return None

# Scope affinity by the Router deployment model name (alias-safe, consistent across
# heterogeneous providers, and matches standard logging's `model_map_key`).
deployment_model_name: Optional[str] = None
for metadata in metadata_dicts:
maybe_deployment_model_name = metadata.get("deployment_model_name")
if (
isinstance(maybe_deployment_model_name, str)
and maybe_deployment_model_name
):
deployment_model_name = maybe_deployment_model_name
break

if not deployment_model_name:
verbose_router_logger.warning(
"DeploymentAffinityCheck: deployment_model_name missing; skipping affinity cache update. model_id=%s",
model_id,
)
return None

if user_key is not None:
try:
cache_key = self.get_affinity_cache_key(
litellm/types/router.py (1 change: 1 addition & 0 deletions)
@@ -77,6 +77,7 @@ class UpdateRouterConfig(BaseModel):
routing_strategy_args: Optional[dict] = None
routing_strategy: Optional[str] = None
model_group_retry_policy: Optional[dict] = None
model_group_affinity_config: Optional[Dict[str, List[str]]] = None
allowed_fails: Optional[int] = None
cooldown_time: Optional[float] = None
num_retries: Optional[int] = None