Merged
72 changes: 65 additions & 7 deletions docs/my-website/docs/providers/azure_ai/azure_model_router.md
@@ -2,6 +2,32 @@

Azure Model Router is a feature in Azure AI Foundry that automatically routes your requests to the best available model based on your requirements. This allows you to use a single endpoint that intelligently selects the optimal model for each request.

## Quick Start

**Model pattern**: `azure_ai/model_router/<deployment-name>`

```python
import litellm

response = litellm.completion(
model="azure_ai/model_router/model-router", # Replace with your deployment name
messages=[{"role": "user", "content": "Hello!"}],
api_base="https://your-endpoint.cognitiveservices.azure.com/openai/v1/",
api_key="your-api-key",
)
```

**Proxy config** (`config.yaml`):

```yaml
model_list:
- model_name: model-router
litellm_params:
model: azure_ai/model_router/model-router
api_base: https://your-endpoint.cognitiveservices.azure.com/openai/deployments/model-router/chat/completions?api-version=2025-01-01-preview
api_key: your-api-key
```

## Key Features
- **Automatic Model Selection**: Azure Model Router dynamically selects the best model for your request
@@ -229,19 +255,51 @@ Cost is tracked based on the actual model used (e.g., `gpt-4.1-nano`), plus a fl

## Cost Tracking

LiteLLM automatically handles cost tracking for Azure Model Router. Understanding how this works helps you interpret spend and debug billing.

### How LiteLLM Calculates Cost

When you use Azure Model Router, LiteLLM computes **two cost components**:

| Component | Description | When Applied |
|-----------|-------------|--------------|
| **Model Cost** | Token-based cost for the actual model that handled the request (e.g., `gpt-5-nano`, `gpt-4.1-nano`) | Always, when Azure returns the model in the response |
| **Router Flat Cost** | $0.14 per million input tokens (Azure AI Foundry infrastructure fee) | When the **request** was made via a model router endpoint |

### Cost Calculation Flow

1. **Request model detection**: LiteLLM records the model you requested (e.g., `azure_ai/model_router/model-router`). If it contains `model_router` or `model-router`, the request is treated as a router request.

2. **Response model extraction**: Azure returns the actual model used in the response (e.g., `gpt-5-nano-2025-08-07`). LiteLLM uses this for the model cost lookup.

3. **Model cost**: LiteLLM looks up the response model in its pricing table and computes cost from prompt tokens and completion tokens.

4. **Router flat cost**: Because the original request was to a model router, LiteLLM adds the flat cost ($0.14 per M input tokens) on top of the model cost.

5. **Total cost**: `Total = Model Cost + Router Flat Cost`
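
The flow above reduces to simple arithmetic. A minimal sketch — the per-token prices here are illustrative assumptions, not the values in LiteLLM's pricing table:

```python
# Illustrative sketch of the two-component calculation; model prices are assumed.
ROUTER_FLAT_COST_PER_M_INPUT_TOKENS = 0.14  # Azure AI Foundry infrastructure fee


def router_request_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_m: float,   # assumed price of the model Azure actually used
    output_price_per_m: float,
) -> float:
    """Total = Model Cost + Router Flat Cost."""
    model_cost = (
        prompt_tokens * input_price_per_m + completion_tokens * output_price_per_m
    ) / 1_000_000
    # Flat fee applies to input tokens only, because the request went via the router.
    router_flat_cost = prompt_tokens * ROUTER_FLAT_COST_PER_M_INPUT_TOKENS / 1_000_000
    return model_cost + router_flat_cost


# 10,000 prompt tokens: the flat fee alone is 10_000 * 0.14 / 1e6 = $0.0014
print(f"{router_request_cost(10_000, 5_000, 0.05, 0.40):.6f}")
```

With 10,000 prompt tokens the router fee contributes $0.0014 regardless of which model handled the request; the model cost varies with the model Azure selects.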

### Configuration Requirements

For cost tracking to work correctly:

- **Use the full pattern**: `azure_ai/model_router/<deployment-name>` (e.g., `azure_ai/model_router/model-router`)
- **Proxy config**: When using the LiteLLM proxy, set `model` in `litellm_params` to the full pattern so the request model is correctly identified as a router

```yaml
# proxy_server_config.yaml
model_list:
- model_name: model-router
litellm_params:
model: azure_ai/model_router/model-router # Required for router cost detection
api_base: https://your-endpoint.cognitiveservices.azure.com/openai/deployments/model-router/chat/completions?api-version=2025-01-01-preview
api_key: your-api-key
```
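
The detection that makes this requirement matter can be sketched as a plain substring check — an illustration of the rule described above, not LiteLLM's internal function:

```python
def is_model_router_request(model: str) -> bool:
    # A request is treated as a router request when the model string
    # contains "model_router" or "model-router".
    return "model_router" in model or "model-router" in model


print(is_model_router_request("azure_ai/model_router/model-router"))  # True
print(is_model_router_request("gpt-5-nano-2025-08-07"))               # False
```

This is why the proxy's `model` must use the full `azure_ai/model_router/<deployment-name>` pattern: the response model alone (e.g., `gpt-5-nano-2025-08-07`) would not trigger the flat-cost component.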

### Cost Breakdown

When you use Azure Model Router, the total cost includes:

- **Model Cost**: Based on the actual model that handled your request (e.g., `gpt-5-nano`, `gpt-4.1-nano`)
- **Router Flat Cost**: $0.14 per million input tokens (Azure AI Foundry infrastructure fee)

### Example Response with Cost
10 changes: 9 additions & 1 deletion litellm/cost_calculator.py
@@ -272,6 +272,8 @@ def cost_per_token(  # noqa: PLR0915
### SERVICE TIER ###
service_tier: Optional[str] = None, # for OpenAI service tier pricing
response: Optional[Any] = None,
### REQUEST MODEL ###
request_model: Optional[str] = None, # original request model for router detection
) -> Tuple[float, float]: # type: ignore
"""
Calculates the cost per token for a given model, prompt tokens, and completion tokens.
@@ -520,7 +522,7 @@ def cost_per_token(  # noqa: PLR0915
return dashscope_cost_per_token(model=model, usage=usage_block)
elif custom_llm_provider == "azure_ai":
return azure_ai_cost_per_token(
model=model, usage=usage_block, response_time_ms=response_time_ms, request_model=request_model
)
else:
model_info = _cached_get_model_info_helper(
@@ -1457,6 +1459,11 @@ def completion_cost(  # noqa: PLR0915
text=completion_string
)

# Get the original request model for router detection
request_model_for_cost = None
if litellm_logging_obj is not None:
request_model_for_cost = litellm_logging_obj.model

(
prompt_tokens_cost_usd_dollar,
completion_tokens_cost_usd_dollar,
@@ -1479,6 +1486,7 @@
rerank_billed_units=rerank_billed_units,
service_tier=service_tier,
response=completion_response,
request_model=request_model_for_cost,
)

# Get additional costs from provider (e.g., routing fees, infrastructure costs)
30 changes: 21 additions & 9 deletions litellm/llms/azure_ai/cost_calculator.py
@@ -61,7 +61,10 @@ def calculate_azure_model_router_flat_cost(model: str, prompt_tokens: int) -> fl


def cost_per_token(
model: str,
usage: Usage,
response_time_ms: Optional[float] = 0.0,
request_model: Optional[str] = None,
) -> Tuple[float, float]:
"""
Calculate the cost per token for Azure AI models.
@@ -71,9 +74,10 @@
- Plus the cost of the actual model used (handled by generic_cost_per_token)

Args:
model: str, the model name without provider prefix
model: str, the model name without provider prefix (from response)
usage: LiteLLM Usage block
response_time_ms: Optional response time in milliseconds
request_model: Optional[str], the original request model name (to detect router usage)

Returns:
Tuple[float, float] - prompt_cost_in_usd, completion_cost_in_usd
@@ -84,7 +88,13 @@
"""
prompt_cost = 0.0
completion_cost = 0.0


# Determine if this was a model router request
# Check both the response model and the request model
is_router_request = _is_azure_model_router(model) or (
request_model is not None and _is_azure_model_router(request_model)
)

# Calculate base cost using generic cost calculator
# This may raise an exception if the model is not in the cost map
try:
@@ -103,19 +113,21 @@
verbose_logger.debug(
f"Azure AI Model Router: model '{model}' not in cost map, calculating routing flat cost only. Error: {e}"
)

# Add flat cost for Azure Model Router
# The flat cost is defined in model_prices_and_context_window.json for azure_ai/model_router

if is_router_request:
# Use the request model for flat cost calculation if available, otherwise use response model
router_model_for_calc = request_model if request_model else model
router_flat_cost = calculate_azure_model_router_flat_cost(router_model_for_calc, usage.prompt_tokens)

if router_flat_cost > 0:
verbose_logger.debug(
f"Azure AI Model Router flat cost: ${router_flat_cost:.6f} "
f"({usage.prompt_tokens} tokens × ${router_flat_cost / usage.prompt_tokens:.9f}/token)"
)

# Add flat cost to prompt cost
prompt_cost += router_flat_cost

return prompt_cost, completion_cost
7 changes: 7 additions & 0 deletions litellm/model_prices_and_context_window_backup.json
@@ -20632,6 +20632,7 @@
"supports_tool_choice": true,
"supports_service_tier": true,
"supports_vision": true,
"supports_web_search": true,
"supports_none_reasoning_effort": true,
"supports_xhigh_reasoning_effort": false
},
@@ -20670,6 +20671,7 @@
"supports_tool_choice": true,
"supports_service_tier": true,
"supports_vision": true,
"supports_web_search": true,
"supports_none_reasoning_effort": true,
"supports_xhigh_reasoning_effort": false
},
@@ -20707,6 +20709,7 @@
"supports_system_messages": true,
"supports_tool_choice": false,
"supports_vision": true,
"supports_web_search": true,
"supports_none_reasoning_effort": true,
"supports_xhigh_reasoning_effort": false
},
@@ -20746,6 +20749,7 @@
"supports_tool_choice": true,
"supports_service_tier": true,
"supports_vision": true,
"supports_web_search": true,
"supports_none_reasoning_effort": true,
"supports_xhigh_reasoning_effort": true
},
@@ -20785,6 +20789,7 @@
"supports_tool_choice": true,
"supports_service_tier": true,
"supports_vision": true,
"supports_web_search": true,
"supports_none_reasoning_effort": true,
"supports_xhigh_reasoning_effort": true
},
@@ -20821,6 +20826,7 @@
"supports_system_messages": true,
"supports_tool_choice": true,
"supports_vision": true,
"supports_web_search": true,
"supports_none_reasoning_effort": false,
"supports_xhigh_reasoning_effort": false
},
@@ -20857,6 +20863,7 @@
"supports_system_messages": true,
"supports_tool_choice": true,
"supports_vision": true,
"supports_web_search": true,
"supports_none_reasoning_effort": false,
"supports_xhigh_reasoning_effort": false
},
38 changes: 38 additions & 0 deletions tests/test_litellm/llms/azure_ai/test_azure_ai_cost_calculator.py
@@ -196,6 +196,44 @@ def test_model_router_with_cached_tokens(self):
)
print(f"Total prompt cost: ${prompt_cost:.6f}")

def test_router_flat_cost_when_response_has_actual_model(self):
"""
Test that router flat cost is added when request was via router but response
contains the actual model (e.g., gpt-5-nano).

This is the key fix: Azure returns the actual model in the response, but we
must still add the router flat cost because the request was made via model router.
"""
usage = Usage(
prompt_tokens=10000,
completion_tokens=5000,
total_tokens=15000,
)

# Response model is the actual model Azure used (not a router name)
response_model = "gpt-5-nano-2025-08-07"
# Request model is the router - user called azure_ai/model_router/model-router
request_model = "azure_ai/model_router/model-router"

prompt_cost, completion_cost = cost_per_token(
model=response_model,
usage=usage,
request_model=request_model,
)

# Expected: model cost (from gpt-5-nano) + router flat cost
expected_flat_cost = (
usage.prompt_tokens * AZURE_MODEL_ROUTER_FLAT_COST_PER_M_INPUT_TOKENS / 1_000_000
)
assert expected_flat_cost == pytest.approx(0.0014, rel=1e-9)

# Total cost should be model cost + flat cost
total_cost = prompt_cost + completion_cost
assert total_cost >= expected_flat_cost

# Prompt cost should include both model prompt cost and router flat cost
assert prompt_cost >= expected_flat_cost


class TestAzureModelRouterCostBreakdown:
"""Test that Azure Model Router flat cost is tracked in cost breakdown."""