
fix: preserve deployment custom pricing for /responses and /messages#23239

Closed
DmitriyAlergant wants to merge 17 commits into BerriAI:main from DmitriyAlergant:fix-custom-pricing

Conversation

@DmitriyAlergant
Contributor

Relevant issues

#23185

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
    All good aside from 7 failures caused by the responses package shadowing (AttributeError: module 'responses' has no attribute 'activate'); this is a pre-existing issue
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • [will request] I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

Custom pricing from model_info config was ignored for /v1/responses (streaming) and /v1/messages (all modes) endpoints, causing cost calculations to use standard pricing instead. Regression introduced by: PR #20679 (commit 1ee43b11de). See repro steps in the issue linked above.

@vercel

vercel bot commented Mar 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: litellm | Deployment: Ready | Actions: Preview, Comment | Updated (UTC): Mar 14, 2026 6:16pm


@DmitriyAlergant
Contributor Author

@greptileai please review

@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR fixes a regression introduced by PR #20679 where custom pricing from model_info config was silently ignored for /v1/responses (streaming) and /v1/messages (all modes) endpoints. The root cause was that the Router places deployment model_info in kwargs["litellm_metadata"]["model_info"], but both affected code paths either dropped litellm_metadata entirely or used a broken or-chain (metadata or kwargs.get("litellm_metadata") or {}) that silently discarded the router-injected metadata when user metadata was non-empty.
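The or-chain failure mode described above can be reproduced in isolation. A hypothetical sketch (the function name is illustrative, not actual LiteLLM code): a truthy user `metadata` short-circuits the chain, so the router-injected `litellm_metadata` (carrying `model_info` with the deployment's custom pricing) never reaches the callbacks.

```python
def broken_pick(metadata, litellm_metadata):
    # The pre-fix pattern: first truthy value wins, rest are discarded.
    return metadata or litellm_metadata or {}

user_metadata = {"trace_id": "abc"}  # non-empty user-supplied metadata
router_metadata = {
    "model_info": {"id": "deploy-123", "input_cost_per_token": 0.01}
}

chosen = broken_pick(user_metadata, router_metadata)
assert "model_info" not in chosen  # deployment pricing silently dropped
```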

Key changes:

  • Adds get_model_info_from_litellm_params to core_helpers.py: reads model_info from three lookup locations in priority order, using a sentinel (_MISSING_MODEL_INFO) to distinguish an absent key from an explicitly empty {}.
  • Adds merge_metadata_preserving_deployment_model_info to core_helpers.py: merges user-supplied metadata into litellm_metadata while ensuring the deployment's model_info is never overwritten by user-supplied values.
  • Fixes responses/main.py and llm_http_handler.py to use the merge helper when constructing metadata_for_callbacks / litellm_params["metadata"].
  • Fixes anthropic_passthrough_logging_handler.py to explicitly pass router_model_id and custom_pricing to litellm.completion_cost and to merge logging-obj metadata into kwargs["litellm_params"]["metadata"].
  • Updates all three cost calculation sites in litellm_logging.py (non-streaming, sync-streaming, async-streaming) to extract router_model_id from model_call_details["litellm_params"] via the new helper.
  • Comprehensive test coverage: unit tests for both helpers, and mock-based integration tests for all streaming modes and both endpoint types, including user-metadata passthrough and user-supplied model_info clobber scenarios.
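A minimal sketch of the two helpers' behavior (illustrative only; the real implementations live in core_helpers.py and handle more lookup locations and edge cases):

```python
_MISSING_MODEL_INFO = object()  # sentinel: distinguishes "key absent" from explicit {}

def get_model_info_sketch(litellm_params: dict) -> dict:
    """Check metadata, then top level, then litellm_metadata, in priority order."""
    metadata = litellm_params.get("metadata") or {}
    for source in (
        metadata,
        litellm_params,
        litellm_params.get("litellm_metadata") or {},
    ):
        found = source.get("model_info", _MISSING_MODEL_INFO)
        if found is not _MISSING_MODEL_INFO:
            # An explicit {} in a higher-priority location is honored,
            # not skipped via truthiness.
            return found if isinstance(found, dict) else {}
    return {}

def merge_metadata_sketch(litellm_metadata: dict, user_metadata: dict) -> dict:
    """User keys win everywhere except model_info, which the router owns."""
    merged = dict(litellm_metadata)
    merged.update(user_metadata)
    if "model_info" in litellm_metadata:
        merged["model_info"] = litellm_metadata["model_info"]
    return merged

# Explicit empty dict in the highest-priority location wins:
assert get_model_info_sketch(
    {"metadata": {"model_info": {}}, "model_info": {"id": "x"}}
) == {}

# User metadata cannot clobber the deployment's model_info:
merged = merge_metadata_sketch(
    {"model_info": {"id": "deploy-123"}},
    {"model_info": {"id": "spoofed"}, "trace_id": "t1"},
)
assert merged["model_info"]["id"] == "deploy-123"
assert merged["trace_id"] == "t1"
```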

Confidence Score: 4/5

  • Safe to merge — the fix is targeted, well-tested with mocked integration tests, and backwards-compatible; custom pricing that was previously broken for these endpoints will now work correctly.
  • The helpers are correctly implemented with a proper sentinel pattern. All three affected code paths (responses streaming, responses non-streaming, messages all modes) are fixed and covered by tests. The minor outstanding concern is that router_model_id is now injected into cost calculation for all non-streaming call types (not just responses/messages) via the broadened _process_hidden_params_and_response_cost change, though this is guarded by the existing elif response_cost is not None: pass branch for nearly all non-responses paths (which set cost via _hidden_params). The implementation is well-structured with a shared utility rather than scattered logic.
  • The litellm_logging.py changes deserve a second look to confirm no unintended cost-calculation side-effects for non-responses call types in the else branch of _process_hidden_params_and_response_cost.

Important Files Changed

Filename Overview
litellm/litellm_core_utils/core_helpers.py Adds two well-implemented helper functions: get_model_info_from_litellm_params (with sentinel pattern for explicit empty dict) and merge_metadata_preserving_deployment_model_info (preserves deployment model_info while merging user metadata). Clean, centralized solution. Minor: or {} for metadata uses falsy semantics while model_info uses sentinel, creating a subtle inconsistency (e.g., metadata=None falls through to lower-priority locations).
litellm/litellm_core_utils/litellm_logging.py Three cost calculation call sites updated to extract router_model_id from model_call_details["litellm_params"] via the new helper — covering non-streaming, sync-streaming, and async-streaming paths. The change in _process_hidden_params_and_response_cost now applies to all non-streaming call types (not just responses/messages), which is guarded by the existing elif response_cost is not None: pass branch for most non-responses paths.
litellm/llms/custom_httpx/llm_http_handler.py Fixes the /v1/messages path by using merge_metadata_preserving_deployment_model_info to build the litellm_params["metadata"] passed to update_environment_variables, ensuring model_info from litellm_metadata reaches the logging object's metadata for cost calculation.
litellm/proxy/pass_through_endpoints/llm_provider_handlers/anthropic_passthrough_logging_handler.py Explicitly passes router_model_id and custom_pricing to completion_cost for the passthrough path. Also merges logging-obj metadata into kwargs["litellm_params"]["metadata"] using the helper, preserving deployment model_info while allowing existing pipeline keys to win for non-model_info fields. Well-tested by the new unit test.
litellm/responses/main.py Core bug fix: replaces the broken metadata or kwargs.get("litellm_metadata") or {} or-chain with merge_metadata_preserving_deployment_model_info(...), ensuring litellm_metadata["model_info"] (router-injected) is never silently dropped when user metadata is also non-empty.
tests/test_litellm/test_custom_pricing_metadata_propagation.py New comprehensive test file covering unit tests for both helpers and integration tests for all responses API paths (async non-streaming, async streaming, sync streaming) and the messages API path. All HTTP calls are properly mocked. Minor: list[str] type annotation (line 143) is Python 3.9+ syntax — consistent with the rest of the codebase.
tests/test_litellm/litellm_core_utils/test_litellm_logging.py Adds unit tests verifying the sentinel-based empty-dict behavior of get_model_info_from_litellm_params and the model_info precedence of merge_metadata_preserving_deployment_model_info. Also adds integration tests for _process_hidden_params_and_response_cost and success_handler confirming router_model_id is passed to _response_cost_calculator for aresponses call types.
tests/test_litellm/proxy/pass_through_endpoints/llm_provider_handlers/test_anthropic_passthrough_logging_handler.py Adds TestAnthropicPassthroughLoggingPayload class with streaming integration tests using a mocked HTTP response and a CostCapturingCallback to verify custom pricing is applied for the passthrough streaming path. Also adds a unit test verifying the metadata merge doesn't clobber existing litellm_params keys like stream_response.

Sequence Diagram

sequenceDiagram
    participant User
    participant Router
    participant responses_main as responses/main.py
    participant llm_http as llm_http_handler.py
    participant passthrough as anthropic_passthrough_logging_handler.py
    participant logging as litellm_logging.py
    participant core_helpers as core_helpers.py

    User->>Router: aresponses(model, input, metadata={...})
    Router->>Router: _update_kwargs_with_deployment()<br/>sets kwargs["litellm_metadata"]["model_info"]
    Router->>responses_main: responses(kwargs incl. litellm_metadata)

    responses_main->>core_helpers: merge_metadata_preserving_deployment_model_info(<br/>litellm_metadata=kwargs["litellm_metadata"],<br/>user_metadata=metadata, model_info=kwargs["model_info"])
    core_helpers-->>responses_main: metadata_for_callbacks<br/>(model_info preserved from litellm_metadata)

    responses_main->>logging: update_environment_variables(<br/>litellm_params={"metadata": metadata_for_callbacks})
    Note over logging: self.litellm_params["metadata"]["model_info"] = deployment model_info

    responses_main->>llm_http: HTTP call

    alt /v1/messages path
        llm_http->>core_helpers: merge_metadata_preserving_deployment_model_info(...)
        core_helpers-->>llm_http: merged_metadata (model_info preserved)
        llm_http->>logging: update_environment_variables(litellm_params={"metadata": merged_metadata})
    end

    alt Anthropic passthrough path
        passthrough->>core_helpers: get_model_info_from_litellm_params(logging_obj.litellm_params)
        core_helpers-->>passthrough: model_info
        passthrough->>passthrough: completion_cost(router_model_id=model_info["id"],<br/>custom_pricing=use_custom_pricing_for_model(...))
    end

    llm_http-->>responses_main: response
    responses_main->>logging: success_handler(result)
    logging->>core_helpers: get_model_info_from_litellm_params(model_call_details["litellm_params"])
    core_helpers-->>logging: model_info with deployment id
    logging->>logging: _response_cost_calculator(result,<br/>router_model_id=model_info["id"])
    Note over logging: Custom pricing applied via deployment ID lookup
    logging-->>User: async_log_success_event(kwargs, response_cost=custom_price)

Last reviewed commit: b19fcaa

Comment on lines +146 to +149
kwargs["model"] = model
kwargs.setdefault("litellm_params", {})
if litellm_params:
    kwargs["litellm_params"].update(litellm_params)
Contributor


Unconditional .update(litellm_params) may overwrite existing kwargs["litellm_params"] keys

litellm_params here is extracted from logging_obj.model_call_details["litellm_params"], which was populated by update_environment_variables in llm_http_handler.py. It contains not only metadata (the key you need to propagate) but also preset_cache_key: None, stream_response: {}, and all of anthropic_messages_optional_request_params (e.g. max_tokens, stream, etc.).

If kwargs["litellm_params"] already holds data that was set earlier in the proxy request pipeline, calling .update(litellm_params) overwrites any conflicting keys. The intent appears to be to ensure metadata with model_info reaches the downstream logging callbacks — a targeted merge on just metadata would be safer:

kwargs.setdefault("litellm_params", {})
if litellm_params:
    # Only propagate metadata (which carries model_info for custom pricing).
    # Avoid clobbering unrelated keys already set in kwargs["litellm_params"].
    kwargs["litellm_params"].setdefault("metadata", {})
    kwargs["litellm_params"]["metadata"].update(
        litellm_params.get("metadata") or {}
    )

Comment on lines +1655 to +1666
_lp_meta = (
    self.model_call_details
    .get("litellm_params", {})
    .get("metadata", {}) or {}
)
_lp = self.model_call_details.get("litellm_params", {}) or {}
_mi = (
    _lp_meta.get("model_info")
    or _lp.get("model_info")
    or (_lp.get("litellm_metadata", {}) or {}).get("model_info")
    or {}
)
Contributor


Duplicated model_info resolution logic

The identical 3-location lookup pattern:

_lp_meta = self.model_call_details.get("litellm_params", {}).get("metadata", {}) or {}
_lp = self.model_call_details.get("litellm_params", {}) or {}
_mi = (
    _lp_meta.get("model_info")
    or _lp.get("model_info")
    or (_lp.get("litellm_metadata", {}) or {}).get("model_info")
    or {}
)

appears verbatim three times: here (non-streaming path), at the streaming path around line 2491, and in anthropic_passthrough_logging_handler.py lines 122–129. Any future change to where model_info can live (e.g. a fourth location) requires updating all three copies.

Consider extracting it to a small helper on the Logging class (or as a module-level function), e.g.:

def _extract_model_info_from_litellm_params(litellm_params: dict) -> dict:
    metadata = litellm_params.get("metadata", {}) or {}
    return (
        metadata.get("model_info")
        or litellm_params.get("model_info")
        or (litellm_params.get("litellm_metadata", {}) or {}).get("model_info")
        or {}
    )

Comment on lines +4442 to +4454
def _get_model_info_from_litellm_params(
    litellm_params: Optional[dict],
) -> Dict[str, Any]:
    if litellm_params is None:
        return {}

    metadata = litellm_params.get("metadata", {}) or {}
    return (
        metadata.get("model_info")
        or litellm_params.get("model_info")
        or (litellm_params.get("litellm_metadata", {}) or {}).get("model_info")
        or {}
    )
Contributor


_get_model_info_from_litellm_params is annotated to return Dict[str, Any], but the expression chain can return a non-dict truthy value. If someone stores a non-dict value under model_info (e.g., a string), call sites like line 1656 (model_info.get("id")) will raise AttributeError at runtime.

Add a defensive cast to guarantee the return type:

Suggested change

Before:

def _get_model_info_from_litellm_params(
    litellm_params: Optional[dict],
) -> Dict[str, Any]:
    if litellm_params is None:
        return {}
    metadata = litellm_params.get("metadata", {}) or {}
    return (
        metadata.get("model_info")
        or litellm_params.get("model_info")
        or (litellm_params.get("litellm_metadata", {}) or {}).get("model_info")
        or {}
    )

After:

def _get_model_info_from_litellm_params(
    litellm_params: Optional[dict],
) -> Dict[str, Any]:
    if litellm_params is None:
        return {}
    metadata = litellm_params.get("metadata", {}) or {}
    result = (
        metadata.get("model_info")
        or litellm_params.get("model_info")
        or (litellm_params.get("litellm_metadata", {}) or {}).get("model_info")
        or {}
    )
    return result if isinstance(result, dict) else {}

Comment on lines +290 to +293
if "model_info" not in request_payload and self.litellm_metadata:
    _mi = self.litellm_metadata.get("model_info")
    if _mi:
        request_payload["model_info"] = _mi
Contributor


Top-level model_info injection bypasses the established lookup path

_get_model_info_from_litellm_params (the helper added by this PR) looks inside litellm_params["metadata"], litellm_params["model_info"], or litellm_params["litellm_metadata"] — it does not look at the top-level key being set here (request_payload["model_info"]).

After the fix in responses/main.py, litellm_params["metadata"]["model_info"] is already populated correctly when _handle_completed_response runs. The streaming cost path at litellm_logging.py:2473 extracts model_info from model_call_details["litellm_params"] (via the same helper), so this top-level injection has no effect on cost calculation.

The change appears to target update_response_metadata / set_hidden_params so that hidden_params["model_id"] gets set — but the explicit router_model_id argument now passed by litellm_logging.py renders that fallback unnecessary.

Consider documenting the exact code path this unblocks (i.e., which consumer of request_payload["model_info"] requires it), or remove the block if it is truly redundant, to avoid future confusion.

Comment on lines 1651 to 1657
model_info = _get_model_info_from_litellm_params(
    self.model_call_details.get("litellm_params", {}) or {}
)
self.model_call_details["response_cost"] = self._response_cost_calculator(
    result=logging_result,
    router_model_id=model_info.get("id"),
)
Contributor


Broader scope than intended for this fix

The new router_model_id injection at lines 1651–1657 affects ALL non-streaming call types that reach the else branch (chat completions, embeddings, etc.), not just /v1/responses. While regressions are prevented by the earlier elif self.model_call_details.get("response_cost") is not None: pass guard, it is worth verifying that for these other call types _get_model_info_from_litellm_params consistently returns {} (or a dict without "id") so that router_model_id=None is passed and cost calculation is unchanged.

If any existing call type stores a model_info with an "id" in litellm_params (which many router-dispatched calls do), _response_cost_calculator will now be called with a non-None router_model_id where it previously wasn't. If the deployment ID happens to exist in litellm.model_cost with unexpectedly-priced entries, this could silently change cost calculations for those call types.

Consider scoping the router_model_id injection to the specific call types this PR targets (aresponses, responses) by guarding on self.call_type:

# Only apply router_model_id for response API call types
if self.call_type in ("aresponses", "responses"):
    model_info = _get_model_info_from_litellm_params(
        self.model_call_details.get("litellm_params", {}) or {}
    )
    router_model_id = model_info.get("id")
else:
    router_model_id = None
self.model_call_details["response_cost"] = self._response_cost_calculator(
    result=logging_result,
    router_model_id=router_model_id,
)

Comment on lines +145 to +149
if litellm_params.get("metadata") is not None:
    existing_metadata = kwargs["litellm_params"].get("metadata", {}) or {}
    merged_metadata = dict(existing_metadata)
    merged_metadata.update(litellm_params.get("metadata") or {})
    kwargs["litellm_params"]["metadata"] = merged_metadata
Contributor


Merge direction may silently overwrite proxy-pipeline metadata

The merge applies logging_obj's metadata on top of the existing kwargs["litellm_params"]["metadata"]:

merged_metadata = dict(existing_metadata)
merged_metadata.update(litellm_params.get("metadata") or {})  # logging_obj wins

litellm_params.get("metadata") here is the merged_metadata built by llm_http_handler.py, which already contains all user-provided metadata keys plus the deployment's model_info. Because logging_obj's metadata wins, any key that the proxy pipeline added to kwargs["litellm_params"]["metadata"] after the initial handler ran (e.g., a middleware that injects user or trace_id) will be silently overwritten if that key also appears in the logging object's metadata.

For correctness in the custom-pricing case you only need to ensure the model_info key from litellm_params reaches kwargs["litellm_params"]["metadata"]. A targeted write is safer:

kwargs.setdefault("litellm_params", {})
kwargs["litellm_params"].setdefault("metadata", {})
if litellm_params.get("metadata", {}).get("model_info"):
    kwargs["litellm_params"]["metadata"]["model_info"] = (
        litellm_params["metadata"]["model_info"]
    )

Comment on lines +145 to +152
if litellm_params.get("metadata") is not None:
    logging_metadata = litellm_params.get("metadata") or {}
    existing_metadata = kwargs["litellm_params"].get("metadata", {}) or {}
    merged_metadata = dict(logging_metadata)
    merged_metadata.update(existing_metadata)
    if logging_metadata.get("model_info") is not None:
        merged_metadata["model_info"] = logging_metadata["model_info"]
    kwargs["litellm_params"]["metadata"] = merged_metadata
Contributor


Merge start-point is counter-intuitive but correctly tested

The merge starts from logging_metadata and then overlays existing_metadata, which means existing pipeline keys win for every field except model_info. The test at test_custom_pricing_metadata_propagation.py:785 (shared_field == "from-existing") confirms this is intentional.

A brief comment would help future readers understand the direction:

Suggested change
Before:

if litellm_params.get("metadata") is not None:
    logging_metadata = litellm_params.get("metadata") or {}
    existing_metadata = kwargs["litellm_params"].get("metadata", {}) or {}
    merged_metadata = dict(logging_metadata)
    merged_metadata.update(existing_metadata)
    if logging_metadata.get("model_info") is not None:
        merged_metadata["model_info"] = logging_metadata["model_info"]
    kwargs["litellm_params"]["metadata"] = merged_metadata

After:

if litellm_params.get("metadata") is not None:
    logging_metadata = litellm_params.get("metadata") or {}
    existing_metadata = kwargs["litellm_params"].get("metadata", {}) or {}
    # Start from logging_obj metadata, let existing pipeline metadata win for
    # all keys, then unconditionally restore deployment model_info.
    merged_metadata = dict(logging_metadata)
    merged_metadata.update(existing_metadata)
    if logging_metadata.get("model_info") is not None:
        merged_metadata["model_info"] = logging_metadata["model_info"]
    kwargs["litellm_params"]["metadata"] = merged_metadata

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +556 to +558
deadline = time.time() + 2.0
while cost_callback.response_cost is None and time.time() < deadline:
    time.sleep(0.01)
Contributor


Polling loop is redundant for sync streaming

For sync streaming, log_success_event is called synchronously as each chunk is processed inside the for _chunk in response loop above. By the time that loop exits, cost_callback.response_cost is already set. The while polling loop will exit immediately on the first iteration.

Consider simplifying:

Suggested change
Before:

deadline = time.time() + 2.0
while cost_callback.response_cost is None and time.time() < deadline:
    time.sleep(0.01)

After:

# For sync streaming, cost_callback.response_cost is already set
# by the time the for-loop above completes (callbacks are synchronous).

If there is a theoretical case where the callback is deferred (e.g. a background thread), a brief comment explaining that intent would help readers understand why the poll is needed.

@DmitriyAlergant
Contributor Author

@greptileai @greptile-apps can you review one last commit 1f3530185 and update the summary?

Comment on lines +4477 to +4478
or litellm_params.get("model_info")
or (litellm_params.get("litellm_metadata", {}) or {}).get("model_info")
Contributor


Empty-dict short-circuit in or-chain silently skips explicit model_info: {}

The chained or evaluates Python truthiness, so if metadata["model_info"] is an empty dict {} (falsy), the expression falls through to litellm_params.get("model_info") even though model_info was explicitly set in the highest-priority location.

In practice this is benign (an empty model_info has no custom pricing), but it creates an inconsistency: a caller who deliberately sets metadata={"model_info": {}} to reset any inherited model info would silently get a different (lower-priority) model_info returned instead of {}.

Consider using an explicit sentinel instead of truthiness:

_MISSING = object()

metadata_mi = metadata.get("model_info", _MISSING)
if metadata_mi is not _MISSING:
    result = metadata_mi
else:
    lp_mi = litellm_params.get("model_info", _MISSING)
    if lp_mi is not _MISSING:
        result = lp_mi
    else:
        result = (litellm_params.get("litellm_metadata") or {}).get("model_info", {})
return result if isinstance(result, dict) else {}

@DmitriyAlergant
Contributor Author

DmitriyAlergant commented Mar 13, 2026

@ishaan-berri can you review?

This fixes a regression you introduced in #20679

4/5 Greptile confidence; 5.0 is not feasible at this complexity, since it keeps coming up with ridiculously minor and far-fetched stuff.

    pass
else:
    router_model_id = None
    if self.call_type in (
Member


this is a weird hardcoding, why do it this way? @DmitriyAlergant

@DmitriyAlergant
Contributor Author

@krrishdholakia to be honest this was mainly to satisfy "the beast" (greptile): #23239 (comment)

She is not wrong, since other endpoints are not properly covered by tests for this change. And this core pathway is complex and not 100% understood for implications on all possible endpoints.

Based on your comment, I pushed a commit to remove this guard, and the tests are still passing.

@krrishdholakia
Member

Fix on main
