Studio: surface external-provider cache hits and writes in context bar #5736
Code Review
This pull request introduces support for tracking and displaying prompt cache statistics across multiple providers, including Anthropic, OpenAI, and Gemini. It updates the chat adapter to normalize cache-hit data and capture Anthropic-specific cache-write tokens, which are now displayed in the UI's context usage bar. I have no feedback to provide as there were no review comments to evaluate.
End-to-end verification against live Anthropic via an isolated

Setup: registered an Anthropic provider record, sent the same long-system-prompt chat twice with

Turn 1 (cache miss, write):

    "usage": {
      "prompt_tokens": 5220,
      "completion_tokens": 5,
      "total_tokens": 5225,
      "prompt_tokens_details": {"cached_tokens": 0},
      "cache_creation_input_tokens": 5217,
      "cache_read_input_tokens": 0
    }

Turn 2 (cache hit, read):

    "usage": {
      "prompt_tokens": 5220,
      "completion_tokens": 5,
      "total_tokens": 5225,
      "prompt_tokens_details": {"cached_tokens": 5217},
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 5217
    }

Frontend wiring this PR adds:

    const cachedTokens =
      meta?.timings?.cache_n ??
      meta?.usage?.prompt_tokens_details?.cached_tokens ??
      meta?.usage?.cache_read_input_tokens ??
      0;
    const cacheWriteTokens = meta?.usage?.cache_creation_input_tokens ?? 0;

Mapped tooltip rows (with the new conditional render hiding 0-valued rows):

Backend coverage (run against the same venv):

CI on this PR is still BLOCKED by the same Windows pydantic install flake every other open Studio PR is sitting on; once #5733 (root-cause installer fix) merges this clears via rebase.
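The fallback chain quoted above can be checked against the two usage payloads from this verification run. The following is a minimal sketch, not the adapter's actual code: the `Usage` and `Meta` interfaces are trimmed-down assumptions, and `extractCacheStats` is a hypothetical helper name that only wraps the quoted expressions.

```typescript
// Hypothetical, trimmed-down shapes for illustration; the real types live in chat-adapter.ts.
interface Usage {
  prompt_tokens_details?: { cached_tokens?: number };
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

interface Meta {
  timings?: { cache_n?: number };
  usage?: Usage;
}

// Hypothetical helper wrapping the fallback chain quoted in the comment above.
function extractCacheStats(meta?: Meta): { cachedTokens: number; cacheWriteTokens: number } {
  const cachedTokens =
    meta?.timings?.cache_n ??
    meta?.usage?.prompt_tokens_details?.cached_tokens ??
    meta?.usage?.cache_read_input_tokens ??
    0;
  const cacheWriteTokens = meta?.usage?.cache_creation_input_tokens ?? 0;
  return { cachedTokens, cacheWriteTokens };
}

// Cache-related fields copied from the Turn 1 and Turn 2 payloads.
const turn1Meta: Meta = {
  usage: {
    prompt_tokens_details: { cached_tokens: 0 },
    cache_creation_input_tokens: 5217,
    cache_read_input_tokens: 0,
  },
};
const turn2Meta: Meta = {
  usage: {
    prompt_tokens_details: { cached_tokens: 5217 },
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 5217,
  },
};

const turn1Stats = extractCacheStats(turn1Meta); // cachedTokens 0, cacheWriteTokens 5217
const turn2Stats = extractCacheStats(turn2Meta); // cachedTokens 5217, cacheWriteTokens 0
const emptyStats = extractCacheStats(undefined); // both counters fall back to 0
```

Under these assumptions, Turn 1 maps to a writes-only row and Turn 2 to a hits-only row, which matches the "cache miss, write" and "cache hit, read" labels on the payloads.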
Pushed 8bd686e addressing three P1s from a fresh reviewer.py pass that I should have caught before opening this PR. All three were asymmetric-fix sites: the producer side was widened for external providers but the consumer side still gated on
The Anthropic / OpenAI Responses streaming paths already emit an include_usage-style SSE chunk carrying prompt_tokens_details.cached_tokens and cache_creation_input_tokens / cache_read_input_tokens (see _build_usage_chunk in external_provider.py), but the chat-adapter only read the local llama-server timings.cache_n field. As a result, the context-usage tooltip never showed cache hits or writes for external providers, even though the backend was computing them.

Read the external usage envelope as a fallback when timings.cache_n is absent, and surface Anthropic cache_creation_input_tokens as a separate "Cache writes" line in the tooltip so users can tell a cache miss from a cache hit on a turn that both reads and writes the cache.

- ServerUsage gains optional prompt_tokens_details.cached_tokens, cache_creation_input_tokens, cache_read_input_tokens.
- contextUsage store entry gains optional cacheWriteTokens.
- ContextUsageBar gains optional cacheWrites tooltip line.
- chat-page wires both fields through to the bar.
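To illustrate the separate "Cache writes" line, here is a small sketch of how tooltip rows might be derived so that zero-valued counters are hidden. `TooltipRow` and `buildCacheRows` are hypothetical names, not the component's real API; the 5217 figures come from the verification payloads in this thread, and 1200 is a made-up local value.

```typescript
// Hypothetical row shape; the real tooltip markup lives in context-usage-bar.tsx.
interface TooltipRow {
  label: string;
  value: number;
}

// Build the cache rows for the tooltip, skipping any counter that is absent or zero.
function buildCacheRows(cachedTokens?: number, cacheWriteTokens?: number): TooltipRow[] {
  const rows: TooltipRow[] = [];
  if (cachedTokens !== undefined && cachedTokens > 0) {
    rows.push({ label: "Cache hits", value: cachedTokens });
  }
  if (cacheWriteTokens !== undefined && cacheWriteTokens > 0) {
    rows.push({ label: "Cache writes", value: cacheWriteTokens });
  }
  return rows;
}

// Cache miss on an external provider: only the write row is produced.
const missRows = buildCacheRows(0, 5217);
// Cache hit on an external provider: only the hit row is produced.
const hitRows = buildCacheRows(5217, 0);
// Local llama-server turn (made-up value): hits only, no write counter at all.
const localRows = buildCacheRows(1200, undefined);
```

Keeping the two counters as separate rows is what lets a writes-only turn read differently from a hits-only turn.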
Reviewer round on the original PR caught three asymmetric-fix sites where the producer side surfaced external prompt-cache stats but the consumer side still gated on ggufContextLength (which is only ever set for the local llama-server runtime). Result: the entire cache-stats PR shipped invisible for Anthropic / OpenAI Responses / Gemini, which is exactly the set of providers it was added for.

- chat-page.tsx: drop the ggufContextLength precondition on the ContextUsageBar mount. The bar already tracks usage; let it decide what to render based on what it knows.
- context-usage-bar.tsx: make `total` optional. When absent, drop the "/ total" ratio + percentage progress bar + "approaching limit" helper, and just show per-turn counters + cache stats. Bootstrap guard tightened so an all-zero, all-undefined state still renders nothing.
- runtime-provider.tsx: external-provider rehydration was rejected by the `store.ggufContextLength` check. Keep the "fits inside window" sanity check when a local context window IS known, drop it when it isn't.
- message-timing.tsx: the per-message timing popover used a separate "Cache hits" code path that only read llama-server's timings.cache_n. Fall through to custom.contextUsage for external providers, and add a parallel "Cache writes" line for Anthropic cache_creation events.
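The optional-`total` behaviour described for context-usage-bar.tsx can be sketched as a pure function. This is an assumption-laden illustration rather than the component itself: `BarState`, `describeBar`, and the output string format are all hypothetical, and the "approaching limit" helper is left out because its threshold is not stated in this thread.

```typescript
// Hypothetical props; names are illustrative, not the component's real prop list.
interface BarState {
  used?: number;
  total?: number;
  cachedTokens?: number;
  cacheWriteTokens?: number;
}

// Returns the text the bar would show, or null when there is nothing to render.
function describeBar(state: BarState): string | null {
  const used = state.used ?? 0;
  const hits = state.cachedTokens ?? 0;
  const writes = state.cacheWriteTokens ?? 0;
  // Treat a missing or non-positive total as "context window unknown".
  const total = state.total !== undefined && state.total > 0 ? state.total : undefined;

  // Bootstrap guard (assumed form): no window and no counters means nothing to show.
  if (total === undefined && used === 0 && hits === 0 && writes === 0) return null;

  const parts: string[] = [];
  if (total !== undefined) {
    // Local runtime: the window is known, so show the ratio and percentage.
    const pct = Math.round((used / total) * 100);
    parts.push(`${used} / ${total} tokens (${pct}%)`);
  } else {
    // External provider: no window, so show the plain per-turn counter.
    parts.push(`${used} tokens`);
  }
  if (hits > 0) parts.push(`Cache hits: ${hits}`);
  if (writes > 0) parts.push(`Cache writes: ${writes}`);
  return parts.join(", ");
}

const localBar = describeBar({ used: 2048, total: 8192 });
const externalBar = describeBar({ used: 5225, cachedTokens: 5217 });
const emptyBar = describeBar({ used: 0, cachedTokens: 0 });
```

With these inputs, the local case keeps its ratio, the external case shows counters only, and the all-zero case renders nothing.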
d80e54c to b264466
Three follow-ups on #5736 so the relaxed external-provider render gate does not show stale token / cache stats from a different model:

1) setCheckpoint now clears contextUsage on a real checkpoint change. setActiveThreadId and clearCheckpoint already did this; the most-traveled transition path (the user switching models from the picker) leaked the prior turn's counts because they were never cleared.
2) The external-selection branch in chat-page.tsx now also clears contextUsage at the same time it nulls ggufContextLength / activeNativePathToken. Without this an in-session switch from a local model to an external provider would visibly carry the previous local turn's counters into the new provider's bar.
3) exitCompare's rehydration is now scoped: restore the saved usage only when the message's modelId matches the active checkpoint AND, for local turns where a context window is known, when the saved total fits inside that window. Without this the bar could render a stale local-model usage on top of an external provider, or an oversized usage object that exceeds the now-active window.

Typecheck clean.
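The scoped rehydration rule in item 3 amounts to a two-part predicate. A minimal sketch under stated assumptions: `SavedUsage`, its `total` field name, `shouldRestoreUsage`, and the model identifiers are hypothetical, and a `null` context window stands in for "no local window known".

```typescript
// Hypothetical persisted shape; field names are illustrative.
interface SavedUsage {
  modelId?: string;
  total: number;
}

// Restore saved usage only when it belongs to the active checkpoint and,
// if a local context window is known, when it fits inside that window.
function shouldRestoreUsage(
  saved: SavedUsage,
  activeCheckpoint: string,
  contextWindow: number | null,
): boolean {
  if (saved.modelId !== activeCheckpoint) return false;
  if (contextWindow !== null && saved.total > contextWindow) return false;
  return true;
}

const savedLocal: SavedUsage = { modelId: "local-model", total: 6000 };

// Same model, window large enough: restore.
const restoreSameModel = shouldRestoreUsage(savedLocal, "local-model", 8192);
// Active checkpoint is now a different (external) model: do not restore.
const restoreOtherModel = shouldRestoreUsage(savedLocal, "external-model", null);
// Same model, but the saved total exceeds the now-active window: do not restore.
const restoreOversized = shouldRestoreUsage(savedLocal, "local-model", 4096);
```

The window check is skipped entirely when no window is known, which is what allows external-provider usage to be restored at all.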
Follow-up to 042e0ac that catches four asymmetric-fix sites the checkpoint-scoping pass missed:

1) setParams now also clears contextUsage on a real checkpoint change. The local model load path in use-chat-model-runtime calls setParams(mergeBackendRecommendedInference(...)) which mutates params.checkpoint before refresh() eventually fires setCheckpoint; the intermediate window rendered the previous model's counters under the new checkpoint.
2) chat-adapter.ts setContextUsage on stream completion now gates on the captured params.checkpoint still being active. A late completion from provider A used to clobber the context bar after the user switched to provider B mid-stream.
3) chat-page.tsx exitCompare rehydration no longer accepts a saved modelId-stamped usage when the active checkpoint is empty. A user who entered compare, cleared the model, and exited compare would otherwise see the cleared model's stats reappear.
4) runtime-provider.tsx thread-load no longer restores legacy unscoped usage (no modelId) unless a local context window is known. With the relaxed external-provider render gate, old pre-PR persisted messages without a modelId stamp could attach their counts to an unrelated active provider.

Also switches message-timing.tsx cache-hit fallback from || to ?? so an explicit cache_n=0 is not replaced by a stale cachedTokens.

Typecheck clean.
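Two of the fixes above are easy to show in isolation: the late-completion gate from item 2 and the switch from `||` to `??`. The sketch below assumes a simplified store; `RuntimeState`, `UsageSnapshot`, `applyCompletionUsage`, and the provider names are hypothetical and are not the store's real API.

```typescript
// Hypothetical store shape for illustration only.
interface UsageSnapshot {
  promptTokens: number;
  cachedTokens: number;
}

interface RuntimeState {
  checkpoint: string;
  contextUsage: UsageSnapshot | null;
}

// Apply usage from a finished stream only if the checkpoint captured
// when the request started is still the active one.
function applyCompletionUsage(
  current: RuntimeState,
  capturedCheckpoint: string,
  usage: UsageSnapshot,
): boolean {
  if (current.checkpoint !== capturedCheckpoint) return false;
  current.contextUsage = usage;
  return true;
}

// `||` treats an explicit 0 as missing and falls through to the fallback.
function cacheHitsWithOr(cacheN: number | undefined, fallback: number): number {
  return cacheN || fallback;
}

// `??` falls through only on null or undefined, so an explicit 0 survives.
function cacheHitsWithNullish(cacheN: number | undefined, fallback: number): number {
  return cacheN ?? fallback;
}

// The user switched to provider B while provider A was still streaming.
const runtimeState: RuntimeState = { checkpoint: "provider-b", contextUsage: null };
const lateApplied = applyCompletionUsage(runtimeState, "provider-a", { promptTokens: 5220, cachedTokens: 5217 });
const usageAfterLate = runtimeState.contextUsage; // still null: the late result was dropped
const currentApplied = applyCompletionUsage(runtimeState, "provider-b", { promptTokens: 100, cachedTokens: 0 });

const staleHits = cacheHitsWithOr(0, 5217); // 5217: the explicit zero is lost
const correctHits = cacheHitsWithNullish(0, 5217); // 0: the explicit zero is kept
```

The same captured-value comparison works for any asynchronous result that may outlive the selection it was requested under.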
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope.
Summary
- The Anthropic / OpenAI Responses streaming paths already emit an `include_usage`-style chunk carrying `prompt_tokens_details.cached_tokens` and `cache_creation_input_tokens` / `cache_read_input_tokens` (built by `_build_usage_chunk` in `studio/backend/core/inference/external_provider.py`), but the chat-adapter only read llama-server's `timings.cache_n`. The context-usage tooltip therefore never showed cache activity for external providers, even though the backend already had the numbers.
- Read the external usage envelope as a fallback for `cachedTokens` when `timings.cache_n` is absent. Surface Anthropic `cache_creation_input_tokens` separately as a new "Cache writes" line in the tooltip so a cache miss (writes-only) is visually distinguishable from a cache hit (reads).

Changes
- `studio/frontend/src/features/chat/api/chat-adapter.ts`: `ServerUsage` gains optional `prompt_tokens_details.cached_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`. Extract `cachedTokens` from `timings.cache_n` -> `prompt_tokens_details.cached_tokens` -> `cache_read_input_tokens` (in that order). Extract `cacheWriteTokens` from `cache_creation_input_tokens`. Both flow into the runtime store and the persisted metadata payload.
- `studio/frontend/src/features/chat/stores/chat-runtime-store.ts`: `contextUsage` gains optional `cacheWriteTokens`.
- `studio/frontend/src/features/chat/runtime-provider.tsx`: stored-message rehydration accepts `cacheWriteTokens`.
- `studio/frontend/src/features/chat/components/context-usage-bar.tsx`: new optional `cacheWrites` prop, rendered as a separate tooltip line when > 0.
- `studio/frontend/src/features/chat/chat-page.tsx`: forwards `contextUsage.cacheWriteTokens` into `<ContextUsageBar />`.

Test plan
- `cd studio/frontend && npm run typecheck` clean
- `timings.cache_n` as "Cache hits" with no "Cache writes" line