Studio: surface external-provider cache hits and writes in context bar #5736

Merged
danielhanchen merged 7 commits into main from feat/surface-cache-stats on May 26, 2026

@danielhanchen

Member

Summary

  • The Anthropic / OpenAI Responses streaming paths already emit an OpenAI include_usage-style chunk carrying prompt_tokens_details.cached_tokens and cache_creation_input_tokens / cache_read_input_tokens (built by _build_usage_chunk in studio/backend/core/inference/external_provider.py), but the chat-adapter only read llama-server's timings.cache_n. The context-usage tooltip therefore never showed cache activity for external providers, even though the backend already had the numbers.
  • Read the external usage envelope as a fallback for cachedTokens when timings.cache_n is absent. Surface Anthropic cache_creation_input_tokens separately as a new "Cache writes" line in the tooltip so a cache miss (writes-only) is visually distinguishable from a cache hit (reads).

Changes

  • studio/frontend/src/features/chat/api/chat-adapter.ts: ServerUsage gains optional prompt_tokens_details.cached_tokens, cache_creation_input_tokens, cache_read_input_tokens. Extract cachedTokens from timings.cache_n -> prompt_tokens_details.cached_tokens -> cache_read_input_tokens (in that order). Extract cacheWriteTokens from cache_creation_input_tokens. Both flow into the runtime store and the persisted metadata payload.
  • studio/frontend/src/features/chat/stores/chat-runtime-store.ts: contextUsage gains optional cacheWriteTokens.
  • studio/frontend/src/features/chat/runtime-provider.tsx: stored-message rehydration accepts cacheWriteTokens.
  • studio/frontend/src/features/chat/components/context-usage-bar.tsx: new optional cacheWrites prop, rendered as a separate tooltip line when > 0.
  • studio/frontend/src/features/chat/chat-page.tsx: forwards contextUsage.cacheWriteTokens into <ContextUsageBar />.

Test plan

  • cd studio/frontend && npm run typecheck clean
  • Open chat with Anthropic (or any cached-prompt provider), send a turn long enough to cache, verify tooltip shows non-zero "Cache hits" on the second turn and non-zero "Cache writes" on the first turn
  • Local llama-server chat still surfaces timings.cache_n as "Cache hits" with no "Cache writes" line

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for tracking and displaying prompt cache statistics across multiple providers, including Anthropic, OpenAI, and Gemini. It updates the chat adapter to normalize cache-hit data and capture Anthropic-specific cache-write tokens, which are now displayed in the UI's context usage bar. I have no feedback to provide as there were no review comments to evaluate.

@danielhanchen

Member Author

End-to-end verification against live Anthropic via an isolated UNSLOTH_STUDIO_HOME instance on this branch (Studio C on port 8902).

Setup: registered an Anthropic provider record, sent the same long-system-prompt chat twice with enable_prompt_caching=true.

Turn 1 (cache miss, write):

"usage": {
  "prompt_tokens": 5220,
  "completion_tokens": 5,
  "total_tokens": 5225,
  "prompt_tokens_details": {"cached_tokens": 0},
  "cache_creation_input_tokens": 5217,
  "cache_read_input_tokens": 0
}

Turn 2 (cache hit, read):

"usage": {
  "prompt_tokens": 5220,
  "completion_tokens": 5,
  "total_tokens": 5225,
  "prompt_tokens_details": {"cached_tokens": 5217},
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 5217
}

Frontend wiring this PR adds:

const cachedTokens =
  meta?.timings?.cache_n ??
  meta?.usage?.prompt_tokens_details?.cached_tokens ??
  meta?.usage?.cache_read_input_tokens ??
  0;
const cacheWriteTokens = meta?.usage?.cache_creation_input_tokens ?? 0;

Mapped tooltip rows (with the new conditional render hiding 0-valued rows):

  • Turn 1 -> "Cache writes: 5,217"
  • Turn 2 -> "Cache hits: 5,217"
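As a worked check of the mapping above, the following sketch applies the same fallback chain to the two captured usage payloads. The names `StreamMeta`, `extractCacheStats`, and `tooltipRows` are hypothetical and chosen for illustration; they are not the identifiers used in chat-adapter.ts, and the thousands separator assumes `toLocaleString("en-US")`.

```typescript
// Hypothetical shapes; field names follow the usage payloads quoted above.
interface ServerUsage {
  prompt_tokens_details?: { cached_tokens?: number };
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

interface StreamMeta {
  timings?: { cache_n?: number }; // only present for local llama-server
  usage?: ServerUsage;            // external-provider usage envelope
}

interface CacheStats {
  cachedTokens: number;
  cacheWriteTokens: number;
}

// Same precedence as the wiring above: llama-server timings first,
// then the OpenAI-style field, then the Anthropic-style field.
function extractCacheStats(meta?: StreamMeta): CacheStats {
  const cachedTokens =
    meta?.timings?.cache_n ??
    meta?.usage?.prompt_tokens_details?.cached_tokens ??
    meta?.usage?.cache_read_input_tokens ??
    0;
  const cacheWriteTokens = meta?.usage?.cache_creation_input_tokens ?? 0;
  return { cachedTokens, cacheWriteTokens };
}

// Zero-valued rows are hidden, so a pure write and a pure read
// each produce exactly one row.
function tooltipRows(stats: CacheStats): string[] {
  const rows: string[] = [];
  if (stats.cachedTokens > 0) {
    rows.push(`Cache hits: ${stats.cachedTokens.toLocaleString("en-US")}`);
  }
  if (stats.cacheWriteTokens > 0) {
    rows.push(`Cache writes: ${stats.cacheWriteTokens.toLocaleString("en-US")}`);
  }
  return rows;
}

const turn1: StreamMeta = {
  usage: {
    prompt_tokens_details: { cached_tokens: 0 },
    cache_creation_input_tokens: 5217,
    cache_read_input_tokens: 0,
  },
};
const turn2: StreamMeta = {
  usage: {
    prompt_tokens_details: { cached_tokens: 5217 },
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 5217,
  },
};

console.log(tooltipRows(extractCacheStats(turn1))); // [ 'Cache writes: 5,217' ]
console.log(tooltipRows(extractCacheStats(turn2))); // [ 'Cache hits: 5,217' ]
```

Because the chain uses `??`, an explicit `cached_tokens: 0` on Turn 1 stops the fallback at 0 instead of falling through to a later field.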

Backend coverage (run against the same venv): test_external_provider_usage_chunk.py, test_anthropic_cache_ttl.py, test_anthropic_messages.py, test_pricing.py -> 123 passed, 0 failed.

CI on this PR is still BLOCKED by the same Windows pydantic install flake every other open Studio PR is sitting on; once #5733 (root-cause installer fix) merges this clears via rebase.

@danielhanchen

Member Author

Pushed 8bd686e addressing three P1s from a fresh reviewer.py pass that I should have caught before opening this PR. All three were asymmetric-fix sites: the producer side was widened for external providers but the consumer side still gated on ggufContextLength (which is only ever set for the local llama-server runtime).

  • chat-page.tsx:1494: dropped the ggufContextLength precondition on the ContextUsageBar mount. The bar already tracks usage; let it decide what to render based on what it knows.
  • context-usage-bar.tsx: made total optional. When absent, drop the "/ total" ratio + percentage bar + "approaching limit" helper, and just show per-turn counters + cache stats. Bootstrap guard tightened so an all-zero, all-undefined state still renders nothing.
  • runtime-provider.tsx:836: external-provider rehydration was being rejected by the store.ggufContextLength check. Keep the "fits inside window" sanity check when a local context window IS known, drop it otherwise.
  • message-timing.tsx:125: the per-message timing popover used a separate "Cache hits" code path that only read llama-server's timings.cache_n. Fall through to custom.contextUsage for external providers, and add a parallel "Cache writes" line for Anthropic cache_creation events.
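The optional-`total` behaviour described for context-usage-bar.tsx can be sketched as a pure function that decides what the bar shows. This is a minimal illustration under assumptions: `ContextUsageProps`, `BarView`, and `buildBarView` are hypothetical names, and the real component's props and rendering may differ.

```typescript
// Hypothetical props; `total` is only known for the local llama-server runtime.
interface ContextUsageProps {
  used?: number;
  total?: number;
  cached?: number;
  cacheWrites?: number;
}

interface BarView {
  label: string;    // "used / total" when a window is known, else just "used"
  percent?: number; // only set when a window is known
  rows: string[];   // cache stat rows, zero-valued rows hidden
}

function buildBarView(props: ContextUsageProps): BarView | null {
  const used = props.used ?? 0;
  const cached = props.cached ?? 0;
  const cacheWrites = props.cacheWrites ?? 0;

  // Bootstrap guard: an all-zero, all-undefined state renders nothing.
  if (used === 0 && cached === 0 && cacheWrites === 0) return null;

  const rows: string[] = [];
  if (cached > 0) rows.push(`Cache hits: ${cached}`);
  if (cacheWrites > 0) rows.push(`Cache writes: ${cacheWrites}`);

  const total = props.total;
  if (total !== undefined && total > 0) {
    // Local runtime: keep the ratio and percentage bar.
    const percent = Math.min(100, Math.round((used / total) * 100));
    return { label: `${used} / ${total}`, percent, rows };
  }

  // External provider: no window known, so show counters and cache stats only.
  return { label: `${used}`, rows };
}

console.log(buildBarView({}));                                 // null
console.log(buildBarView({ used: 5225, cacheWrites: 5217 }));  // no percent
console.log(buildBarView({ used: 2048, total: 8192 }));        // percent: 25
```

Keeping the decision in one place means the mount site does not need its own ggufContextLength precondition; the bar renders nothing only when it truly has nothing to show.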

npm run typecheck clean. End-to-end Anthropic verification posted earlier still holds with the consumer side now actually rendering.

The Anthropic / OpenAI Responses streaming paths already emit an
include_usage-style SSE chunk carrying prompt_tokens_details.cached_tokens
and cache_creation_input_tokens / cache_read_input_tokens (see
_build_usage_chunk in external_provider.py), but the chat-adapter only
read the local llama-server timings.cache_n field. As a result, the
context-usage tooltip never showed cache hits or writes for external
providers, even though the backend was computing them.

Read the external usage envelope as a fallback when timings.cache_n is
absent, and surface Anthropic cache_creation_input_tokens as a separate
"Cache writes" line in the tooltip so users can tell a cache miss
(writes only) from a cache hit (reads).

- ServerUsage gains optional prompt_tokens_details.cached_tokens,
  cache_creation_input_tokens, cache_read_input_tokens.
- contextUsage store entry gains optional cacheWriteTokens.
- ContextUsageBar gains optional cacheWrites tooltip line.
- chat-page wires both fields through to the bar.

Reviewer round on the original PR caught three asymmetric-fix sites
where the producer side surfaced external prompt-cache stats but the
consumer side still gated on ggufContextLength (which is only ever set
for the local llama-server runtime). Result: the entire cache-stats
PR shipped invisible for Anthropic / OpenAI Responses / Gemini, which
is exactly the set of providers it was added for.

- chat-page.tsx: drop the ggufContextLength precondition on the
  ContextUsageBar mount. The bar already tracks usage; let it decide
  what to render based on what it knows.
- context-usage-bar.tsx: make `total` optional. When absent, drop the
  "/ total" ratio + percentage progress bar + "approaching limit"
  helper, and just show per-turn counters + cache stats. Bootstrap
  guard tightened so an all-zero, all-undefined state still renders
  nothing.
- runtime-provider.tsx: external-provider rehydration was rejected by
  the `store.ggufContextLength` check. Keep the "fits inside window"
  sanity check when a local context window IS known, drop it when
  it isn't.
- message-timing.tsx: the per-message timing popover used a separate
  "Cache hits" code path that only read llama-server's timings.cache_n.
  Fall through to custom.contextUsage for external providers, and add
  a parallel "Cache writes" line for Anthropic cache_creation events.

@danielhanchen force-pushed the feat/surface-cache-stats branch from d80e54c to b264466 on May 23, 2026 14:00

Three follow-ups on #5736 so the relaxed external-provider render
gate does not show stale token / cache stats from a different model:

1) setCheckpoint now clears contextUsage on a real checkpoint
   change. setActiveThreadId and clearCheckpoint already did this;
   the most-traveled transition path (the user switching models from
   the picker) leaked the prior turn's counts because they were never
   cleared.

2) The external-selection branch in chat-page.tsx now also clears
   contextUsage at the same time it nulls ggufContextLength /
   activeNativePathToken. Without this an in-session switch from a
   local model to an external provider would visibly carry the
   previous local turn's counters into the new provider's bar.

3) exitCompare's rehydration is now scoped: restore the saved
   usage only when the message's modelId matches the active
   checkpoint AND, for local turns where a context window is known,
   when the saved total fits inside that window. Without this the
   bar could render a stale local-model usage on top of an external
   provider, or an oversized usage object that exceeds the now-
   active window.
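
The scoping rule in item 3 can be written as a small predicate. This is a sketch under assumptions: `SavedUsage` and `shouldRestoreUsage` are hypothetical names, the model ids are made up, and the real exitCompare code may structure the check differently.

```typescript
// Hypothetical shape of the usage saved on a persisted message.
interface SavedUsage {
  modelId?: string;
  totalTokens: number;
}

// Restore only when the usage belongs to the active checkpoint and,
// if a local context window is known, fits inside that window.
function shouldRestoreUsage(
  saved: SavedUsage,
  activeCheckpoint: string,
  ggufContextLength: number | null, // null for external providers
): boolean {
  if (saved.modelId !== activeCheckpoint) return false;
  if (ggufContextLength !== null && saved.totalTokens > ggufContextLength) {
    return false;
  }
  return true;
}

// External provider, matching model: restored.
console.log(shouldRestoreUsage({ modelId: "provider-a", totalTokens: 5225 }, "provider-a", null)); // true
// Local model, usage larger than the active window: rejected.
console.log(shouldRestoreUsage({ modelId: "local-gguf", totalTokens: 9000 }, "local-gguf", 8192)); // false
// Usage stamped for a different model: rejected.
console.log(shouldRestoreUsage({ modelId: "local-gguf", totalTokens: 100 }, "provider-a", null)); // false
```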

Typecheck clean.

Follow-up to 042e0ac that catches four asymmetric-fix sites the
checkpoint-scoping pass missed:

1) setParams now also clears contextUsage on a real checkpoint
   change. The local model load path in use-chat-model-runtime calls
   setParams(mergeBackendRecommendedInference(...)) which mutates
   params.checkpoint before refresh() eventually fires setCheckpoint;
   the intermediate window rendered the previous model's counters
   under the new checkpoint.

2) chat-adapter.ts setContextUsage on stream completion now gates on
   the captured params.checkpoint still being active. A late
   completion from provider A used to clobber the context bar after
   the user switched to provider B mid-stream.

3) chat-page.tsx exitCompare rehydration no longer accepts a saved
   modelId-stamped usage when the active checkpoint is empty. A user
   who entered compare, cleared the model, and exited compare would
   otherwise see the cleared model's stats reappear.

4) runtime-provider.tsx thread-load no longer restores legacy
   unscoped usage (no modelId) unless a local context window is
   known. With the relaxed external-provider render gate, old
   pre-PR persisted messages without a modelId stamp could attach
   their counts to an unrelated active provider.
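
The late-completion guard in item 2 can be sketched as follows. The store shape and the `applyCompletionUsage` helper are hypothetical, introduced only to show the idea of capturing the checkpoint when the stream starts and comparing it when the stream ends.

```typescript
// Hypothetical minimal store; the real runtime store has more fields.
interface Usage {
  promptTokens: number;
  cachedTokens: number;
}

interface RuntimeStore {
  checkpoint: string;
  contextUsage: Usage | null;
}

// Write usage only if the checkpoint captured at stream start is still
// the active one. Returns whether the write happened.
function applyCompletionUsage(
  store: RuntimeStore,
  capturedCheckpoint: string,
  usage: Usage,
): boolean {
  if (store.checkpoint !== capturedCheckpoint) return false;
  store.contextUsage = usage;
  return true;
}

const store: RuntimeStore = { checkpoint: "provider-a", contextUsage: null };
const captured = store.checkpoint; // captured when the stream starts

store.checkpoint = "provider-b";   // user switches mid-stream
const applied = applyCompletionUsage(store, captured, {
  promptTokens: 5220,
  cachedTokens: 5217,
});
console.log(applied, store.contextUsage); // false null
```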

Also switches message-timing.tsx cache-hit fallback from || to ??
so an explicit cache_n=0 is not replaced by a stale cachedTokens.
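
The difference between the two operators is easy to see in isolation. The function names below are hypothetical; only the operator behaviour is the point.

```typescript
// `||` treats 0 as "missing" and falls through to the next operand.
function cacheHitsWithOr(cacheN?: number, cachedTokens?: number): number {
  return cacheN || cachedTokens || 0;
}

// `??` only falls through on null or undefined, so an explicit 0 is kept.
function cacheHitsWithNullish(cacheN?: number, cachedTokens?: number): number {
  return cacheN ?? cachedTokens ?? 0;
}

// llama-server reports cache_n = 0 while a stale cachedTokens is still around.
console.log(cacheHitsWithOr(0, 5217));      // 5217 (stale value wins)
console.log(cacheHitsWithNullish(0, 5217)); // 0 (explicit zero respected)

// With cache_n absent, both forms fall back to cachedTokens.
console.log(cacheHitsWithNullish(undefined, 5217)); // 5217
```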

Typecheck clean.

@chatgpt-codex-connector


Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Member Author

Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope.

@chatgpt-codex-connector


Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen merged commit cc68720 into main on May 26, 2026
34 checks passed
@danielhanchen deleted the feat/surface-cache-stats branch on May 26, 2026 06:37