Studio: surface external-provider cache hits and writes in context bar by danielhanchen · Pull Request #5736 · unslothai/unsloth

danielhanchen · 2026-05-23T05:15:33Z

Summary

The Anthropic / OpenAI Responses streaming paths already emit an OpenAI include_usage-style chunk carrying prompt_tokens_details.cached_tokens and cache_creation_input_tokens / cache_read_input_tokens (built by _build_usage_chunk in studio/backend/core/inference/external_provider.py), but the chat-adapter only read llama-server's timings.cache_n. The context-usage tooltip therefore never showed cache activity for external providers, even though the backend already had the numbers.
Read the external usage envelope as a fallback for cachedTokens when timings.cache_n is absent. Surface Anthropic cache_creation_input_tokens separately as a new "Cache writes" line in the tooltip so a cache miss (writes-only) is visually distinguishable from a cache hit (reads).

Changes

studio/frontend/src/features/chat/api/chat-adapter.ts: ServerUsage gains optional prompt_tokens_details.cached_tokens, cache_creation_input_tokens, cache_read_input_tokens. Extract cachedTokens from timings.cache_n -> prompt_tokens_details.cached_tokens -> cache_read_input_tokens (in that order). Extract cacheWriteTokens from cache_creation_input_tokens. Both flow into the runtime store and the persisted metadata payload.
studio/frontend/src/features/chat/stores/chat-runtime-store.ts: contextUsage gains optional cacheWriteTokens.
studio/frontend/src/features/chat/runtime-provider.tsx: stored-message rehydration accepts cacheWriteTokens.
studio/frontend/src/features/chat/components/context-usage-bar.tsx: new optional cacheWrites prop, rendered as a separate tooltip line when > 0.
studio/frontend/src/features/chat/chat-page.tsx: forwards contextUsage.cacheWriteTokens into <ContextUsageBar />.

Test plan

cd studio/frontend && npm run typecheck (clean)
Open chat with Anthropic (or any cached-prompt provider), send a turn long enough to cache, verify tooltip shows non-zero "Cache hits" on the second turn and non-zero "Cache writes" on the first turn
Local llama-server chat still surfaces timings.cache_n as "Cache hits" with no "Cache writes" line

gemini-code-assist

Code Review

This pull request introduces support for tracking and displaying prompt cache statistics across multiple providers, including Anthropic, OpenAI, and Gemini. It updates the chat adapter to normalize cache-hit data and capture Anthropic-specific cache-write tokens, which are now displayed in the UI's context usage bar. I have no feedback to provide as there were no review comments to evaluate.

danielhanchen · 2026-05-23T07:10:21Z

End-to-end verification against live Anthropic via an isolated UNSLOTH_STUDIO_HOME instance on this branch (Studio C on port 8902).

Setup: registered an Anthropic provider record, sent the same long-system-prompt chat twice with enable_prompt_caching=true.

Turn 1 (cache miss, write):

"usage": {
  "prompt_tokens": 5220,
  "completion_tokens": 5,
  "total_tokens": 5225,
  "prompt_tokens_details": {"cached_tokens": 0},
  "cache_creation_input_tokens": 5217,
  "cache_read_input_tokens": 0
}

Turn 2 (cache hit, read):

"usage": {
  "prompt_tokens": 5220,
  "completion_tokens": 5,
  "total_tokens": 5225,
  "prompt_tokens_details": {"cached_tokens": 5217},
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 5217
}

Frontend wiring this PR adds:

const cachedTokens =
  meta?.timings?.cache_n ??
  meta?.usage?.prompt_tokens_details?.cached_tokens ??
  meta?.usage?.cache_read_input_tokens ??
  0;
const cacheWriteTokens = meta?.usage?.cache_creation_input_tokens ?? 0;

Mapped tooltip rows (with the new conditional render hiding 0-valued rows):

Turn 1 -> "Cache writes: 5,217"
Turn 2 -> "Cache hits: 5,217"
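
A minimal sketch of how the two usage payloads above could map to tooltip rows. The StreamMeta shape and the cacheTooltipRows helper are hypothetical names for illustration, not the PR's actual code; only the field names and the fallback order come from the wiring above.

```typescript
// Hypothetical shapes for illustration; field names follow the usage payloads above.
interface ServerUsage {
  prompt_tokens?: number;
  completion_tokens?: number;
  total_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

interface StreamMeta {
  timings?: { cache_n?: number };
  usage?: ServerUsage;
}

// Hypothetical helper: resolve cache counters, then hide zero-valued rows.
function cacheTooltipRows(meta?: StreamMeta): string[] {
  // Same precedence as the frontend wiring: llama-server first, then the
  // OpenAI-style field, then the Anthropic-style field.
  const cachedTokens =
    meta?.timings?.cache_n ??
    meta?.usage?.prompt_tokens_details?.cached_tokens ??
    meta?.usage?.cache_read_input_tokens ??
    0;
  const cacheWriteTokens = meta?.usage?.cache_creation_input_tokens ?? 0;

  const fmt = (n: number): string => n.toLocaleString("en-US");
  const rows: string[] = [];
  if (cachedTokens > 0) rows.push(`Cache hits: ${fmt(cachedTokens)}`);
  if (cacheWriteTokens > 0) rows.push(`Cache writes: ${fmt(cacheWriteTokens)}`);
  return rows;
}

// Turn 1 and Turn 2 payloads from the verification run above.
const turn1: StreamMeta = {
  usage: {
    prompt_tokens: 5220,
    completion_tokens: 5,
    total_tokens: 5225,
    prompt_tokens_details: { cached_tokens: 0 },
    cache_creation_input_tokens: 5217,
    cache_read_input_tokens: 0,
  },
};
const turn2: StreamMeta = {
  usage: {
    prompt_tokens: 5220,
    completion_tokens: 5,
    total_tokens: 5225,
    prompt_tokens_details: { cached_tokens: 5217 },
    cache_creation_input_tokens: 0,
    cache_read_input_tokens: 5217,
  },
};

console.log(cacheTooltipRows(turn1)); // only the "Cache writes" row
console.log(cacheTooltipRows(turn2)); // only the "Cache hits" row
```

Note that ?? only falls through on null or undefined, so an explicit 0 in an earlier field wins over a non-zero value in a later one.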

Backend coverage (run against the same venv): test_external_provider_usage_chunk.py, test_anthropic_cache_ttl.py, test_anthropic_messages.py, test_pricing.py -> 123 passed, 0 failed.

CI on this PR is still BLOCKED by the same Windows pydantic install flake every other open Studio PR is sitting on; once #5733 (root-cause installer fix) merges this clears via rebase.

danielhanchen · 2026-05-23T07:24:19Z

Pushed 8bd686e addressing three P1s from a fresh reviewer.py pass that I should have caught before opening this PR. All three were asymmetric-fix sites: the producer side was widened for external providers but the consumer side still gated on ggufContextLength (which is only ever set for the local llama-server runtime).

chat-page.tsx:1494: dropped the ggufContextLength precondition on the ContextUsageBar mount. The bar already tracks usage; let it decide what to render based on what it knows.
context-usage-bar.tsx: made total optional. When absent, drop the "/ total" ratio + percentage bar + "approaching limit" helper, and just show per-turn counters + cache stats. Bootstrap guard tightened so an all-zero, all-undefined state still renders nothing.
runtime-provider.tsx:836: external-provider rehydration was being rejected by the store.ggufContextLength check. Keep the "fits inside window" sanity check when a local context window IS known, drop it otherwise.
message-timing.tsx:125: the per-message timing popover used a separate "Cache hits" code path that only read llama-server's timings.cache_n. Fall through to custom.contextUsage for external providers, and add a parallel "Cache writes" line for Anthropic cache_creation events.
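
The optional-total behavior described for context-usage-bar.tsx can be sketched as a pure function. This is an illustration under assumptions, not the component's real code: the prop names used and cached, the buildBarView helper, and the 0.85 warning threshold are all hypothetical; only total and cacheWrites are named in this PR.

```typescript
// Hypothetical props; only `total` and `cacheWrites` are named in the PR text.
interface ContextUsageProps {
  used: number;
  total?: number;
  cached?: number;
  cacheWrites?: number;
}

interface BarView {
  label: string;
  percent?: number; // present only when a context window is known
  nearLimit: boolean;
  cacheLines: string[];
}

// Hypothetical helper; warnAt = 0.85 is an assumed threshold.
function buildBarView(props: ContextUsageProps, warnAt = 0.85): BarView | null {
  const { used, total, cached = 0, cacheWrites = 0 } = props;

  const cacheLines: string[] = [];
  if (cached > 0) cacheLines.push(`Cache hits: ${cached}`);
  if (cacheWrites > 0) cacheLines.push(`Cache writes: ${cacheWrites}`);

  if (total === undefined || total <= 0) {
    // Bootstrap guard: an all-zero, all-undefined state renders nothing.
    if (used <= 0 && cacheLines.length === 0) return null;
    // No known window: per-turn counters and cache stats only.
    return { label: `${used} tokens`, nearLimit: false, cacheLines };
  }

  // Known window: keep the ratio, percentage bar, and limit warning.
  const ratio = Math.min(used / total, 1);
  return {
    label: `${used} / ${total}`,
    percent: Math.round(ratio * 100),
    nearLimit: ratio >= warnAt,
    cacheLines,
  };
}

// External provider: no total, so no ratio or percentage.
console.log(buildBarView({ used: 5225, cacheWrites: 5217 }));
// Local llama-server: total known, so the ratio is kept.
console.log(buildBarView({ used: 2048, total: 4096, cached: 512 }));
```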

npm run typecheck is clean. The end-to-end Anthropic verification posted earlier still holds, with the consumer side now actually rendering.

The Anthropic / OpenAI Responses streaming paths already emit an include_usage-style SSE chunk carrying prompt_tokens_details.cached_tokens and cache_creation_input_tokens / cache_read_input_tokens (see _build_usage_chunk in external_provider.py), but the chat-adapter only read the local llama-server timings.cache_n field. As a result, the context-usage tooltip never showed cache hits or writes for external providers, even though the backend was computing them.

Read the external usage envelope as a fallback when timings.cache_n is absent, and surface Anthropic cache_creation_input_tokens as a separate "Cache writes" line in the tooltip so users can tell a cache miss from a cache hit on a turn that both reads and writes the cache.

- ServerUsage gains optional prompt_tokens_details.cached_tokens, cache_creation_input_tokens, cache_read_input_tokens.
- contextUsage store entry gains optional cacheWriteTokens.
- ContextUsageBar gains optional cacheWrites tooltip line.
- chat-page wires both fields through to the bar.

Reviewer round on the original PR caught three asymmetric-fix sites where the producer side surfaced external prompt-cache stats but the consumer side still gated on ggufContextLength (which is only ever set for the local llama-server runtime). Result: the entire cache-stats PR shipped invisible for Anthropic / OpenAI Responses / Gemini, which is exactly the set of providers it was added for.

- chat-page.tsx: drop the ggufContextLength precondition on the ContextUsageBar mount. The bar already tracks usage; let it decide what to render based on what it knows.
- context-usage-bar.tsx: make `total` optional. When absent, drop the "/ total" ratio + percentage progress bar + "approaching limit" helper, and just show per-turn counters + cache stats. Bootstrap guard tightened so an all-zero, all-undefined state still renders nothing.
- runtime-provider.tsx: external-provider rehydration was rejected by the `store.ggufContextLength` check. Keep the "fits inside window" sanity check when a local context window IS known, drop it when it isn't.
- message-timing.tsx: the per-message timing popover used a separate "Cache hits" code path that only read llama-server's timings.cache_n. Fall through to custom.contextUsage for external providers, and add a parallel "Cache writes" line for Anthropic cache_creation events.

Three follow-ups on #5736 so the relaxed external-provider render gate does not show stale token / cache stats from a different model:

1) setCheckpoint now clears contextUsage on a real checkpoint change. setActiveThreadId and clearCheckpoint already did this; the most-traveled transition path (the user switching models from the picker) leaked the prior turn's counts because they were never cleared.
2) The external-selection branch in chat-page.tsx now also clears contextUsage at the same time it nulls ggufContextLength / activeNativePathToken. Without this an in-session switch from a local model to an external provider would visibly carry the previous local turn's counters into the new provider's bar.
3) exitCompare's rehydration is now scoped: restore the saved usage only when the message's modelId matches the active checkpoint AND, for local turns where a context window is known, when the saved total fits inside that window. Without this the bar could render a stale local-model usage on top of an external provider, or an oversized usage object that exceeds the now-active window.

Typecheck clean.
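
The scoped rehydration rule from follow-up 3 can be sketched as a pure predicate. The SavedUsage and ActiveState shapes and the shouldRestoreUsage name are assumptions made for illustration; the real check lives in exitCompare in chat-page.tsx and may be structured differently.

```typescript
// Hypothetical shapes; only modelId, checkpoint, and ggufContextLength
// are names taken from the commit message.
interface SavedUsage {
  modelId?: string;
  totalTokens: number;
}

interface ActiveState {
  checkpoint: string;
  ggufContextLength: number | null; // null when no local window is known
}

// Hypothetical predicate: restore saved usage only when it belongs to the
// active model and, if a local window is known, fits inside that window.
function shouldRestoreUsage(saved: SavedUsage, active: ActiveState): boolean {
  if (saved.modelId !== active.checkpoint) return false;
  if (active.ggufContextLength !== null && saved.totalTokens > active.ggufContextLength) {
    return false;
  }
  return true;
}

// Model identifiers below are placeholders, not real checkpoints.
const saved: SavedUsage = { modelId: "local-model-a", totalTokens: 6000 };

// Same model, window large enough: restore.
console.log(shouldRestoreUsage(saved, { checkpoint: "local-model-a", ggufContextLength: 8192 }));
// Same model, but usage exceeds the now-active window: skip.
console.log(shouldRestoreUsage(saved, { checkpoint: "local-model-a", ggufContextLength: 4096 }));
// Different (external) model: skip, even though no window limits it.
console.log(shouldRestoreUsage(saved, { checkpoint: "external-model-b", ggufContextLength: null }));
```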

Follow-up to 042e0ac that catches four asymmetric-fix sites the checkpoint-scoping pass missed:

1) setParams now also clears contextUsage on a real checkpoint change. The local model load path in use-chat-model-runtime calls setParams(mergeBackendRecommendedInference(...)) which mutates params.checkpoint before refresh() eventually fires setCheckpoint; the intermediate window rendered the previous model's counters under the new checkpoint.
2) chat-adapter.ts setContextUsage on stream completion now gates on the captured params.checkpoint still being active. A late completion from provider A used to clobber the context bar after the user switched to provider B mid-stream.
3) chat-page.tsx exitCompare rehydration no longer accepts a saved modelId-stamped usage when the active checkpoint is empty. A user who entered compare, cleared the model, and exited compare would otherwise see the cleared model's stats reappear.
4) runtime-provider.tsx thread-load no longer restores legacy unscoped usage (no modelId) unless a local context window is known. With the relaxed external-provider render gate, old pre-PR persisted messages without a modelId stamp could attach their counts to an unrelated active provider.

Also switches message-timing.tsx cache-hit fallback from || to ?? so an explicit cache_n=0 is not replaced by a stale cachedTokens. Typecheck clean.
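
Two of the points above lend themselves to a small sketch: the stale-completion gate from item 2 and the || versus ?? change. The RuntimeStore class, onStreamComplete, and the two cacheHits helpers are hypothetical stand-ins for illustration; the real store and adapter are not shown in this PR conversation.

```typescript
// Hypothetical usage record and store, reduced to what the gate needs.
interface ContextUsage {
  promptTokens: number;
  cachedTokens: number;
  cacheWriteTokens: number;
}

class RuntimeStore {
  checkpoint = "";
  contextUsage: ContextUsage | null = null;

  // Clear usage only on a real checkpoint change.
  setCheckpoint(next: string): void {
    if (next !== this.checkpoint) {
      this.checkpoint = next;
      this.contextUsage = null;
    }
  }
}

// Hypothetical completion handler: apply usage only if the checkpoint that
// was captured when the stream started is still the active one.
function onStreamComplete(
  store: RuntimeStore,
  capturedCheckpoint: string,
  usage: ContextUsage,
): boolean {
  if (store.checkpoint !== capturedCheckpoint) return false; // stale, drop it
  store.contextUsage = usage;
  return true;
}

// `||` treats an explicit 0 as missing; `??` only falls through on
// null or undefined.
function cacheHitsWithOr(cacheN: number | undefined, fallback: number): number {
  return cacheN || fallback;
}
function cacheHitsWithNullish(cacheN: number | undefined, fallback: number): number {
  return cacheN ?? fallback;
}

// Provider names are placeholders.
const store = new RuntimeStore();
store.setCheckpoint("provider-a");
const captured = store.checkpoint; // stream starts against provider A
store.setCheckpoint("provider-b"); // user switches mid-stream

const applied = onStreamComplete(store, captured, {
  promptTokens: 5220,
  cachedTokens: 5217,
  cacheWriteTokens: 0,
});
console.log(applied, store.contextUsage); // late completion is dropped

console.log(cacheHitsWithOr(0, 5217)); // stale fallback leaks through
console.log(cacheHitsWithNullish(0, 5217)); // explicit 0 is kept
```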

chatgpt-codex-connector · 2026-05-26T01:09:27Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

danielhanchen · 2026-05-26T01:10:26Z

Round 4 cross-cutting fix: merged origin/main into this branch (no conflicts) to bring in PR #5735 (orphan tool_call XML strip widening + 263-line test_tool_xml_strip.py). All 8 PRs in this audit cohort had been forked off a pre-#5735 main, so a squash-merge of any of them would have silently reverted the widened _TOOL_XML_RE regex and deleted the dedicated test file. Verified: diff against origin/main now shows zero unintended changes to routes/inference.py and test_tool_xml_strip.py outside the actual PR scope.

chatgpt-codex-connector · 2026-05-26T05:50:33Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

danielhanchen requested a review from rolandtannous as a code owner May 23, 2026 05:15

gemini-code-assist Bot reviewed May 23, 2026

danielhanchen added 3 commits May 23, 2026 14:00

Studio: tighten cache-stats comments

b264466

danielhanchen force-pushed the feat/surface-cache-stats branch from d80e54c to b264466 May 23, 2026 14:00

danielhanchen added 3 commits May 25, 2026 07:37

Merge remote-tracking branch 'origin/main' into feat/surface-cache-stats

d59f720

Shorten cache-stats comments for PR #5736

ff66b74

danielhanchen merged commit cc68720 into main May 26, 2026
34 checks passed

danielhanchen deleted the feat/surface-cache-stats branch May 26, 2026 06:37

danielhanchen merged 7 commits into main from feat/surface-cache-stats
