
Improve Responses API captures and metrics#577

Closed
coder543 wants to merge 2 commits into mostlygeek:main from coder543:main

Conversation

@coder543 commented Mar 9, 2026

Right now, requests through the Responses API have no metrics at all, and the responses are not captured. Using codex through llama-swap is a little annoying because of this, since codex requires the Responses API.

So, a little PR to parse those stats out.

  • parse nested Responses usage payloads in streaming and non-streaming metrics
  • render Responses SSE captures with response and pretty JSON views
  • estimate only generation speed for usage-only streaming responses
  • leave prompt speed unknown without upstream timing/cache data (otherwise cached tokens make it look like your prompt processing speed is insanely fast)
  • add Responses-specific coverage for metrics fallback behavior
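
As a rough illustration of the first bullet: the usage object in a Responses payload can be nested (for example under a response field inside SSE events), so the parser has to search for it rather than read a fixed path. Here is a minimal sketch of that discovery step, written in TypeScript for brevity (the actual parser lives in Go, in proxy/metrics_monitor.go); the field names are assumptions, not taken from the implementation:

```typescript
type Usage = { input_tokens?: number; output_tokens?: number };

// Recursively search a decoded JSON payload for the first "usage" object,
// wherever it is nested. Returns null when no usage object exists.
function findUsage(payload: unknown): Usage | null {
  if (payload === null || typeof payload !== "object") return null;
  const obj = payload as Record<string, unknown>;
  const usage = obj["usage"];
  if (usage !== null && typeof usage === "object") return usage as Usage;
  for (const value of Object.values(obj)) {
    const found = findUsage(value);
    if (found) return found;
  }
  return null;
}
```

With this shape-agnostic lookup, both top-level usage fields and Responses-style nested ones resolve the same way.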

Hopefully this PR is useful.

Written by codex using gpt-5.4. Tested by hand to make sure it works. Code reviewed to make sure it seems reasonable.

coderabbitai bot commented Mar 9, 2026

Walkthrough

Reworks metrics collection to use a timing-aware internal type (responseTimingInfo), updates streaming parsing to discover usage/timings across varied payload shapes, adjusts response body timing capture, expands tests for streaming/non-streaming and edge cases, and consolidates SSE/chat rendering into a RenderedResponse pipeline in the UI. (50 words)

Changes

Cohort / File(s) — Summary

  • Metrics Timing & Parsing (proxy/metrics_monitor.go): Introduce responseTimingInfo (requestStart/firstWrite/lastWrite) and methods totalDuration()/generationDuration(); responseBodyCopier records/exposes timings via Timings() and StartTime now returns firstWrite; update processStreamingResponse and parseMetrics to accept timingInfo; add findMetricsPayload() and allowFallback handling; replace StartTime-based logic with timing-aware parsing.
  • Metrics Tests (proxy/metrics_monitor_test.go): Add extensive tests covering JSON and OpenAI-style SSE payloads, streaming and non-streaming responses, timing-present and timing-absent cases, fallback behavior for tokens/sec, chunked/non-chunked edge cases, single-write fallbacks, non-OK/invalid responses, and various encodings (gzip/deflate).
  • UI: SSE / Rendered Response (ui-svelte/src/components/CaptureDialog.svelte): Replace prior SSE chat state with a RenderedResponse accumulator and appendResponseOutput; update parsing to accumulate reasoning/content from SSE and JSON shapes; switch UI, copy logic, and tab visibility to use renderedResponse fields.
  • UI: Stats Calculation (ui-svelte/src/components/StatsPanel.svelte): Prefer backend-provided tokens_per_second when > 0; otherwise fall back to computed output_tokens/duration_ms to derive the tokens/sec used in percentiles/histograms.
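
The stats-panel preference order described in the last entry can be sketched as follows; effectiveTokensPerSecond is a hypothetical helper name, and the field names tokens_per_second, output_tokens, and duration_ms come from the summary above:

```typescript
interface MetricsRow {
  tokens_per_second: number;
  output_tokens: number;
  duration_ms: number;
}

// Prefer the backend-reported rate; otherwise derive one from output
// tokens and wall-clock duration. Returns null when neither is usable,
// so the row can be excluded from percentiles and histograms.
function effectiveTokensPerSecond(row: MetricsRow): number | null {
  if (row.tokens_per_second > 0) return row.tokens_per_second; // trust the backend value
  if (row.duration_ms > 0 && row.output_tokens > 0) {
    return row.output_tokens / (row.duration_ms / 1000); // computed fallback
  }
  return null;
}
```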

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • mostlygeek
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title accurately summarizes the main objective of the PR: improving support for the Responses API by adding metrics capture and parsing functionality.
  • Description check — ✅ Passed: The description is directly related to the changeset, detailing specific improvements for Responses API metrics parsing, SSE rendering, speed estimation logic, and test coverage.


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ui-svelte/src/components/CaptureDialog.svelte`:
- Around line 500-529: The "(empty)" fallback inside the respBody "pretty"
branch is unreachable because the outer condition checks respBodyTab ===
"pretty" && (renderedResponse.reasoning || renderedResponse.content); remove the
unreachable check and its <pre>(empty)</pre> fallback so the UI only renders
Reasoning and/or Response when renderedResponse.reasoning or
renderedResponse.content exist; locate the block guarded by respBodyTab ===
"pretty" and renderedResponse checks in CaptureDialog.svelte (use respBodyTab
and renderedResponse identifiers to find it) and delete the unreachable fallback
lines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7fb713c5-90d2-467c-acfb-c3467215c3ff

📥 Commits

Reviewing files that changed from the base of the PR and between cc77139 and cbfbbe7.

📒 Files selected for processing (3)
  • proxy/metrics_monitor.go
  • proxy/metrics_monitor_test.go
  • ui-svelte/src/components/CaptureDialog.svelte

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
ui-svelte/src/components/CaptureDialog.svelte (1)

471-484: Minor: Redundant conditions could be simplified.

Lines 471 and 478 use equivalent conditions (isResponseJson || isSSE vs isSSE || isResponseJson). Consider extracting to a derived variable for clarity, though this is a minor readability nit.

♻️ Optional refactor
+ let showPrettyRawTabs = $derived(isResponseJson || isSSE);

  // Then in template:
- {#if isResponseJson || isSSE}
+ {#if showPrettyRawTabs}
    <button ...>Pretty</button>
  {/if}
- {#if isSSE || isResponseJson}
+ {#if showPrettyRawTabs}
    <button ...>Raw</button>
  {/if}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 471 - 484,
Extract the repeated boolean expression (isResponseJson || isSSE) into a single
derived variable (e.g., showRespTabs) and use that variable in both conditional
blocks that render the "Pretty" and "Raw" buttons (where respBodyTab is
toggled); update the conditions around the two buttons (currently using
isResponseJson || isSSE and isSSE || isResponseJson) to use showRespTabs to
remove redundancy and improve readability while keeping respBodyTab,
isResponseJson and isSSE logic unchanged.
proxy/metrics_monitor.go (1)

505-521: StartTime() is semantically misleading now.

After this change it returns firstWrite, while the actual request start lives in Timings().requestStart. Renaming or removing StartTime() would make future duration code much harder to misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_monitor.go` around lines 505 - 521, StartTime() currently
returns firstWrite which is misleading; change the API so StartTime() returns
the actual request start (responseBodyCopier.requestStart) and add a new
FirstWriteTime() (or FirstWrite()) that returns firstWrite, or alternatively
rename StartTime() to FirstWriteTime() and update callers—ensure any code using
StartTime() for request start now uses Timings().requestStart or the new
StartTime() returning requestStart; update references to
responseBodyCopier.StartTime(), responseBodyCopier.Timings(), and any callers
accordingly to preserve correct semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/metrics_monitor.go`:
- Around line 428-431: The fallback TokensPerSecond estimation is being applied
for both streaming and non-streaming responses; restrict it so it only runs for
streaming usage-only streams by adding a boolean parameter (e.g., allowFallback
or isStreaming) to the function that computes TokensPerSecond where
timingInfo.generationDuration() is used, call it with true from
processStreamingResponse and false from the non-streaming response path, and
update the branch that sets tokensPerSecond (the block using generationDuration
and outputTokens) to only execute when that boolean is true; also add a
regression test that simulates a chunked non-streaming JSON response to ensure
no synthetic generation speed is recorded for non-streaming responses.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0a39fc5b-141c-4202-ae49-94fe6adca750

📥 Commits

Reviewing files that changed from the base of the PR and between cbfbbe7 and 8c54f41.

📒 Files selected for processing (3)
  • proxy/metrics_monitor.go
  • proxy/metrics_monitor_test.go
  • ui-svelte/src/components/CaptureDialog.svelte

@coder543 changed the title from "improve Responses API captures and metrics" to "Improve Responses API captures and metrics" on Mar 9, 2026
Handle llama-server Responses API captures and metrics more gracefully.

- parse nested Responses usage payloads in streaming and non-streaming metrics
- render Responses SSE captures with response and pretty JSON views
- estimate only generation speed for usage-only streaming responses
- leave prompt speed unknown without upstream timing/cache data
- add Responses-specific coverage for metrics fallback behavior
@coderabbitai bot left a comment

🧹 Nitpick comments (1)
ui-svelte/src/components/CaptureDialog.svelte (1)

262-270: Minor: Potential double newlines when only reasoning exists.

When renderedResponse.reasoning exists but renderedResponse.content is empty, the copy text will end with "\n\n" followed by an empty string. This is cosmetic but may leave trailing whitespace.

💡 Optional fix
   function getCopyText(): string {
     if (respBodyTab === "pretty" && (renderedResponse.reasoning || renderedResponse.content)) {
       let text = "";
-      if (renderedResponse.reasoning) text += renderedResponse.reasoning + "\n\n";
-      text += renderedResponse.content;
+      if (renderedResponse.reasoning) {
+        text += renderedResponse.reasoning;
+        if (renderedResponse.content) text += "\n\n";
+      }
+      if (renderedResponse.content) text += renderedResponse.content;
       return text;
     }
     return displayedResponseBody;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 262 - 270, The
getCopyText function may append "\n\n" even when renderedResponse.content is
empty, causing trailing whitespace; update getCopyText to only add the separator
between reasoning and content when both are present by checking
renderedResponse.content before concatenating the "\n\n", ensuring respBodyTab
=== "pretty" still returns the combined reasoning+content when both exist and
otherwise returns the single non-empty piece or displayedResponseBody.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06e26352-638a-41ae-88b0-0aeec38925fc

📥 Commits

Reviewing files that changed from the base of the PR and between 8c54f41 and bbbd718.

📒 Files selected for processing (3)
  • proxy/metrics_monitor.go
  • proxy/metrics_monitor_test.go
  • ui-svelte/src/components/CaptureDialog.svelte
🚧 Files skipped from review as they are similar to previous changes (1)
  • proxy/metrics_monitor_test.go

Restore the existing capture dialog behavior and fix aggregate throughput calculations for generated token metrics.

- restore Chat and Pretty tab semantics in the capture dialog
- keep JSON Pretty views formatted instead of rendered
- prefer backend tokens_per_second in the stats panel
@coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ui-svelte/src/components/StatsPanel.svelte (1)

57-66: ⚠️ Potential issue | 🟡 Minor

Edge case: histogram calculation fails when all values are identical.

When min === max (all tokens-per-second values are the same), binSize becomes 0, causing (value - min) / binSize to produce NaN or Infinity. This results in incorrect bin indexing.

🛡️ Suggested fix
     const min = Math.min(...tokensPerSecond);
     const max = Math.max(...tokensPerSecond);
+    
+    // Handle edge case where all values are identical
+    if (min === max) {
+      const histogramData: HistogramData = {
+        bins: [tokensPerSecond.length],
+        min,
+        max,
+        binSize: 1,
+        p99,
+        p95,
+        p50,
+      };
+      return {
+        totalRequests,
+        totalInputTokens,
+        totalOutputTokens,
+        inFlightRequests: $inFlightRequests,
+        tokenStats: {
+          p99: p99.toFixed(2),
+          p95: p95.toFixed(2),
+          p50: p50.toFixed(2),
+        },
+        histogramData,
+      };
+    }
+    
     const binCount = Math.min(30, Math.max(10, Math.floor(tokensPerSecond.length / 5)));
     const binSize = (max - min) / binCount;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/StatsPanel.svelte` around lines 57 - 66, The
histogram code in StatsPanel.svelte can divide by zero when min === max; modify
the logic around min, max, binCount, and binSize so that when max === min you
avoid binSize = 0: either set binCount = 1 (and populate bins[0] =
tokensPerSecond.length) or set binSize = 1 and explicitly assign every value to
a valid bin index (e.g., 0 or binCount-1); update the block that computes
binSize and the tokensPerSecond.forEach bin assignment to handle this branch and
ensure no NaN/Infinity binIndex is produced.
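
A self-contained sketch of the guarded histogram, with buildHistogram as a hypothetical name and the bin-count heuristic taken from the snippet above; the min === max branch collapses everything into a single bin so binSize is never 0:

```typescript
interface Histogram {
  bins: number[];
  min: number;
  max: number;
  binSize: number;
}

function buildHistogram(values: number[]): Histogram {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (min === max) {
    // All samples identical: one bin holding every sample avoids binSize = 0,
    // which would otherwise make (value - min) / binSize evaluate to NaN (0 / 0).
    return { bins: [values.length], min, max, binSize: 1 };
  }
  const binCount = Math.min(30, Math.max(10, Math.floor(values.length / 5)));
  const binSize = (max - min) / binCount;
  const bins = new Array(binCount).fill(0);
  for (const v of values) {
    // Clamp so v === max lands in the last bin rather than at index binCount.
    bins[Math.min(binCount - 1, Math.floor((v - min) / binSize))] += 1;
  }
  return { bins, min, max, binSize };
}
```
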
🧹 Nitpick comments (2)
ui-svelte/src/components/CaptureDialog.svelte (2)

34-38: Potential reactivity dependency issue: renderedResponse used in $effect before being declared.

In Svelte 5, $effect callbacks capture reactive dependencies. Here, renderedResponse (derived at line 283) is used inside the effect at line 34, but renderedResponse depends on responseBodyRaw which depends on capture. Since capture is already a dependency of this effect block, the reactive graph should work correctly. However, accessing a derived value that transitively depends on the same trigger (capture) within the same effect can lead to subtle timing issues if the derivations haven't re-evaluated yet.

Consider whether the intent is to react to capture changes or to renderedResponse changes. If the latter, this could be clearer:

$effect(() => {
  if (capture && renderedResponse) {
    // explicit dependency
  }
});

This appears to work correctly in practice, but worth verifying the tab selection behaves as expected when capture changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 34 - 38, The
$effect uses renderedResponse before it's declared which can cause timing
issues; change the effect so it explicitly depends on renderedResponse (not just
capture) by referencing renderedResponse inside the effect (e.g., guard with if
(capture && renderedResponse) ...) and compute respBodyTab using
renderedResponse, isSSE and respCt so the reactive graph clearly tracks
renderedResponse changes; update the effect containing respBodyTab to read
renderedResponse directly to ensure proper re-evaluation when
responseBodyRaw/capture change.

283-287: Consider supporting non-streaming chat response rendering.

Currently, renderedResponse is only populated for SSE responses. Non-streaming chat completion responses (which return JSON directly) won't have the Chat tab available even though they contain the same logical content structure.

If this is intentional (SSE captures only), the current implementation is correct. If you'd like to support the Chat view for non-streaming responses as well:

♻️ Optional enhancement
 let renderedResponse = $derived.by(() => {
   if (!responseBodyRaw) return { reasoning: "", content: "" } as RenderedResponse;
   if (isSSE) return parseSSEChat(responseBodyRaw);
+  if (isResponseJson) {
+    try {
+      const parsed = JSON.parse(responseBodyRaw);
+      return appendResponseOutput({ reasoning: "", content: "" }, parsed);
+    } catch {
+      return { reasoning: "", content: "" } as RenderedResponse;
+    }
+  }
   return { reasoning: "", content: "" } as RenderedResponse;
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 283 - 287,
renderedResponse only handles SSE via parseSSEChat so non-streaming JSON chat
completions (responseBodyRaw when isSSE is false) are ignored; update the
$derived.by callback for renderedResponse to detect non-streaming chat JSON and
parse it into the same RenderedResponse shape (either by calling a new helper
like parseNonStreamingChat or by JSON.parse + mapping to { reasoning, content })
while keeping existing parseSSEChat for isSSE paths; reference
variables/functions: renderedResponse, responseBodyRaw, isSSE, parseSSEChat,
RenderedResponse, $derived.by.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e14aef3c-b50b-48de-80cc-ec36af6bd864

📥 Commits

Reviewing files that changed from the base of the PR and between bbbd718 and b829ca7.

📒 Files selected for processing (2)
  • ui-svelte/src/components/CaptureDialog.svelte
  • ui-svelte/src/components/StatsPanel.svelte

@mostlygeek (Owner)

Is your upstream a local server or a peer inference endpoint?

llama-server does not return any timing information for v1/responses, only usage information, so that is what llama-swap shows:
[screenshot]

The captures seem to be alright:
[screenshot]

@coder543 (Author)

Mine is a local llama-server, but I have never seen llama-swap show any stats or captures for requests sent via the Responses API until the changes above were implemented.

Are you using streaming Responses API queries, or non-streaming? Codex only supports streaming, and that's where I've seen only zeroes and missing captures. I have not tried non-streaming Responses API queries.

mostlygeek added a commit that referenced this pull request Mar 13, 2026
properly parse anthropic compatible usage data from streaming
responses.

closes: #577