
Improve Responses API captures and metrics#577

Closed
coder543 wants to merge 2 commits into mostlygeek:main from coder543:main

Conversation

@coder543 commented Mar 9, 2026

Right now, requests through the Responses API have no metrics at all, and the responses are not captured. Using codex through llama-swap is a little annoying because of this, since codex requires the Responses API.

So, a little PR to parse those stats out.

  • parse nested Responses usage payloads in streaming and non-streaming metrics
  • render Responses SSE captures with response and pretty JSON views
  • estimate only generation speed for usage-only streaming responses
  • leave prompt speed unknown without upstream timing/cache data (otherwise cached tokens make it look like your prompt processing speed is insanely fast)
  • add Responses-specific coverage for metrics fallback behavior
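
As a rough illustration of the first bullet: the usage object in a Responses payload can be nested (for example under a response field inside SSE events), so the parser has to search for it rather than read a fixed path. Here is a minimal sketch of that discovery step, written in TypeScript for brevity (the actual parser lives in Go, in proxy/metrics_monitor.go); the field names are assumptions, not taken from the implementation:

```typescript
type Usage = { input_tokens?: number; output_tokens?: number };

// Recursively search a decoded JSON payload for the first "usage" object,
// wherever it is nested. Returns null when no usage object exists.
function findUsage(payload: unknown): Usage | null {
  if (payload === null || typeof payload !== "object") return null;
  const obj = payload as Record<string, unknown>;
  const usage = obj["usage"];
  if (usage !== null && typeof usage === "object") return usage as Usage;
  for (const value of Object.values(obj)) {
    const found = findUsage(value);
    if (found) return found;
  }
  return null;
}
```

With this shape-agnostic lookup, both top-level usage fields and Responses-style nested ones resolve the same way.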

Hopefully this PR is useful.

Written by codex using gpt-5.4. Tested by hand to make sure it works. Code reviewed to make sure it seems reasonable.

coderabbitai bot commented Mar 9, 2026

Walkthrough

Reworks metrics collection to use a timing-aware internal type (responseTimingInfo), updates streaming parsing to discover usage/timings across varied payload shapes, adjusts response body timing capture, expands tests for streaming/non-streaming and edge cases, and consolidates SSE/chat rendering into a RenderedResponse pipeline in the UI. (50 words)

Changes

Cohort / File(s) — Summary

  • Metrics Timing & Parsing (proxy/metrics_monitor.go): Introduce responseTimingInfo (requestStart/firstWrite/lastWrite) and methods totalDuration()/generationDuration(); responseBodyCopier records/exposes timings via Timings() and StartTime now returns firstWrite; update processStreamingResponse and parseMetrics to accept timingInfo; add findMetricsPayload() and allowFallback handling; replace StartTime-based logic with timing-aware parsing.
  • Metrics Tests (proxy/metrics_monitor_test.go): Add extensive tests covering JSON and OpenAI-style SSE payloads, streaming and non-streaming responses, timing-present and timing-absent cases, fallback behavior for tokens/sec, chunked/non-chunked edge cases, single-write fallbacks, non-OK/invalid responses, and various encodings (gzip/deflate).
  • UI: SSE / Rendered Response (ui-svelte/src/components/CaptureDialog.svelte): Replace prior SSE chat state with a RenderedResponse accumulator and appendResponseOutput; update parsing to accumulate reasoning/content from SSE and JSON shapes; switch UI, copy logic, and tab visibility to use renderedResponse fields.
  • UI: Stats Calculation (ui-svelte/src/components/StatsPanel.svelte): Prefer backend-provided tokens_per_second when > 0; otherwise fall back to computed output_tokens/duration_ms to derive the tokens/sec used in percentiles/histograms.
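
The stats-panel preference order described in the last entry can be sketched as follows; effectiveTokensPerSecond is a hypothetical helper name, and the field names tokens_per_second, output_tokens, and duration_ms come from the summary above:

```typescript
interface MetricsRow {
  tokens_per_second: number;
  output_tokens: number;
  duration_ms: number;
}

// Prefer the backend-reported rate; otherwise derive one from output
// tokens and wall-clock duration. Returns null when neither is usable,
// so the row can be excluded from percentiles and histograms.
function effectiveTokensPerSecond(row: MetricsRow): number | null {
  if (row.tokens_per_second > 0) return row.tokens_per_second; // trust the backend value
  if (row.duration_ms > 0 && row.output_tokens > 0) {
    return row.output_tokens / (row.duration_ms / 1000); // computed fallback
  }
  return null;
}
```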

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • mostlygeek
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title accurately summarizes the main objective of the PR: improving support for the Responses API by adding metrics capture and parsing functionality.
  • Description check — ✅ Passed: The description is directly related to the changeset, detailing specific improvements for Responses API metrics parsing, SSE rendering, speed estimation logic, and test coverage.


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ui-svelte/src/components/CaptureDialog.svelte`:
- Around line 500-529: The "(empty)" fallback inside the respBody "pretty"
branch is unreachable because the outer condition checks respBodyTab ===
"pretty" && (renderedResponse.reasoning || renderedResponse.content); remove the
unreachable check and its <pre>(empty)</pre> fallback so the UI only renders
Reasoning and/or Response when renderedResponse.reasoning or
renderedResponse.content exist; locate the block guarded by respBodyTab ===
"pretty" and renderedResponse checks in CaptureDialog.svelte (use respBodyTab
and renderedResponse identifiers to find it) and delete the unreachable fallback
lines.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7fb713c5-90d2-467c-acfb-c3467215c3ff

📥 Commits

Reviewing files that changed from the base of the PR and between cc77139 and cbfbbe7.

📒 Files selected for processing (3)
  • proxy/metrics_monitor.go
  • proxy/metrics_monitor_test.go
  • ui-svelte/src/components/CaptureDialog.svelte

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
ui-svelte/src/components/CaptureDialog.svelte (1)

471-484: Minor: Redundant conditions could be simplified.

Lines 471 and 478 use equivalent conditions (isResponseJson || isSSE vs isSSE || isResponseJson). Consider extracting to a derived variable for clarity, though this is a minor readability nit.

♻️ Optional refactor
+ let showPrettyRawTabs = $derived(isResponseJson || isSSE);

  // Then in template:
- {#if isResponseJson || isSSE}
+ {#if showPrettyRawTabs}
    <button ...>Pretty</button>
  {/if}
- {#if isSSE || isResponseJson}
+ {#if showPrettyRawTabs}
    <button ...>Raw</button>
  {/if}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 471 - 484,
Extract the repeated boolean expression (isResponseJson || isSSE) into a single
derived variable (e.g., showRespTabs) and use that variable in both conditional
blocks that render the "Pretty" and "Raw" buttons (where respBodyTab is
toggled); update the conditions around the two buttons (currently using
isResponseJson || isSSE and isSSE || isResponseJson) to use showRespTabs to
remove redundancy and improve readability while keeping respBodyTab,
isResponseJson and isSSE logic unchanged.
proxy/metrics_monitor.go (1)

505-521: StartTime() is semantically misleading now.

After this change it returns firstWrite, while the actual request start lives in Timings().requestStart. Renaming or removing StartTime() would make future duration code much harder to misuse.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_monitor.go` around lines 505 - 521, StartTime() currently
returns firstWrite which is misleading; change the API so StartTime() returns
the actual request start (responseBodyCopier.requestStart) and add a new
FirstWriteTime() (or FirstWrite()) that returns firstWrite, or alternatively
rename StartTime() to FirstWriteTime() and update callers—ensure any code using
StartTime() for request start now uses Timings().requestStart or the new
StartTime() returning requestStart; update references to
responseBodyCopier.StartTime(), responseBodyCopier.Timings(), and any callers
accordingly to preserve correct semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/metrics_monitor.go`:
- Around line 428-431: The fallback TokensPerSecond estimation is being applied
for both streaming and non-streaming responses; restrict it so it only runs for
streaming usage-only streams by adding a boolean parameter (e.g., allowFallback
or isStreaming) to the function that computes TokensPerSecond where
timingInfo.generationDuration() is used, call it with true from
processStreamingResponse and false from the non-streaming response path, and
update the branch that sets tokensPerSecond (the block using generationDuration
and outputTokens) to only execute when that boolean is true; also add a
regression test that simulates a chunked non-streaming JSON response to ensure
no synthetic generation speed is recorded for non-streaming responses.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0a39fc5b-141c-4202-ae49-94fe6adca750

📥 Commits

Reviewing files that changed from the base of the PR and between cbfbbe7 and 8c54f41.

📒 Files selected for processing (3)
  • proxy/metrics_monitor.go
  • proxy/metrics_monitor_test.go
  • ui-svelte/src/components/CaptureDialog.svelte

@coder543 changed the title from "improve Responses API captures and metrics" to "Improve Responses API captures and metrics" on Mar 9, 2026
Handle llama-server Responses API captures and metrics more gracefully.

- parse nested Responses usage payloads in streaming and non-streaming metrics
- render Responses SSE captures with response and pretty JSON views
- estimate only generation speed for usage-only streaming responses
- leave prompt speed unknown without upstream timing/cache data
- add Responses-specific coverage for metrics fallback behavior
@coderabbitai bot left a comment

🧹 Nitpick comments (1)
ui-svelte/src/components/CaptureDialog.svelte (1)

262-270: Minor: Potential double newlines when only reasoning exists.

When renderedResponse.reasoning exists but renderedResponse.content is empty, the copy text will end with "\n\n" followed by an empty string. This is cosmetic but may leave trailing whitespace.

💡 Optional fix
   function getCopyText(): string {
     if (respBodyTab === "pretty" && (renderedResponse.reasoning || renderedResponse.content)) {
       let text = "";
-      if (renderedResponse.reasoning) text += renderedResponse.reasoning + "\n\n";
-      text += renderedResponse.content;
+      if (renderedResponse.reasoning) {
+        text += renderedResponse.reasoning;
+        if (renderedResponse.content) text += "\n\n";
+      }
+      if (renderedResponse.content) text += renderedResponse.content;
       return text;
     }
     return displayedResponseBody;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 262 - 270, The
getCopyText function may append "\n\n" even when renderedResponse.content is
empty, causing trailing whitespace; update getCopyText to only add the separator
between reasoning and content when both are present by checking
renderedResponse.content before concatenating the "\n\n", ensuring respBodyTab
=== "pretty" still returns the combined reasoning+content when both exist and
otherwise returns the single non-empty piece or displayedResponseBody.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06e26352-638a-41ae-88b0-0aeec38925fc

📥 Commits

Reviewing files that changed from the base of the PR and between 8c54f41 and bbbd718.

📒 Files selected for processing (3)
  • proxy/metrics_monitor.go
  • proxy/metrics_monitor_test.go
  • ui-svelte/src/components/CaptureDialog.svelte
🚧 Files skipped from review as they are similar to previous changes (1)
  • proxy/metrics_monitor_test.go

Restore the existing capture dialog behavior and fix aggregate throughput calculations for generated token metrics.

- restore Chat and Pretty tab semantics in the capture dialog
- keep JSON Pretty views formatted instead of rendered
- prefer backend tokens_per_second in the stats panel
@coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
ui-svelte/src/components/StatsPanel.svelte (1)

57-66: ⚠️ Potential issue | 🟡 Minor

Edge case: histogram calculation fails when all values are identical.

When min === max (all tokens-per-second values are the same), binSize becomes 0, causing (value - min) / binSize to produce NaN or Infinity. This results in incorrect bin indexing.

🛡️ Suggested fix
     const min = Math.min(...tokensPerSecond);
     const max = Math.max(...tokensPerSecond);
+    
+    // Handle edge case where all values are identical
+    if (min === max) {
+      const histogramData: HistogramData = {
+        bins: [tokensPerSecond.length],
+        min,
+        max,
+        binSize: 1,
+        p99,
+        p95,
+        p50,
+      };
+      return {
+        totalRequests,
+        totalInputTokens,
+        totalOutputTokens,
+        inFlightRequests: $inFlightRequests,
+        tokenStats: {
+          p99: p99.toFixed(2),
+          p95: p95.toFixed(2),
+          p50: p50.toFixed(2),
+        },
+        histogramData,
+      };
+    }
+    
     const binCount = Math.min(30, Math.max(10, Math.floor(tokensPerSecond.length / 5)));
     const binSize = (max - min) / binCount;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/StatsPanel.svelte` around lines 57 - 66, The
histogram code in StatsPanel.svelte can divide by zero when min === max; modify
the logic around min, max, binCount, and binSize so that when max === min you
avoid binSize = 0: either set binCount = 1 (and populate bins[0] =
tokensPerSecond.length) or set binSize = 1 and explicitly assign every value to
a valid bin index (e.g., 0 or binCount-1); update the block that computes
binSize and the tokensPerSecond.forEach bin assignment to handle this branch and
ensure no NaN/Infinity binIndex is produced.
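
A self-contained sketch of the guarded histogram, with buildHistogram as a hypothetical name and the bin-count heuristic taken from the snippet above; the min === max branch collapses everything into a single bin so binSize is never 0:

```typescript
interface Histogram {
  bins: number[];
  min: number;
  max: number;
  binSize: number;
}

function buildHistogram(values: number[]): Histogram {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (min === max) {
    // All samples identical: one bin holding every sample avoids binSize = 0,
    // which would otherwise make (value - min) / binSize evaluate to NaN (0 / 0).
    return { bins: [values.length], min, max, binSize: 1 };
  }
  const binCount = Math.min(30, Math.max(10, Math.floor(values.length / 5)));
  const binSize = (max - min) / binCount;
  const bins = new Array(binCount).fill(0);
  for (const v of values) {
    // Clamp so v === max lands in the last bin rather than at index binCount.
    bins[Math.min(binCount - 1, Math.floor((v - min) / binSize))] += 1;
  }
  return { bins, min, max, binSize };
}
```
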
🧹 Nitpick comments (2)
ui-svelte/src/components/CaptureDialog.svelte (2)

34-38: Potential reactivity dependency issue: renderedResponse used in $effect before being declared.

In Svelte 5, $effect callbacks capture reactive dependencies. Here, renderedResponse (derived at line 283) is used inside the effect at line 34, but renderedResponse depends on responseBodyRaw which depends on capture. Since capture is already a dependency of this effect block, the reactive graph should work correctly. However, accessing a derived value that transitively depends on the same trigger (capture) within the same effect can lead to subtle timing issues if the derivations haven't re-evaluated yet.

Consider whether the intent is to react to capture changes or to renderedResponse changes. If the latter, this could be clearer:

$effect(() => {
  if (capture && renderedResponse) {
    // explicit dependency
  }
});

This appears to work correctly in practice, but worth verifying the tab selection behaves as expected when capture changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 34 - 38, The
$effect uses renderedResponse before it's declared which can cause timing
issues; change the effect so it explicitly depends on renderedResponse (not just
capture) by referencing renderedResponse inside the effect (e.g., guard with if
(capture && renderedResponse) ...) and compute respBodyTab using
renderedResponse, isSSE and respCt so the reactive graph clearly tracks
renderedResponse changes; update the effect containing respBodyTab to read
renderedResponse directly to ensure proper re-evaluation when
responseBodyRaw/capture change.

283-287: Consider supporting non-streaming chat response rendering.

Currently, renderedResponse is only populated for SSE responses. Non-streaming chat completion responses (which return JSON directly) won't have the Chat tab available even though they contain the same logical content structure.

If this is intentional (SSE captures only), the current implementation is correct. If you'd like to support the Chat view for non-streaming responses as well:

♻️ Optional enhancement
 let renderedResponse = $derived.by(() => {
   if (!responseBodyRaw) return { reasoning: "", content: "" } as RenderedResponse;
   if (isSSE) return parseSSEChat(responseBodyRaw);
+  if (isResponseJson) {
+    try {
+      const parsed = JSON.parse(responseBodyRaw);
+      return appendResponseOutput({ reasoning: "", content: "" }, parsed);
+    } catch {
+      return { reasoning: "", content: "" } as RenderedResponse;
+    }
+  }
   return { reasoning: "", content: "" } as RenderedResponse;
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ui-svelte/src/components/CaptureDialog.svelte` around lines 283 - 287,
renderedResponse only handles SSE via parseSSEChat so non-streaming JSON chat
completions (responseBodyRaw when isSSE is false) are ignored; update the
$derived.by callback for renderedResponse to detect non-streaming chat JSON and
parse it into the same RenderedResponse shape (either by calling a new helper
like parseNonStreamingChat or by JSON.parse + mapping to { reasoning, content })
while keeping existing parseSSEChat for isSSE paths; reference
variables/functions: renderedResponse, responseBodyRaw, isSSE, parseSSEChat,
RenderedResponse, $derived.by.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e14aef3c-b50b-48de-80cc-ec36af6bd864

📥 Commits

Reviewing files that changed from the base of the PR and between bbbd718 and b829ca7.

📒 Files selected for processing (2)
  • ui-svelte/src/components/CaptureDialog.svelte
  • ui-svelte/src/components/StatsPanel.svelte

@mostlygeek (Owner)

Is your upstream a local server or a peer inference endpoint?

llama-server does not return any timing information for v1/responses, only usage information, so that is what llama-swap shows:
[screenshot]

The captures seem to be alright:
[screenshot]

@coder543 (Author)

Mine is a local llama-server, but I have never seen llama-swap show any stats or captures for requests sent via the Responses API until the changes above were implemented.

Are you using streaming Responses API queries, or non-streaming? Codex only supports streaming, and that's where I've seen only zeroes and missing captures. I have not tried non-streaming Responses API queries.

mostlygeek added a commit that referenced this pull request Mar 13, 2026
properly parse anthropic compatible usage data from streaming
responses.

closes: #577