Skip to content

proxy: fix metrics for non-llama.cpp backends (vLLM) and correct wall-clock timing#701

Closed
efschu wants to merge 2 commits into
mostlygeek:mainfrom
efschu:fix-vllm-metrics
Closed

proxy: fix metrics for non-llama.cpp backends (vLLM) and correct wall-clock timing#701
efschu wants to merge 2 commits into
mostlygeek:mainfrom
efschu:fix-vllm-metrics

Conversation

@efschu
Copy link
Copy Markdown

@efschu efschu commented Apr 24, 2026

Problem

  1. recorder.StartTime() is only set on the first Write() call, which happens after response headers are sent. For non-streaming JSON responses this means wallDurationMs is ~0, resulting in meaningless speed metrics.
  2. vLLM (and other non-llama.cpp backends) do not return a timings field in their response, so prompt_per_second and tokens_per_second stay at -1.0.

Fix

  1. Capture requestStart := time.Now() before calling next() to get accurate wall-clock duration for all request types.
  2. Add fallback in parseMetrics: when timings is absent, estimate speeds from inputTokens/durationSec and outputTokens/durationSec.

Testing

  • vLLM 35B TurboQuant: before fix prompt_per_second: -1, after fix prompt_per_second: 53.6 (warm cache, 224ms duration)
  • llama.cpp metrics unchanged (still uses precise timings field when available)

…timing

- Capture requestStart before next() call instead of recorder.StartTime()
  which is only set on first Write(), causing ~0ms duration for non-streaming
- Add fallback in parseMetrics: when timings field is absent (vLLM, etc.),
  estimate prompt_per_second and tokens_per_second from wall clock duration
- Fixes -1 values shown in Activity page for vLLM models
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Walkthrough

Metrics monitor now accepts a config.Config, maps model IDs to display names via configured aliases, captures a wall-clock requestStart before proxying and uses it for token duration and parsing, and falls back to estimating per-second rates from wall-clock duration when timings are absent.

Changes

Cohort / File(s) Summary
Metrics monitor core
proxy/metrics_monitor.go
Constructor now takes config.Config; metricsMonitor maps model IDs to display names via first alias. wrapHandler snapshots wall-clock requestStart and forwards it to streaming and non-streaming parsing. parseMetrics uses the wall-clock duration and falls back to estimating prompt_per_second and tokens_per_second from usage token counts when timings are missing.
Tests updated
proxy/metrics_monitor_test.go
All newMetricsMonitor test/benchmark call sites updated to pass a config.Config{} argument; no test logic changed beyond constructor usage.
Proxy initialization
proxy/proxymanager.go
ProxyManager.New now constructs metricsMonitor by passing the full proxyConfig into newMetricsMonitor (changed constructor parameterization).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • mostlygeek
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main changes: fixing metrics for non-llama.cpp backends (vLLM) and correcting wall-clock timing issues.
Description check ✅ Passed The description provides clear context about the problem (recorder.StartTime() timing issues and missing timings field in vLLM responses), the fix (capturing requestStart before proxying and adding fallback calculations), and testing evidence.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/metrics_monitor.go`:
- Around line 487-496: The fallback calculation for promptPerSecond and
tokensPerSecond (when timings.Exists() is false) uses wallDurationMs for both
rates, conflating prompt evaluation time and decode time; update the code around
timings, wallDurationMs, promptPerSecond, tokensPerSecond, inputTokens and
outputTokens to make this explicit: either expand the comment to state these are
approximate rates computed over total wall time (including both prompt and
decode phases) or add a boolean/flag (e.g., estimatedRates or
ratesAreApproximate) or separate field to indicate “estimated vs measured” so
downstream consumers know these values are biased and not directly comparable to
phase-specific metrics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 99d4370b-6c08-4ce2-ba2d-149811e994c7

📥 Commits

Reviewing files that changed from the base of the PR and between 3cd7837 and e29a602.

📒 Files selected for processing (1)
  • proxy/metrics_monitor.go

Comment thread proxy/metrics_monitor.go
Comment on lines +487 to +496
// Fallback: estimate speeds from wall clock when timings unavailable (e.g., vLLM)
if !timings.Exists() && wallDurationMs > 0 {
durationSec := float64(wallDurationMs) / 1000.0
if inputTokens > 0 {
promptPerSecond = float64(inputTokens) / durationSec
}
if outputTokens > 0 {
tokensPerSecond = float64(outputTokens) / durationSec
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fallback speeds conflate prompt eval with token generation.

When timings is absent, both rates are divided by the same wall-clock duration that encompasses prompt processing and decode. This produces:

  • promptPerSecond that is far lower than the true prompt-eval rate (prompt eval usually occupies a small fraction of the total request).
  • tokensPerSecond that is biased low by the prompt time (i.e. effectively outputTokens / (prompt_ms + decode_ms) rather than outputTokens / decode_ms).

Given the backend doesn't expose per-phase timings, this is an acceptable approximation — but worth making the caveat explicit so downstream dashboards don't treat these as directly comparable to llama.cpp's prompt_per_second / predicted_per_second. Consider expanding the comment (e.g., "approximate rates over total wall time; includes both prompt and decode phases") or emitting a separate flag/field to distinguish estimated vs. measured rates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_monitor.go` around lines 487 - 496, The fallback calculation
for promptPerSecond and tokensPerSecond (when timings.Exists() is false) uses
wallDurationMs for both rates, conflating prompt evaluation time and decode
time; update the code around timings, wallDurationMs, promptPerSecond,
tokensPerSecond, inputTokens and outputTokens to make this explicit: either
expand the comment to state these are approximate rates computed over total wall
time (including both prompt and decode phases) or add a boolean/flag (e.g.,
estimatedRates or ratesAreApproximate) or separate field to indicate “estimated
vs measured” so downstream consumers know these values are biased and not
directly comparable to phase-specific metrics.

The activity table now displays the model alias (e.g., qwen3.6-35b-a3b-uncensored) instead of the internal model ID (e.g., llama-cpp-5090) when aliases are configured.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
proxy/metrics_monitor.go (1)

136-139: Consider guarding against empty/whitespace alias strings.

modelConfig.Aliases[0] is used as the display name unconditionally. If a user configures an empty or whitespace-only alias entry in YAML, the metric's Model field will silently become empty, which is worse than falling back to the model ID. Other parts of the codebase (e.g., listModelsHandler at proxy/proxymanager.go lines 573–577) already guard with strings.TrimSpace before using aliases.

🔧 Suggested fix
-	// Resolve modelID to display name (first alias or modelID itself)
-	if modelConfig, exists := mp.config.Models[metric.Model]; exists && len(modelConfig.Aliases) > 0 {
-		metric.Model = modelConfig.Aliases[0]
-	}
+	// Resolve modelID to display name (first non-empty alias, falling back to the model ID)
+	if modelConfig, exists := mp.config.Models[metric.Model]; exists {
+		for _, alias := range modelConfig.Aliases {
+			if a := strings.TrimSpace(alias); a != "" {
+				metric.Model = a
+				break
+			}
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_monitor.go` around lines 136 - 139, The current resolution sets
metric.Model to modelConfig.Aliases[0] unconditionally which can assign an
empty/whitespace string; change the logic in the block that reads
mp.config.Models and modelConfig.Aliases so you pick the first alias whose
strings.TrimSpace(alias) != "" (or if none found, leave metric.Model as the
original metric.Model/modelID) — i.e., iterate modelConfig.Aliases, use the
first non-empty trimmed alias as the display name, otherwise fall back to the
existing metric.Model; reference mp.config.Models, metric.Model and
modelConfig.Aliases when locating the code to update.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@proxy/metrics_monitor.go`:
- Around line 16-19: The file fails gofmt due to import grouping/spacing and
misaligned struct/map fields: reorder the imports in metrics_monitor.go to
standard grouping (stdlib, external, then internal) and ensure the inserted
config import (github.com/mostlygeek/llama-swap/proxy/config) sits with other
external/internal imports consistent with the file's style; fix the tab
alignment of the config field in the metricsMonitor struct (symbol:
metricsMonitor) and the corresponding config: key in the NewMetricsMonitor (or
equivalent initializer) so field alignment matches mu, metrics, logger, etc.;
then run gofmt -w proxy/metrics_monitor.go and verify with gofmt -l . before
committing.

---

Nitpick comments:
In `@proxy/metrics_monitor.go`:
- Around line 136-139: The current resolution sets metric.Model to
modelConfig.Aliases[0] unconditionally which can assign an empty/whitespace
string; change the logic in the block that reads mp.config.Models and
modelConfig.Aliases so you pick the first alias whose strings.TrimSpace(alias)
!= "" (or if none found, leave metric.Model as the original
metric.Model/modelID) — i.e., iterate modelConfig.Aliases, use the first
non-empty trimmed alias as the display name, otherwise fall back to the existing
metric.Model; reference mp.config.Models, metric.Model and modelConfig.Aliases
when locating the code to update.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 807032b6-fbc1-4a85-bdcd-d419a19ed7a1

📥 Commits

Reviewing files that changed from the base of the PR and between e29a602 and 3f58ca3.

📒 Files selected for processing (3)
  • proxy/metrics_monitor.go
  • proxy/metrics_monitor_test.go
  • proxy/proxymanager.go
✅ Files skipped from review due to trivial changes (1)
  • proxy/metrics_monitor_test.go

Comment thread proxy/metrics_monitor.go
Comment on lines +16 to 19
"github.com/mostlygeek/llama-swap/proxy/config"
"github.com/klauspost/compress/zstd"
"github.com/mostlygeek/llama-swap/event"
"github.com/tidwall/gjson"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix gofmt formatting failures.

CI's gofmt -l . step failed on this file. Two likely culprits in the changed lines:

  1. Import block ordering/spacing at lines 16–19 mixes external modules (config inserted between gin-gonic/gin and klauspost/compress/zstd) in a way that appears to disagree with the existing formatting of this file.
  2. The config field added to metricsMonitor at line 99 and the config: key at line 118 don't match the tab alignment used for the surrounding fields (mu, metrics, logger, etc.), which gofmt will re-flow.

Please run gofmt -w proxy/metrics_monitor.go (and double-check with gofmt -l .) before pushing.

As per coding guidelines: "Run gofmt -l . before committing to verify formatting. Fix any reported files with gofmt -w <file>."

🔧 Verification
#!/bin/bash
gofmt -l proxy/metrics_monitor.go proxy/proxymanager.go
# Show the diff gofmt would apply:
gofmt -d proxy/metrics_monitor.go
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_monitor.go` around lines 16 - 19, The file fails gofmt due to
import grouping/spacing and misaligned struct/map fields: reorder the imports in
metrics_monitor.go to standard grouping (stdlib, external, then internal) and
ensure the inserted config import
(github.com/mostlygeek/llama-swap/proxy/config) sits with other
external/internal imports consistent with the file's style; fix the tab
alignment of the config field in the metricsMonitor struct (symbol:
metricsMonitor) and the corresponding config: key in the NewMetricsMonitor (or
equivalent initializer) so field alignment matches mu, metrics, logger, etc.;
then run gofmt -w proxy/metrics_monitor.go and verify with gofmt -l . before
committing.

@efschu
Copy link
Copy Markdown
Author

efschu commented Apr 24, 2026

Closing in favor of #702 which supersedes this PR with a unified description covering all fixes.

@efschu efschu closed this Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant