Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 25 additions & 4 deletions proxy/metrics_monitor.go
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ import (
"time"

"github.com/gin-gonic/gin"
"github.com/mostlygeek/llama-swap/proxy/config"
"github.com/klauspost/compress/zstd"
"github.com/mostlygeek/llama-swap/event"
"github.com/tidwall/gjson"
Comment on lines +16 to 19
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix gofmt formatting failures.

CI's gofmt -l . step failed on this file. Two likely culprits in the changed lines:

  1. Import block ordering/spacing at lines 16–19 mixes external modules (config inserted between gin-gonic/gin and klauspost/compress/zstd) in a way that appears to disagree with the existing formatting of this file.
  2. The config field added to metricsMonitor at line 99 and the config: key at line 118 don't match the tab alignment used for the surrounding fields (mu, metrics, logger, etc.), which gofmt will re-flow.

Please run gofmt -w proxy/metrics_monitor.go (and double-check with gofmt -l .) before pushing.

As per coding guidelines: "Run gofmt -l . before committing to verify formatting. Fix any reported files with gofmt -w <file>."

🔧 Verification
#!/bin/bash
gofmt -l proxy/metrics_monitor.go proxy/proxymanager.go
# Show the diff gofmt would apply:
gofmt -d proxy/metrics_monitor.go
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_monitor.go` around lines 16 - 19, The file fails gofmt due to
import grouping/spacing and misaligned struct/map fields: reorder the imports in
metrics_monitor.go to standard grouping (stdlib, external, then internal) and
ensure the inserted config import
(github.com/mostlygeek/llama-swap/proxy/config) sits with other
external/internal imports consistent with the file's style; fix the tab
alignment of the config field in the metricsMonitor struct (symbol:
metricsMonitor) and the corresponding config: key in the NewMetricsMonitor (or
equivalent initializer) so field alignment matches mu, metrics, logger, etc.;
then run gofmt -w proxy/metrics_monitor.go and verify with gofmt -l . before
committing.

Expand Down Expand Up @@ -95,6 +96,7 @@ func (e TokenMetricsEvent) Type() uint32 {

// metricsMonitor parses llama-server output for token statistics
type metricsMonitor struct {
config config.Config
mu sync.RWMutex
metrics []TokenMetrics
maxMetrics int
Expand All @@ -111,8 +113,9 @@ type metricsMonitor struct {

// newMetricsMonitor creates a new metricsMonitor. captureBufferMB is the
// capture buffer size in megabytes; 0 disables captures.
func newMetricsMonitor(logger *LogMonitor, maxMetrics int, captureBufferMB int) *metricsMonitor {
func newMetricsMonitor(cfg config.Config, logger *LogMonitor, maxMetrics int, captureBufferMB int) *metricsMonitor {
return &metricsMonitor{
config: cfg,
logger: logger,
maxMetrics: maxMetrics,
enableCaptures: captureBufferMB > 0,
Expand All @@ -130,6 +133,10 @@ func (mp *metricsMonitor) addMetrics(metric TokenMetrics) int {
defer mp.mu.Unlock()

metric.ID = mp.nextID
// Resolve modelID to display name (first alias or modelID itself)
if modelConfig, exists := mp.config.Models[metric.Model]; exists && len(modelConfig.Aliases) > 0 {
metric.Model = modelConfig.Aliases[0]
}
mp.nextID++
mp.metrics = append(mp.metrics, metric)
if len(mp.metrics) > mp.maxMetrics {
Expand Down Expand Up @@ -271,6 +278,9 @@ func (mp *metricsMonitor) wrapHandler(
request.Header.Set("Accept-Encoding", filterAcceptEncoding(ae))
}

// Capture wall clock time before proxying the request
requestStart := time.Now()

if err := next(modelID, recorder, request); err != nil {
return err
}
Expand All @@ -287,7 +297,7 @@ func (mp *metricsMonitor) wrapHandler(
tm := TokenMetrics{
Timestamp: time.Now(),
Model: modelID,
DurationMs: int(time.Since(recorder.StartTime()).Milliseconds()),
DurationMs: int(time.Since(requestStart).Milliseconds()),
}

body := recorder.body.Bytes()
Expand All @@ -308,7 +318,7 @@ func (mp *metricsMonitor) wrapHandler(
}
}
if strings.Contains(recorder.Header().Get("Content-Type"), "text/event-stream") {
if parsed, err := processStreamingResponse(modelID, recorder.StartTime(), body); err != nil {
if parsed, err := processStreamingResponse(modelID, requestStart, body); err != nil {
mp.logger.Warnf("error processing streaming response: %v, path=%s, recording minimal metrics", err, request.URL.Path)
} else {
tm = parsed
Expand All @@ -328,7 +338,7 @@ func (mp *metricsMonitor) wrapHandler(
}

if usage.Exists() || timings.Exists() {
if parsedMetrics, err := parseMetrics(modelID, recorder.StartTime(), usage, timings); err != nil {
if parsedMetrics, err := parseMetrics(modelID, requestStart, usage, timings); err != nil {
mp.logger.Warnf("error parsing metrics: %v, path=%s, recording minimal metrics", err, request.URL.Path)
} else {
tm = parsedMetrics
Expand Down Expand Up @@ -481,6 +491,17 @@ func parseMetrics(modelID string, start time.Time, usage, timings gjson.Result)
}
}

// Fallback: estimate speeds from wall clock when timings unavailable (e.g., vLLM)
if !timings.Exists() && wallDurationMs > 0 {
durationSec := float64(wallDurationMs) / 1000.0
if inputTokens > 0 {
promptPerSecond = float64(inputTokens) / durationSec
}
if outputTokens > 0 {
tokensPerSecond = float64(outputTokens) / durationSec
}
}
Comment on lines +494 to +503
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fallback speeds conflate prompt eval with token generation.

When timings is absent, both rates are divided by the same wall-clock duration that encompasses prompt processing and decode. This produces:

  • promptPerSecond that is far lower than the true prompt-eval rate (prompt eval usually occupies a small fraction of the total request).
  • tokensPerSecond that is biased low by the prompt time (i.e. effectively outputTokens / (prompt_ms + decode_ms) rather than outputTokens / decode_ms).

Given the backend doesn't expose per-phase timings, this is an acceptable approximation — but worth making the caveat explicit so downstream dashboards don't treat these as directly comparable to llama.cpp's prompt_per_second / predicted_per_second. Consider expanding the comment (e.g., "approximate rates over total wall time; includes both prompt and decode phases") or emitting a separate flag/field to distinguish estimated vs. measured rates.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@proxy/metrics_monitor.go` around lines 487 - 496, The fallback calculation
for promptPerSecond and tokensPerSecond (when timings.Exists() is false) uses
wallDurationMs for both rates, conflating prompt evaluation time and decode
time; update the code around timings, wallDurationMs, promptPerSecond,
tokensPerSecond, inputTokens and outputTokens to make this explicit: either
expand the comment to state these are approximate rates computed over total wall
time (including both prompt and decode phases) or add a boolean/flag (e.g.,
estimatedRates or ratesAreApproximate) or separate field to indicate “estimated
vs measured” so downstream consumers know these values are biased and not
directly comparable to phase-specific metrics.


return TokenMetrics{
Timestamp: time.Now(),
Model: modelID,
Expand Down
Loading
Loading