
Stream loading state when swapping models #371

Merged
mostlygeek merged 9 commits into main from show-loading-issue-366
Oct 29, 2025

Conversation

@mostlygeek
Owner

@mostlygeek mostlygeek commented Oct 29, 2025

Swapping models can take a long time and leave a lot of silence while the model is loading. Rather than silently load the model in the background, this PR allows llama-swap to send status updates in the reasoning_content of a streaming chat response.
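Concretely, a client consuming the stream would see OpenAI-style chunks whose delta carries the status text. The choices/delta/reasoning_content shape matches the SSE message structure discussed in the review; the remark text here is illustrative:

```text
data: {"choices":[{"delta":{"reasoning_content":"llama-swap loading model: llama\n"}}]}

data: {"choices":[{"delta":{"reasoning_content":"Still warming things up...\n"}}]}
```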

Fixes #366

Summary by CodeRabbit

  • New Features

    • Global config option to enable/disable streaming of loading-state messages with per-model overrides.
    • Optional streaming of startup and in-request loading feedback (progress updates, status messages, playful remarks) when enabled; requests can opt into streaming or fall back to standard responses.
  • Tests

    • Added tests validating global and per-model loading-state settings and streaming behavior.
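A minimal configuration sketch of the global setting and per-model override (field names as added to config.example.yaml in this PR; the model name, cmd, and values are illustrative):

```yaml
# global default: stream loading-state messages for all models
sendLoadingState: true

models:
  llama:
    cmd: /path/to/llama-server --port ${PORT}
    # per-model override takes precedence over the global setting
    sendLoadingState: false
```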

@coderabbitai

coderabbitai bot commented Oct 29, 2025

Walkthrough

Adds a global sendLoadingState config option and an optional per-model sendLoadingState override; the loader defaults a per-model nil value to the global setting. Implements SSE-style streaming of loading/status messages while a model is starting, and stores the request-level stream flag and resolved model name in the request context.

Changes

  • Config example (config.example.yaml): Added top-level sendLoadingState and a per-model sendLoadingState example under models.llama.
  • Config structs & loader (proxy/config/config.go, proxy/config/model_config.go): Added SendLoadingState bool to the top-level Config and SendLoadingState *bool to ModelConfig; LoadConfigFromReader defaults a per-model nil to the global value.
  • Config tests (proxy/config/config_posix_test.go, proxy/config/config_windows_test.go, proxy/config/model_config_test.go): Updated tests to cover top-level and per-model SendLoadingState expectations and added TestConfig_ModelSendLoadingState validating defaulting and overrides.
  • Startup streaming implementation (proxy/process.go): Added a statusResponseWriter SSE-style writer and a background status-update loop that emits JSON SSE loading messages while a model is not Ready; integrates lifecycle/cancellation and flush semantics, and waits before proxying to the upstream.
  • Request context propagation (proxy/proxymanager.go): Added a proxyCtxKey type and logic to extract the "stream" flag and resolved model name from the normalized JSON body, storing both in the request context for downstream handlers.
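The pointer-based defaulting can be sketched as follows (resolveSendLoadingState is a hypothetical helper for illustration; the real logic lives inside LoadConfigFromReader):

```go
package main

import "fmt"

// resolveSendLoadingState illustrates the defaulting rule: a per-model
// *bool that is nil inherits the global value, while a non-nil pointer
// (whether true or false) overrides it.
func resolveSendLoadingState(perModel *bool, global bool) bool {
	if perModel == nil {
		return global
	}
	return *perModel
}

func main() {
	off := false
	fmt.Println(resolveSendLoadingState(nil, true))  // nil inherits global: true
	fmt.Println(resolveSendLoadingState(&off, true)) // explicit override wins: false
}
```

The pointer type is what makes "unset" distinguishable from "explicitly false", which is why the tests below check pointer semantics rather than plain booleans.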

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant ProxyHandler as Proxy Handler
    participant Proc as Process Manager
    participant SSE as statusResponseWriter
    participant Upstream as Upstream LLM

    Client->>ProxyHandler: POST ... (body includes "stream": true)
    ProxyHandler->>Proc: Check model Ready?
    alt Not Ready && config.SendLoadingState == true
        ProxyHandler->>SSE: create SSE writer & send headers
        activate SSE
        loop while Not Ready & not cancelled
            SSE->>Client: SSE message (loading state / remark)
            SSE-->>SSE: wait / tick
        end
        deactivate SSE
    end
    ProxyHandler->>Upstream: forward request to upstream (reverse proxy)
    Upstream->>Client: final streamed response

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review focus:
    • proxy/process.go: SSE writer correctness, header/flush behavior, goroutine lifecycle and timeout handling.
    • proxy/config/config.go: defaulting logic for per-model SendLoadingState pointer.
    • proxy/proxymanager.go: JSON extraction and context key usage.
    • Tests: pointer semantics and expectations for per-model overrides.


Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve it.
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: "Stream loading state when swapping models" concisely and accurately summarizes the primary change.
  • Linked Issues Check — ✅ Passed: the changes address linked issue #366. A configurable sendLoadingState feature injects loading messages into streaming chat responses via a new statusResponseWriter that emits JSON-wrapped SSE messages during model loading, with per-model overrides for fine-grained control, eliminating the "dead air" described in the issue. The writer manages the response stream, coordinates its lifecycle with context cancellation, and delegates to the upstream proxy after streaming completes.
  • Out of Scope Changes Check — ✅ Passed: all changes relate to issue #366 — configuration support (config.example.yaml, config.go, model_config.go), test coverage, the core streaming implementation in proxy/process.go, and context propagation in proxy/proxymanager.go.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6791b1b and 467480d.

📒 Files selected for processing (2)
  • proxy/config/config.go (2 hunks)
  • proxy/process.go (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • proxy/config/config.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Fix all staticcheck-reported issues in Go code

Files:

  • proxy/process.go
🧠 Learnings (1)
📚 Learning: 2025-10-29T05:26:34.956Z
Learnt from: mostlygeek
PR: mostlygeek/llama-swap#371
File: proxy/process.go:0-0
Timestamp: 2025-10-29T05:26:34.956Z
Learning: In proxy/process.go, the loading message "llama-swap loading model: {name}" intentionally uses p.ID (Process.ID) rather than the realModelName from the request context. This is the correct design choice.

Applied to files:

  • proxy/process.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (1)
proxy/process.go (1)

773-773: Verify the random duration range.

Line 773 uses 5+rand.Intn(5) which produces a range of 5-9 seconds. If you intended 5-10 seconds, change to 5+rand.Intn(6).

Based on learnings




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
proxy/proxymanager.go (1)

504-508: Typo in error message

“could not ready request body” → “could not read request body”.

- pm.sendErrorResponse(c, http.StatusBadRequest, "could not ready request body")
+ pm.sendErrorResponse(c, http.StatusBadRequest, "could not read request body")
🧹 Nitpick comments (5)
proxy/config/model_config.go (1)

39-41: YAML: consider omitting nulls for cleaner round‑trips

Since SendLoadingState is a pointer, add omitempty to avoid writing sendLoadingState: null on marshal when unset.

Apply:

- SendLoadingState *bool `yaml:"sendLoadingState"`
+ SendLoadingState *bool `yaml:"sendLoadingState,omitempty"`
config.example.yaml (1)

38-45: Clarify field name to match API (“reasoning_content”)

Docs say “reasoning (thinking) field”. If streaming goes into reasoning_content, name it explicitly to avoid confusion.

-# sendLoadingState: inject loading status updates into the reasoning (thinking)
-# field
+# sendLoadingState: inject loading status updates into the reasoning_content field

Confirm the exact response field your stream writer uses and align terminology here.

proxy/process.go (3)

496-496: Rename variable for clarity.

The variable name swapCtx is misleading. This context controls the loading status streaming goroutine, not model swapping.

Apply this diff:

-	swapCtx, cancelLoadCtx := context.WithCancel(r.Context())
+	loadingCtx, cancelLoadCtx := context.WithCancel(r.Context())

And update line 505:

-		go srw.statusUpdates(swapCtx)
+		go srw.statusUpdates(loadingCtx)

541-551: Consider simplifying the nested conditional logic.

The proxy handling logic could be flattened for better readability.

Apply this diff:

-	//
-	if srw != nil && p.reverseProxy != nil {
-		// wait for it to complete
-		<-srw.complete.Done()
-		p.reverseProxy.ServeHTTP(srw, r)
-	} else {
-		if p.reverseProxy != nil {
-			p.reverseProxy.ServeHTTP(w, r)
-		} else {
-			http.Error(w, fmt.Sprintf("No reverse proxy available for %s", p.ID), http.StatusInternalServerError)
-		}
+	if p.reverseProxy == nil {
+		http.Error(w, fmt.Sprintf("No reverse proxy available for %s", p.ID), http.StatusInternalServerError)
+		return
+	}
+
+	if srw != nil {
+		// wait for it to complete
+		<-srw.complete.Done()
+		p.reverseProxy.ServeHTTP(srw, r)
+	} else {
+		p.reverseProxy.ServeHTTP(w, r)
 	}

781-789: Consider defining SSE message structures at package level.

The inline struct definitions for the SSE message format could be moved to package level for better reusability and discoverability.

Define these at package level (after the loadingRemarks variable):

type sseMessageDelta struct {
	ReasoningContent string `json:"reasoning_content"`
}

type sseMessageChoice struct {
	Delta sseMessageDelta `json:"delta"`
}

type sseMessage struct {
	Choices []sseMessageChoice `json:"choices"`
}

Then simplify the sendData method:

 func (s *statusResponseWriter) sendData(data string) {
-	// Create the proper SSE JSON structure
-	type Delta struct {
-		ReasoningContent string `json:"reasoning_content"`
-	}
-	type Choice struct {
-		Delta Delta `json:"delta"`
-	}
-	type SSEMessage struct {
-		Choices []Choice `json:"choices"`
-	}
-
-	msg := SSEMessage{
-		Choices: []Choice{
+	msg := sseMessage{
+		Choices: []sseMessageChoice{
 			{
-				Delta: Delta{
+				Delta: sseMessageDelta{
 					ReasoningContent: data,
 				},
 			},
 		},
 	}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f852689 and 39f0b02.

📒 Files selected for processing (8)
  • config.example.yaml (2 hunks)
  • proxy/config/config.go (2 hunks)
  • proxy/config/config_posix_test.go (2 hunks)
  • proxy/config/config_windows_test.go (1 hunks)
  • proxy/config/model_config.go (1 hunks)
  • proxy/config/model_config_test.go (1 hunks)
  • proxy/process.go (4 hunks)
  • proxy/proxymanager.go (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Fix all staticcheck-reported issues in Go code

Files:

  • proxy/config/model_config.go
  • proxy/config/config_windows_test.go
  • proxy/config/config_posix_test.go
  • proxy/proxymanager.go
  • proxy/process.go
  • proxy/config/config.go
  • proxy/config/model_config_test.go
🧬 Code graph analysis (3)
proxy/config/config_windows_test.go (2)
proxy/config/config.go (2)
  • Config (112-135)
  • MacroList (24-24)
proxy/config/model_config.go (1)
  • ModelConfig (10-41)
proxy/config/config_posix_test.go (1)
proxy/config/model_config.go (1)
  • ModelConfig (10-41)
proxy/config/model_config_test.go (1)
proxy/config/config.go (1)
  • LoadConfigFromReader (164-402)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-tests
  • GitHub Check: run-tests
🔇 Additional comments (11)
proxy/config/config.go (1)

133-135: Config flag addition looks good

Field placement and YAML tag are consistent with existing config style.

proxy/config/model_config_test.go (1)

55-74: Good coverage of global vs per‑model defaulting

Test asserts both override (false) and inherited (true) cases; aligns with config loader behavior.

config.example.yaml (1)

195-198: Per‑model override example looks good

Example override is clear and consistent with loader behavior.

proxy/config/config_windows_test.go (2)

155-156: Non‑nil per‑model pointer setup in expectations is appropriate

Matches loader’s behavior of setting per‑model pointers.


163-201: Windows expectations updated correctly

Top‑level and per‑model SendLoadingState expectations align with new config behavior.

proxy/config/config_posix_test.go (2)

163-164: Non‑nil per‑model pointer setup in expectations is appropriate

Consistent with loader defaulting approach.


176-211: POSIX expectations updated correctly

Top‑level and per‑model SendLoadingState expectations align with new behavior.

proxy/proxymanager.go (1)

560-565: The suggested fix is incomplete and cannot be applied as written.

The review references ctxKeyStreaming and ctxKeyModel which do not exist in the codebase. Additionally, the diff only shows updating the context.WithValue() calls in proxy/proxymanager.go (lines 562-563), but misses the corresponding context.Value() retrieval in proxy/process.go:502 that reads the streaming value.

If this diff is applied without:

  1. First defining ctxKeyStreaming and ctxKeyModel struct-type constants, and
  2. Updating the retrieval call at proxy/process.go:502 to use the new keys

the code will fail at runtime because the keys won't match between where values are stored and where they're retrieved.

proxy/process.go (3)

5-5: LGTM!

The new imports are appropriate: encoding/json for SSE message serialization and math/rand for non-cryptographic randomness in loading message selection and timing.

Also applies to: 8-8


637-698: LGTM!

The loading remarks provide good user experience during model loading with appropriate humor and variety.


814-836: LGTM!

The helper methods properly delegate to the underlying writer with appropriate safeguards:

  • WriteHeader prevents duplicate writes with the hasWritten flag
  • Flush safely checks if the underlying writer supports flushing


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
proxy/process.go (4)

715-732: Pass realModelName from request context to display correct model name.

Line 730 displays p.ID but should show the user-facing model name stored in the request context at proxyCtxKey("model") (consistent with the streaming flag extraction pattern on line 502).

The past review comment suggested modifying newStatusResponseWriter to accept realModelName or the request context. Apply a similar fix here:

-func newStatusResponseWriter(p *Process, w http.ResponseWriter) *statusResponseWriter {
+func newStatusResponseWriter(p *Process, w http.ResponseWriter, modelName string) *statusResponseWriter {
 	c, cx := context.WithCancel(context.Background())
 	s := &statusResponseWriter{
 		writer:   w,
 		process:  p,
 		complete: c,
 		cancel:   cx,
 	}
 
 	s.start = time.Now()
 	s.Header().Set("Content-Type", "text/event-stream") // SSE
 	s.Header().Set("Cache-Control", "no-cache")         // no-cache
 	s.Header().Set("Connection", "keep-alive")          // keep-alive
 	s.WriteHeader(http.StatusOK)                        // send status code 200
 	s.sendLine("━━━━━")
-	s.sendLine(fmt.Sprintf("llama-swap loading model: %s", p.ID))
+	s.sendLine(fmt.Sprintf("llama-swap loading model: %s", modelName))
 	return s
 }

Then update the caller on line 504:

+	realModelName, _ := r.Context().Value(proxyCtxKey("model")).(string)
 	isStreaming, _ := r.Context().Value(proxyCtxKey("streaming")).(bool)
 	if p.config.SendLoadingState != nil && *p.config.SendLoadingState && isStreaming {
-		srw = newStatusResponseWriter(p, w)
+		srw = newStatusResponseWriter(p, w, realModelName)
 		go srw.statusUpdates(swapCtx)

752-754: Correct the random duration range.

The comment states "5-10 seconds" but 2+rand.Intn(4) produces 2-5 seconds.

Apply this diff:

 	// Pick a random duration between 5-10 seconds for when to send a remark
-	nextRemarkIn := time.Duration(2+rand.Intn(4)) * time.Second
+	nextRemarkIn := time.Duration(5+rand.Intn(6)) * time.Second

773-774: Correct the random duration range.

The comment says "5-10 seconds" but 5+rand.Intn(10) produces 5-14 seconds.

Apply this diff:

 		// Pick a new random duration for the next remark
-		nextRemarkIn = time.Duration(5+rand.Intn(10)) * time.Second
+		nextRemarkIn = time.Duration(5+rand.Intn(6)) * time.Second

815-818: Check errors from Write calls.

The three consecutive Write calls don't check for errors, which could mask write failures and result in incomplete SSE messages being sent to clients.

Apply this diff:

 	// Write SSE formatted data
-	s.writer.Write([]byte("data: "))
-	s.writer.Write(jsonData)
-	s.writer.Write([]byte("\n\n"))
+	if _, err := s.writer.Write([]byte("data: ")); err != nil {
+		s.process.proxyLogger.Errorf("<%s> Failed to write SSE prefix: %v", s.process.ID, err)
+		return
+	}
+	if _, err := s.writer.Write(jsonData); err != nil {
+		s.process.proxyLogger.Errorf("<%s> Failed to write SSE data: %v", s.process.ID, err)
+		return
+	}
+	if _, err := s.writer.Write([]byte("\n\n")); err != nil {
+		s.process.proxyLogger.Errorf("<%s> Failed to write SSE suffix: %v", s.process.ID, err)
+		return
+	}
 	s.Flush()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39f0b02 and 6791b1b.

📒 Files selected for processing (1)
  • proxy/process.go (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Fix all staticcheck-reported issues in Go code

Files:

  • proxy/process.go
🧬 Code graph analysis (1)
proxy/process.go (3)
proxy/proxymanager_api.go (1)
  • pm (101-198)
proxy/proxymanager_test.go (2)
  • TestProxyManager_StreamingEndpointsReturnNoBufferingHeader (1013-1060)
  • TestProxyManager_ProxiedStreamingEndpointReturnsNoBufferingHeader (1062-1085)
proxy/metrics_monitor_test.go (1)
  • TestMetricsMonitor_StreamingResponse (566-651)
🔇 Additional comments (6)
proxy/process.go (6)

5-5: LGTM: Appropriate imports for streaming functionality.

The encoding/json and math/rand imports are correctly added to support SSE message formatting and randomized loading remarks.

Also applies to: 8-8


493-509: LGTM: Context lifecycle properly managed.

The context handling correctly ensures cancelLoadCtx is called in all paths (error at line 513, or after process start at line 526). The goroutine receives cancellation signals via swapCtx.


511-521: LGTM: Streaming-aware error handling.

The error handling correctly distinguishes between streaming (line 516, sends error via SSE) and non-streaming (line 518, HTTP error) responses. Context is properly cancelled on line 513.


643-704: LGTM: Entertaining loading messages enhance UX.

The loading remarks provide user feedback during potentially long model load times, addressing the "dead air" concern from issue #366.


756-757: Excellent: Ticker resource leak fixed!

The defer ticker.Stop() on line 757 correctly addresses the resource leak flagged in the previous review.


821-843: LGTM: Interface methods correctly delegate to underlying writer.

The WriteHeader guard (line 830) prevents double writes, and the Flush method properly checks for the http.Flusher interface. These methods correctly implement the http.ResponseWriter interface pattern.

@mostlygeek mostlygeek merged commit a89b803 into main Oct 29, 2025
3 checks passed
@mostlygeek mostlygeek deleted the show-loading-issue-366 branch October 29, 2025 07:09
0uep pushed a commit to lynxai-team/llama-swap that referenced this pull request Nov 21, 2025
Swapping models can take a long time and leave a lot of silence while the model is loading. Rather than silently load the model in the background, this PR allows llama-swap to send status updates in the reasoning_content of a streaming chat response.

Fixes: mostlygeek#366
@coderabbitai coderabbitai bot mentioned this pull request Jan 31, 2026
rohitpaul pushed a commit to rohitpaul/llama-swap that referenced this pull request Mar 29, 2026
Swapping models can take a long time and leave a lot of silence while the model is loading. Rather than silently load the model in the background, this PR allows llama-swap to send status updates in the reasoning_content of a streaming chat response.

Fixes: mostlygeek#366


Development

Successfully merging this pull request may close these issues.

feature request - use reasoning_content to show loading state
