
Stream loading state when swapping models #371

Merged
mostlygeek merged 9 commits into main from show-loading-issue-366
Oct 29, 2025

Conversation

@mostlygeek
Owner

@mostlygeek mostlygeek commented Oct 29, 2025

Swapping models can take a long time and leave a lot of silence while the model is loading. Rather than silently load the model in the background, this PR allows llama-swap to send status updates in the reasoning_content of a streaming chat response.
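Concretely, a client consuming the stream would see OpenAI-style chunks whose delta carries the status text. The choices/delta/reasoning_content shape matches the SSE message structure discussed in the review; the remark text here is illustrative:

```text
data: {"choices":[{"delta":{"reasoning_content":"llama-swap loading model: llama\n"}}]}

data: {"choices":[{"delta":{"reasoning_content":"Still warming things up...\n"}}]}
```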

Fixes #366

Summary by CodeRabbit

  • New Features

    • Global config option to enable/disable streaming of loading-state messages with per-model overrides.
    • Optional streaming of startup and in-request loading feedback (progress updates, status messages, playful remarks) when enabled; requests can opt into streaming or fall back to standard responses.
  • Tests

    • Added tests validating global and per-model loading-state settings and streaming behavior.
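A minimal configuration sketch of the global setting and per-model override (field names as added to config.example.yaml in this PR; the model name, cmd, and values are illustrative):

```yaml
# global default: stream loading-state messages for all models
sendLoadingState: true

models:
  llama:
    cmd: /path/to/llama-server --port ${PORT}
    # per-model override takes precedence over the global setting
    sendLoadingState: false
```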

@coderabbitai

coderabbitai bot commented Oct 29, 2025

Walkthrough

Adds a global sendLoadingState config option and an optional per-model sendLoadingState override; the loader defaults a per-model nil value to the global setting. Implements SSE-style streaming of loading/status messages while a model is starting, and stores the request-level stream flag and resolved model name in the request context.

Changes

  • Config example (config.example.yaml): Added top-level sendLoadingState and a per-model sendLoadingState example under models.llama.
  • Config structs & loader (proxy/config/config.go, proxy/config/model_config.go): Added SendLoadingState bool to the top-level Config and SendLoadingState *bool to ModelConfig; LoadConfigFromReader defaults a per-model nil to the global value.
  • Config tests (proxy/config/config_posix_test.go, proxy/config/config_windows_test.go, proxy/config/model_config_test.go): Updated tests to cover top-level and per-model SendLoadingState expectations and added TestConfig_ModelSendLoadingState validating defaulting and overrides.
  • Startup streaming implementation (proxy/process.go): Added a statusResponseWriter SSE-style writer and a background status-update loop that emits JSON SSE loading messages while a model is not Ready; integrates lifecycle/cancellation and flush semantics, and waits before proxying to the upstream.
  • Request context propagation (proxy/proxymanager.go): Added a proxyCtxKey type and logic to extract the "stream" flag and resolved model name from the normalized JSON body, storing both in the request context for downstream handlers.
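The pointer-based defaulting can be sketched as follows (resolveSendLoadingState is a hypothetical helper for illustration; the real logic lives inside LoadConfigFromReader):

```go
package main

import "fmt"

// resolveSendLoadingState illustrates the defaulting rule: a per-model
// *bool that is nil inherits the global value, while a non-nil pointer
// (whether true or false) overrides it.
func resolveSendLoadingState(perModel *bool, global bool) bool {
	if perModel == nil {
		return global
	}
	return *perModel
}

func main() {
	off := false
	fmt.Println(resolveSendLoadingState(nil, true))  // nil inherits global: true
	fmt.Println(resolveSendLoadingState(&off, true)) // explicit override wins: false
}
```

The pointer type is what makes "unset" distinguishable from "explicitly false", which is why the tests below check pointer semantics rather than plain booleans.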

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant ProxyHandler as Proxy Handler
    participant Proc as Process Manager
    participant SSE as statusResponseWriter
    participant Upstream as Upstream LLM

    Client->>ProxyHandler: POST ... (body includes "stream": true)
    ProxyHandler->>Proc: Check model Ready?
    alt Not Ready && config.SendLoadingState == true
        ProxyHandler->>SSE: create SSE writer & send headers
        activate SSE
        loop while Not Ready & not cancelled
            SSE->>Client: SSE message (loading state / remark)
            SSE-->>SSE: wait / tick
        end
        deactivate SSE
    end
    ProxyHandler->>Upstream: forward request to upstream (reverse proxy)
    Upstream->>Client: final streamed response

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Review focus:
    • proxy/process.go: SSE writer correctness, header/flush behavior, goroutine lifecycle and timeout handling.
    • proxy/config/config.go: defaulting logic for per-model SendLoadingState pointer.
    • proxy/proxymanager.go: JSON extraction and context key usage.
    • Tests: pointer semantics and expectations for per-model overrides.


Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve it.
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: "Stream loading state when swapping models" concisely and accurately summarizes the primary change.
  • Linked Issues Check — ✅ Passed: the changes address linked issue #366. A configurable sendLoadingState feature injects loading messages into streaming chat responses via a new statusResponseWriter that emits JSON-wrapped SSE messages during model loading, with per-model overrides for fine-grained control, eliminating the "dead air" described in the issue. The writer manages the response stream, coordinates its lifecycle with context cancellation, and delegates to the upstream proxy after streaming completes.
  • Out of Scope Changes Check — ✅ Passed: all changes relate to issue #366 — configuration support (config.example.yaml, config.go, model_config.go), test coverage, the core streaming implementation in proxy/process.go, and context propagation in proxy/proxymanager.go.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6791b1b and 467480d.

📒 Files selected for processing (2)
  • proxy/config/config.go (2 hunks)
  • proxy/process.go (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • proxy/config/config.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Fix all staticcheck-reported issues in Go code

Files:

  • proxy/process.go
🧠 Learnings (1)
📚 Learning: 2025-10-29T05:26:34.956Z
Learnt from: mostlygeek
PR: mostlygeek/llama-swap#371
File: proxy/process.go:0-0
Timestamp: 2025-10-29T05:26:34.956Z
Learning: In proxy/process.go, the loading message "llama-swap loading model: {name}" intentionally uses p.ID (Process.ID) rather than the realModelName from the request context. This is the correct design choice.

Applied to files:

  • proxy/process.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (1)
proxy/process.go (1)

773-773: Verify the random duration range.

Line 773 uses 5+rand.Intn(5) which produces a range of 5-9 seconds. If you intended 5-10 seconds, change to 5+rand.Intn(6).

Based on learnings




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
proxy/proxymanager.go (1)

504-508: Typo in error message

“could not ready request body” → “could not read request body”.

- pm.sendErrorResponse(c, http.StatusBadRequest, "could not ready request body")
+ pm.sendErrorResponse(c, http.StatusBadRequest, "could not read request body")
🧹 Nitpick comments (5)
proxy/config/model_config.go (1)

39-41: YAML: consider omitting nulls for cleaner round‑trips

Since SendLoadingState is a pointer, add omitempty to avoid writing sendLoadingState: null on marshal when unset.

Apply:

- SendLoadingState *bool `yaml:"sendLoadingState"`
+ SendLoadingState *bool `yaml:"sendLoadingState,omitempty"`
config.example.yaml (1)

38-45: Clarify field name to match API (“reasoning_content”)

Docs say “reasoning (thinking) field”. If streaming goes into reasoning_content, name it explicitly to avoid confusion.

-# sendLoadingState: inject loading status updates into the reasoning (thinking)
-# field
+# sendLoadingState: inject loading status updates into the reasoning_content field

Confirm the exact response field your stream writer uses and align terminology here.

proxy/process.go (3)

496-496: Rename variable for clarity.

The variable name swapCtx is misleading. This context controls the loading status streaming goroutine, not model swapping.

Apply this diff:

-	swapCtx, cancelLoadCtx := context.WithCancel(r.Context())
+	loadingCtx, cancelLoadCtx := context.WithCancel(r.Context())

And update line 505:

-		go srw.statusUpdates(swapCtx)
+		go srw.statusUpdates(loadingCtx)

541-551: Consider simplifying the nested conditional logic.

The proxy handling logic could be flattened for better readability.

Apply this diff:

-	//
-	if srw != nil && p.reverseProxy != nil {
-		// wait for it to complete
-		<-srw.complete.Done()
-		p.reverseProxy.ServeHTTP(srw, r)
-	} else {
-		if p.reverseProxy != nil {
-			p.reverseProxy.ServeHTTP(w, r)
-		} else {
-			http.Error(w, fmt.Sprintf("No reverse proxy available for %s", p.ID), http.StatusInternalServerError)
-		}
+	if p.reverseProxy == nil {
+		http.Error(w, fmt.Sprintf("No reverse proxy available for %s", p.ID), http.StatusInternalServerError)
+		return
+	}
+
+	if srw != nil {
+		// wait for it to complete
+		<-srw.complete.Done()
+		p.reverseProxy.ServeHTTP(srw, r)
+	} else {
+		p.reverseProxy.ServeHTTP(w, r)
 	}

781-789: Consider defining SSE message structures at package level.

The inline struct definitions for the SSE message format could be moved to package level for better reusability and discoverability.

Define these at package level (after the loadingRemarks variable):

type sseMessageDelta struct {
	ReasoningContent string `json:"reasoning_content"`
}

type sseMessageChoice struct {
	Delta sseMessageDelta `json:"delta"`
}

type sseMessage struct {
	Choices []sseMessageChoice `json:"choices"`
}

Then simplify the sendData method:

 func (s *statusResponseWriter) sendData(data string) {
-	// Create the proper SSE JSON structure
-	type Delta struct {
-		ReasoningContent string `json:"reasoning_content"`
-	}
-	type Choice struct {
-		Delta Delta `json:"delta"`
-	}
-	type SSEMessage struct {
-		Choices []Choice `json:"choices"`
-	}
-
-	msg := SSEMessage{
-		Choices: []Choice{
+	msg := sseMessage{
+		Choices: []sseMessageChoice{
 			{
-				Delta: Delta{
+				Delta: sseMessageDelta{
 					ReasoningContent: data,
 				},
 			},
 		},
 	}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f852689 and 39f0b02.

📒 Files selected for processing (8)
  • config.example.yaml (2 hunks)
  • proxy/config/config.go (2 hunks)
  • proxy/config/config_posix_test.go (2 hunks)
  • proxy/config/config_windows_test.go (1 hunks)
  • proxy/config/model_config.go (1 hunks)
  • proxy/config/model_config_test.go (1 hunks)
  • proxy/process.go (4 hunks)
  • proxy/proxymanager.go (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Fix all staticcheck-reported issues in Go code

Files:

  • proxy/config/model_config.go
  • proxy/config/config_windows_test.go
  • proxy/config/config_posix_test.go
  • proxy/proxymanager.go
  • proxy/process.go
  • proxy/config/config.go
  • proxy/config/model_config_test.go
🧬 Code graph analysis (3)
proxy/config/config_windows_test.go (2)
proxy/config/config.go (2)
  • Config (112-135)
  • MacroList (24-24)
proxy/config/model_config.go (1)
  • ModelConfig (10-41)
proxy/config/config_posix_test.go (1)
proxy/config/model_config.go (1)
  • ModelConfig (10-41)
proxy/config/model_config_test.go (1)
proxy/config/config.go (1)
  • LoadConfigFromReader (164-402)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run-tests
  • GitHub Check: run-tests
🔇 Additional comments (11)
proxy/config/config.go (1)

133-135: Config flag addition looks good

Field placement and YAML tag are consistent with existing config style.

proxy/config/model_config_test.go (1)

55-74: Good coverage of global vs per‑model defaulting

Test asserts both override (false) and inherited (true) cases; aligns with config loader behavior.

config.example.yaml (1)

195-198: Per‑model override example looks good

Example override is clear and consistent with loader behavior.

proxy/config/config_windows_test.go (2)

155-156: Non‑nil per‑model pointer setup in expectations is appropriate

Matches loader’s behavior of setting per‑model pointers.


163-201: Windows expectations updated correctly

Top‑level and per‑model SendLoadingState expectations align with new config behavior.

proxy/config/config_posix_test.go (2)

163-164: Non‑nil per‑model pointer setup in expectations is appropriate

Consistent with loader defaulting approach.


176-211: POSIX expectations updated correctly

Top‑level and per‑model SendLoadingState expectations align with new behavior.

proxy/proxymanager.go (1)

560-565: The suggested fix is incomplete and cannot be applied as written.

The review references ctxKeyStreaming and ctxKeyModel which do not exist in the codebase. Additionally, the diff only shows updating the context.WithValue() calls in proxy/proxymanager.go (lines 562-563), but misses the corresponding context.Value() retrieval in proxy/process.go:502 that reads the streaming value.

If this diff is applied without:

  1. First defining ctxKeyStreaming and ctxKeyModel struct-type constants, and
  2. Updating the retrieval call at proxy/process.go:502 to use the new keys

the code will fail at runtime because the keys won't match between where values are stored and where they're retrieved.

proxy/process.go (3)

5-5: LGTM!

The new imports are appropriate: encoding/json for SSE message serialization and math/rand for non-cryptographic randomness in loading message selection and timing.

Also applies to: 8-8


637-698: LGTM!

The loading remarks provide good user experience during model loading with appropriate humor and variety.


814-836: LGTM!

The helper methods properly delegate to the underlying writer with appropriate safeguards:

  • WriteHeader prevents duplicate writes with the hasWritten flag
  • Flush safely checks if the underlying writer supports flushing


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (4)
proxy/process.go (4)

715-732: Pass realModelName from request context to display correct model name.

Line 730 displays p.ID but should show the user-facing model name stored in the request context at proxyCtxKey("model") (consistent with the streaming flag extraction pattern on line 502).

The past review comment suggested modifying newStatusResponseWriter to accept realModelName or the request context. Apply a similar fix here:

-func newStatusResponseWriter(p *Process, w http.ResponseWriter) *statusResponseWriter {
+func newStatusResponseWriter(p *Process, w http.ResponseWriter, modelName string) *statusResponseWriter {
 	c, cx := context.WithCancel(context.Background())
 	s := &statusResponseWriter{
 		writer:   w,
 		process:  p,
 		complete: c,
 		cancel:   cx,
 	}
 
 	s.start = time.Now()
 	s.Header().Set("Content-Type", "text/event-stream") // SSE
 	s.Header().Set("Cache-Control", "no-cache")         // no-cache
 	s.Header().Set("Connection", "keep-alive")          // keep-alive
 	s.WriteHeader(http.StatusOK)                        // send status code 200
 	s.sendLine("━━━━━")
-	s.sendLine(fmt.Sprintf("llama-swap loading model: %s", p.ID))
+	s.sendLine(fmt.Sprintf("llama-swap loading model: %s", modelName))
 	return s
 }

Then update the caller on line 504:

+	realModelName, _ := r.Context().Value(proxyCtxKey("model")).(string)
 	isStreaming, _ := r.Context().Value(proxyCtxKey("streaming")).(bool)
 	if p.config.SendLoadingState != nil && *p.config.SendLoadingState && isStreaming {
-		srw = newStatusResponseWriter(p, w)
+		srw = newStatusResponseWriter(p, w, realModelName)
 		go srw.statusUpdates(swapCtx)

752-754: Correct the random duration range.

The comment states "5-10 seconds" but 2+rand.Intn(4) produces 2-5 seconds.

Apply this diff:

 	// Pick a random duration between 5-10 seconds for when to send a remark
-	nextRemarkIn := time.Duration(2+rand.Intn(4)) * time.Second
+	nextRemarkIn := time.Duration(5+rand.Intn(6)) * time.Second

773-774: Correct the random duration range.

The comment says "5-10 seconds" but 5+rand.Intn(10) produces 5-14 seconds.

Apply this diff:

 		// Pick a new random duration for the next remark
-		nextRemarkIn = time.Duration(5+rand.Intn(10)) * time.Second
+		nextRemarkIn = time.Duration(5+rand.Intn(6)) * time.Second

815-818: Check errors from Write calls.

The three consecutive Write calls don't check for errors, which could mask write failures and result in incomplete SSE messages being sent to clients.

Apply this diff:

 	// Write SSE formatted data
-	s.writer.Write([]byte("data: "))
-	s.writer.Write(jsonData)
-	s.writer.Write([]byte("\n\n"))
+	if _, err := s.writer.Write([]byte("data: ")); err != nil {
+		s.process.proxyLogger.Errorf("<%s> Failed to write SSE prefix: %v", s.process.ID, err)
+		return
+	}
+	if _, err := s.writer.Write(jsonData); err != nil {
+		s.process.proxyLogger.Errorf("<%s> Failed to write SSE data: %v", s.process.ID, err)
+		return
+	}
+	if _, err := s.writer.Write([]byte("\n\n")); err != nil {
+		s.process.proxyLogger.Errorf("<%s> Failed to write SSE suffix: %v", s.process.ID, err)
+		return
+	}
 	s.Flush()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 39f0b02 and 6791b1b.

📒 Files selected for processing (1)
  • proxy/process.go (4 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

Fix all staticcheck-reported issues in Go code

Files:

  • proxy/process.go
🧬 Code graph analysis (1)
proxy/process.go (3)
proxy/proxymanager_api.go (1)
  • pm (101-198)
proxy/proxymanager_test.go (2)
  • TestProxyManager_StreamingEndpointsReturnNoBufferingHeader (1013-1060)
  • TestProxyManager_ProxiedStreamingEndpointReturnsNoBufferingHeader (1062-1085)
proxy/metrics_monitor_test.go (1)
  • TestMetricsMonitor_StreamingResponse (566-651)
🔇 Additional comments (6)
proxy/process.go (6)

5-5: LGTM: Appropriate imports for streaming functionality.

The encoding/json and math/rand imports are correctly added to support SSE message formatting and randomized loading remarks.

Also applies to: 8-8


493-509: LGTM: Context lifecycle properly managed.

The context handling correctly ensures cancelLoadCtx is called in all paths (error at line 513, or after process start at line 526). The goroutine receives cancellation signals via swapCtx.


511-521: LGTM: Streaming-aware error handling.

The error handling correctly distinguishes between streaming (line 516, sends error via SSE) and non-streaming (line 518, HTTP error) responses. Context is properly cancelled on line 513.


643-704: LGTM: Entertaining loading messages enhance UX.

The loading remarks provide user feedback during potentially long model load times, addressing the "dead air" concern from issue #366.


756-757: Excellent: Ticker resource leak fixed!

The defer ticker.Stop() on line 757 correctly addresses the resource leak flagged in the previous review.


821-843: LGTM: Interface methods correctly delegate to underlying writer.

The WriteHeader guard (line 830) prevents double writes, and the Flush method properly checks for the http.Flusher interface. These methods correctly implement the http.ResponseWriter interface pattern.

@mostlygeek mostlygeek merged commit a89b803 into main Oct 29, 2025
3 checks passed
@mostlygeek mostlygeek deleted the show-loading-issue-366 branch October 29, 2025 07:09
0uep pushed a commit to lynxai-team/llama-swap that referenced this pull request Nov 21, 2025
Swapping models can take a long time and leave a lot of silence while the model is loading. Rather than silently load the model in the background, this PR allows llama-swap to send status updates in the reasoning_content of a streaming chat response.

Fixes: mostlygeek#366
@coderabbitai coderabbitai bot mentioned this pull request Jan 31, 2026
rohitpaul pushed a commit to rohitpaul/llama-swap that referenced this pull request Mar 29, 2026
Swapping models can take a long time and leave a lot of silence while the model is loading. Rather than silently load the model in the background, this PR allows llama-swap to send status updates in the reasoning_content of a streaming chat response.

Fixes: mostlygeek#366


Development

Successfully merging this pull request may close these issues.

feature request - use reasoning_content to show loading state
