
Sleep mode#407

Closed
napmany wants to merge 5 commits into mostlygeek:main from napmany:sleep-wake

Conversation

@napmany

@napmany napmany commented Nov 19, 2025

Add Sleep/Wake Support for Fast Model Switching

Summary

Adds sleep/wake functionality for fast model switching. Instead of killing and restarting processes, models can be put to sleep and woken up on demand, dramatically reducing swap times.

Designed for vLLM's sleep mode (blog post), but it will also work with any inference server that implements HTTP-based sleep/wake endpoints.

What Changed

State Machine:

  • Added 3 new states: StateSleepPending, StateAsleep, StateWaking
  • Automatic sleep during model swapping when configured
  • Falls back to traditional stop/start if sleep fails
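The new states and fallback imply a constrained transition graph. As a minimal sketch of how such transitions might be validated (the state names match the PR, but the table below is illustrative, not the actual isValidTransition() from proxy/process.go):

```go
package main

import "fmt"

type ProcessState string

const (
	StateReady        ProcessState = "ready"
	StateStopping     ProcessState = "stopping"
	StateSleepPending ProcessState = "sleepPending"
	StateAsleep       ProcessState = "asleep"
	StateWaking       ProcessState = "waking"
)

// validTransitions sketches the sleep-related edges: sleep and wake must pass
// through their intermediate states, and stopping is reachable from any of them.
var validTransitions = map[ProcessState][]ProcessState{
	StateReady:        {StateSleepPending, StateStopping},
	StateSleepPending: {StateAsleep, StateStopping},
	StateAsleep:       {StateWaking, StateStopping},
	StateWaking:       {StateReady, StateStopping},
}

func isValidTransition(from, to ProcessState) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isValidTransition(StateReady, StateSleepPending)) // true
	fmt.Println(isValidTransition(StateReady, StateAsleep))       // false: must pass through sleepPending
}
```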

Configuration:

# Global defaults
sleepRequestTimeout: 20
wakeRequestTimeout: 20

models:
  vllm-model:
    sleepEndpoints:
      - endpoint: /sleep?level=2
        method: POST
        timeout: 15  # optional override

    # Level 2 requires multi-step wake sequence
    wakeEndpoints:
      - endpoint: /wake_up
        method: POST
      - endpoint: /collective_rpc
        method: POST
        body: '{"method": "reload_weights"}'
      - endpoint: /reset_prefix_cache
        method: POST

Endpoints are called sequentially. Per-endpoint timeouts override global defaults.

Features:

  • Automatic sleep on model swap (when sleepEndpoints/wakeEndpoints configured)
  • API endpoint: POST /models/sleep/:model_id
  • UI displays sleep-related states

Summary by CodeRabbit

  • New Features

    • Sleep/wake model lifecycle: global and per-endpoint timeouts, multi-step wake sequences, new public sleep API route, and model sleep capability flag
    • UI: sleep action, expanded Actions column, and new status badges for sleepPending/asleep/waking
  • Documentation

    • vLLM sleep mode examples, config fields, and usage notes added
  • Bug Fixes

    • CI workflow step now always runs
  • Tests

    • Comprehensive sleep/wake configuration and behavior tests added

@coderabbitai

coderabbitai bot commented Nov 19, 2025

Walkthrough

Adds configurable model sleep/wake lifecycle: config schema and defaults for per-endpoint timeouts, HTTPEndpoint model fields, process state machine changes and HTTP sequencing for multi-step sleep/wake, new API/UI controls to trigger sleep, test coverage, and documentation updates.

Changes

  • CI Workflow — .github/workflows/go-ci-windows.yml: Removed the conditional guard so the "Create simple-responder" step always runs.
  • Docs & Examples — README.md, config.example.yaml, docs/configuration.md: Added sleep/wake feature docs, the new API route /models/sleep/:model_id, and vLLM sleep mode examples (Level 1/2) showing endpoint sequences and timeout usage.
  • Schema — config-schema.json: Added top-level sleepRequestTimeout and wakeRequestTimeout; added model-level sleepEndpoints and wakeEndpoints arrays with fields endpoint, method, body, timeout.
  • Config types & loading — proxy/config/model_config.go, proxy/config/config.go, proxy/config/config_posix_test.go, proxy/config/config_windows_test.go, proxy/config/config_test.go: New HTTPEndpoint type; added SleepEndpoints and WakeEndpoints to ModelConfig; added global SleepRequestTimeout and WakeRequestTimeout to Config; macro substitution extended to endpoints; defaults applied; added validateEndpointMacros; tests added/updated.
  • Process lifecycle & runtime — proxy/process.go, proxy/process_test.go: Added states sleepPending, asleep, waking; added methods Sleep(), wake(), MakeIdle(), makeReady(); HTTP helpers (buildFullURL, sendHTTPRequest, checkHealthEndpoint); multi-step sleep/wake sequencing; updated state transitions and wait-group handling; tests updated/added.
  • Process group & API — proxy/processgroup.go, proxy/proxymanager_api.go: ProcessGroup.SleepProcess(modelID) added; ProxyRequest uses MakeIdle instead of Stop for the active process; new API endpoint /api/models/sleep/*model and Model.SleepEnabled flag in status responses.
  • Test responder — cmd/simple-responder/simple-responder.go: Added HTTP endpoints used by tests: /sleep, /wake_up, /wake_up_fail, /collective_rpc, /reset_prefix_cache.
  • Frontend: API & UI — ui/src/contexts/APIProvider.tsx, ui/src/pages/Models.tsx, ui/src/index.css: Extended ModelStatus with sleepPending, asleep, waking; added sleepEnabled on Model; added sleepModel API method; updated the Models UI to expose sleep/wake actions and new status badges; added CSS classes for the new states and a button utility.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as ProxyManager API
    participant PG as ProcessGroup
    participant P as Process
    participant Remote as Remote Model Endpoint

    Note over Client,Remote: Sleep flow (high-level)
    Client->>API: POST /api/models/sleep/{model}
    API->>PG: SleepProcess(modelID)
    PG->>P: Sleep()
    P->>P: isSleepEnabled()
    alt enabled
        P->>Remote: sendSleepRequests() (step1..N)
        Remote-->>P: 200 OK
        P->>P: swapState Ready→sleepPending→asleep
        P-->>PG: success
        PG-->>API: 200 OK
    else not enabled
        P-->>PG: error
        PG-->>API: 400/ERR
    end
sequenceDiagram
    participant Client
    participant API as ProxyManager API
    participant PG as ProcessGroup
    participant P as Process
    participant Remote as Remote Model Endpoint

    Note over Client,Remote: Wake on request (high-level)
    Client->>API: Any request for model while asleep
    API->>PG: makeReady()
    PG->>P: wake() / makeReady()
    P->>Remote: sendWakeRequests() (step1..N)
    alt all OK
        Remote-->>P: 200 OK
        P->>P: swapState waking→ready
        P-->>PG: Ready
        PG-->>API: Forward client request
        API-->>Client: Response
    else failure
        Remote-->>P: error
        P->>P: fallback to full start (stop/start)
        P-->>PG: Stopped/Error
        PG-->>API: Error to client
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas to focus review on:
    • proxy/process.go — state machine changes, wait-group coordination, HTTP sequencing and fallback behavior
    • proxy/config/config.go & proxy/config/model_config.go — macro substitution, validation, defaulting for endpoint arrays
    • proxy/proxymanager_api.go and ProcessGroup.SleepProcess — API handler correctness and error handling
    • Tests in proxy/config/config_test.go and proxy/process_test.go for coverage and edge cases

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • mostlygeek

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 34.62%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Sleep mode' directly aligns with the PR's primary objective of adding sleep/wake functionality for fast model switching, which is the main change across the entire changeset.

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78d33ff and f6b9320.

📒 Files selected for processing (2)
  • proxy/process.go (15 hunks)
  • proxy/process_test.go (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-29T05:26:34.964Z
Learnt from: mostlygeek
Repo: mostlygeek/llama-swap PR: 371
File: proxy/process.go:0-0
Timestamp: 2025-10-29T05:26:34.964Z
Learning: In proxy/process.go, the loading message "llama-swap loading model: {name}" intentionally uses p.ID (Process.ID) rather than the realModelName from the request context. This is the correct design choice.

Applied to files:

  • proxy/process_test.go
  • proxy/process.go
📚 Learning: 2025-08-29T05:03:26.638Z
Learnt from: mostlygeek
Repo: mostlygeek/llama-swap PR: 278
File: proxy/process.go:368-379
Timestamp: 2025-08-29T05:03:26.638Z
Learning: For health check implementations in the llama-swap codebase, prefer simple and explicit HTTP client configurations over performance optimizations like preserving DefaultTransport settings, as health checks only need to verify HTTP 200 responses and clarity is more important than HTTP/2 support or connection pooling.

Applied to files:

  • proxy/process.go
🧬 Code graph analysis (2)
proxy/process_test.go (2)
proxy/process.go (8)
  • StateReady (30-30)
  • StateSleepPending (37-37)
  • StateAsleep (38-38)
  • StateStopping (31-31)
  • StateStopped (28-28)
  • StateWaking (39-39)
  • ErrInvalidStateTransition (167-167)
  • NewProcess (99-143)
proxy/config/model_config.go (1)
  • HTTPEndpoint (12-17)
proxy/process.go (1)
proxy/config/model_config.go (1)
  • HTTPEndpoint (12-17)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (16)
proxy/process_test.go (4)

106-106: LGTM!

The error message correctly reflects the new makeReady() method introduced in process.go.


263-274: LGTM!

The state transition test cases comprehensively cover the new sleep/wake states, including both valid and invalid transitions. The test cases align correctly with the isValidTransition() logic in process.go.


581-670: LGTM!

The first three sleep/wake test functions provide excellent coverage:

  • Basic sleep/wake cycle validation
  • Multi-step wake sequences (simulating vLLM level-2 sleep)
  • Verification that MakeIdle() uses Sleep when configured

All tests follow proper patterns with cleanup and state assertions.


672-703: LGTM!

The test correctly validates the wake failure fallback behavior. When wake endpoints fail, the implementation falls back to StopImmediately() followed by start() (process.go lines 592-595), which correctly results in a nil error and StateReady state after restart.

proxy/process.go (12)

36-43: LGTM!

The new sleep/wake state constants follow the existing naming conventions, and the httpDialTimeout constant provides a reasonable connection timeout for HTTP operations.


83-87: LGTM!

The new wait groups follow the existing pattern established by waitStarting and are correctly managed in the state transition logic.


188-197: LGTM!

The atomic increment of wait groups when entering transitional states ensures proper synchronization and prevents race conditions.


205-224: LGTM!

The state transition logic correctly enforces the sleep/wake state machine, requiring transitions through intermediate states (SleepPending, Waking) and allowing graceful termination from any sleep state.


252-259: LGTM!

MakeIdle() provides a clean abstraction that automatically chooses between sleep and stop based on configuration, simplifying model swapping logic.
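As a rough illustration of that abstraction (the type, field names, and fallback behavior below are assumptions based on the review text, not the PR's actual code):

```go
package main

import "fmt"

// proc is an illustrative stand-in for the real Process type.
type proc struct {
	sleepEndpoints []string
	wakeEndpoints  []string
	state          string
}

// isSleepEnabled requires both endpoint lists, mirroring the documented
// "both-or-none" rule for sleep/wake configuration.
func (p *proc) isSleepEnabled() bool {
	return len(p.sleepEndpoints) > 0 && len(p.wakeEndpoints) > 0
}

func (p *proc) Sleep() error { p.state = "asleep"; return nil }
func (p *proc) Stop() error  { p.state = "stopped"; return nil }

// MakeIdle chooses sleep when configured, otherwise a full stop; if sleep
// fails it falls back to stopping.
func (p *proc) MakeIdle() error {
	if p.isSleepEnabled() {
		if err := p.Sleep(); err != nil {
			return p.Stop() // fall back to traditional stop on sleep failure
		}
		return nil
	}
	return p.Stop()
}

func main() {
	withSleep := &proc{sleepEndpoints: []string{"/sleep"}, wakeEndpoints: []string{"/wake_up"}}
	withSleep.MakeIdle()
	fmt.Println(withSleep.state) // asleep

	plain := &proc{}
	plain.MakeIdle()
	fmt.Println(plain.state) // stopped
}
```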


379-418: LGTM!

The modifications correctly ensure that TTL monitoring continues during sleep states, which is appropriate since sleeping processes still consume resources and should be subject to idle timeout policies.


435-447: LGTM!

The variable rename from currentState to initState improves code clarity by explicitly indicating this is the state before the stopping transition.


463-510: LGTM!

The sleep/wake request sequencing correctly executes endpoints in order with proper logging and error handling. The isSleepEnabled() check correctly requires both sleep and wake endpoints to be configured.


512-565: LGTM!

The Sleep() implementation correctly handles:

  • Concurrent sleep operations via wait groups
  • Inflight request completion before sleeping
  • Sequential endpoint execution with fallback on failure
  • Proper state transitions with error handling

567-604: LGTM!

The wake() implementation correctly handles concurrent wake operations, executes wake sequences, and gracefully falls back to a full restart if wake endpoints fail. The state transitions and wait group management are properly synchronized.


628-704: LGTM!

The HTTP helper methods provide clean abstractions with proper timeout handling, error checking, and separation of concerns. The refactoring successfully consolidates health checks and sleep/wake requests into a common HTTP request path.


718-718: LGTM!

The ProxyRequest changes correctly:

  • Block incoming requests during StateSleepPending to prevent race conditions
  • Use makeReady() instead of start() to enable automatic wake-on-request
  • Update error messages to match the new method name

Also applies to: 757-758




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
proxy/config/config.go (1)

112-141: Global sleep/wake timeout config fields lack non-negative validation

SleepRequestTimeout and WakeRequestTimeout are added to Config with defaults set in LoadConfigFromReader (lines 179–180). However, unlike HealthCheckTimeout (lines 191–194) and StartPort (lines 196–198), these timeouts receive no validation. A negative value in YAML would pass unmarshal, propagate as-is through the int field, and later convert to a negative time.Duration at line 660 (time.Duration(endpoint.Timeout) * time.Second), potentially causing unexpected HTTP client behavior.

Add bounds validation after line 194—e.g., clamp to 0 or reject values < 0 with a config error—to match the pattern used for HealthCheckTimeout and prevent silent propagation of invalid timeouts.

proxy/process.go (1)

653-697: Validate global timeout defaults are non-zero to prevent indefinite hangs

The concern is valid: if sleepRequestTimeout or wakeRequestTimeout are configured to 0 in YAML, endpoints with unspecified timeouts get normalized to these zero values, resulting in http.Client.Timeout = 0 (no timeout) and potential indefinite hangs.

While the code validates individual endpoint timeouts cannot be negative (proxy/config/model_config.go:131) and normalizes zero endpoints to global defaults (proxy/config/config.go:403-410), there is no validation preventing the global timeout values themselves from being zero or negative.

Recommend adding validation for SleepRequestTimeout and WakeRequestTimeout in the config loading phase (proxy/config/config.go, after line 194) to ensure they're positive, similar to the existing HealthCheckTimeout minimum enforcement:

if config.HealthCheckTimeout < 15 {
	config.HealthCheckTimeout = 15
}
+
+if config.SleepRequestTimeout <= 0 {
+	config.SleepRequestTimeout = 10
+}
+if config.WakeRequestTimeout <= 0 {
+	config.WakeRequestTimeout = 10
+}
🧹 Nitpick comments (11)
cmd/simple-responder/simple-responder.go (1)

11-11: Sleep/wake test endpoints are wired correctly; consider minimal error handling

The added endpoints give you the right knobs for exercising sleep, wake, failure, and collective RPC behavior; using strings.Contains for "reload_weights" is sufficient for this test helper.

If you want a bit more robustness without much noise, you could at least check the read error:

-	r.POST("/collective_rpc", func(c *gin.Context) {
-		body, _ := io.ReadAll(c.Request.Body)
-		if strings.Contains(string(body), "reload_weights") {
+	r.POST("/collective_rpc", func(c *gin.Context) {
+		body, err := io.ReadAll(c.Request.Body)
+		if err != nil {
+			c.Status(http.StatusBadRequest)
+			return
+		}
+		if strings.Contains(string(body), "reload_weights") {
 			c.Status(http.StatusOK)
 		} else {
 			c.Status(http.StatusBadRequest)
 		}
 	})

Purely optional given this is a local test responder.

Also applies to: 268-292

ui/src/pages/Models.tsx (1)

40-40: Action column logic correctly reflects sleep/wake lifecycle

Pulling sleepModel from useAPI and mapping actions by model.state/sleepEnabled gives a clear, predictable UI:

  • stopped → Load.
  • asleep → Wake (via loadModel) + Unload.
  • ready → optional Sleep (when sleepEnabled) + Unload.
  • Transitional/other states → disabled button.

This matches the new state machine behavior and prevents conflicting actions while transitions are in progress. If you ever want to polish UX further, a tiny wrapper that awaits these async calls and disables the row while pending would avoid double-clicks, but it’s not required for correctness.

Also applies to: 167-199

docs/configuration.md (1)

75-85: Minor wording tweak: hyphenate “event‑driven”

In the features table, consider changing “event driven functionality” to “event‑driven functionality” for standard phrasing.

proxy/processgroup.go (1)

115-133: SleepProcess helper is safe and avoids holding the group lock during sleep

The flow—lookup under lock, capability check, unlock, then process.Sleep()—avoids blocking the group mutex on a potentially long-running operation and is safe given the process map is not mutated after initialization.

You might optionally enrich the error to include the model ID, e.g. "model %s does not support sleep mode", to ease debugging when called via the API.

proxy/process_test.go (2)

581-610: Basic Sleep/Wake test exercises happy path but not HTTP failure modes

TestProcess_SleepAndWakeBasic correctly validates the Ready→Asleep→Ready flow via Sleep() and wake(), but it only asserts state changes and not that the configured sleep/wake endpoints are actually hit or that HTTP failures are surfaced.

You might consider a follow-up test that uses a test responder configured to fail one of the sleep or wake endpoints and asserts that Sleep()/wake() behave as expected (e.g., falling back to StopImmediately() and ending in StateStopped), similar to TestProcess_WakeFailureFallsBackToStart, but for sleep as well.


641-670: Swap-related sleep test only checks final state, not side effects

TestProcess_SleepInsteadOfStopWithSwap confirms that MakeIdle() uses sleep (ending in StateAsleep) when sleep/wake is configured, which matches the intended behavior during swaps. It does not, however, assert that the upstream process remains running (as opposed to being stopped), so if you want stronger guarantees around “sleep vs stop” semantics, you could add assertions based on process PID or log content.

proxy/config/config.go (2)

176-185: Default sleep/wake timeouts are reasonable but 0 means “no timeout”

The defaults of 10 seconds for both SleepRequestTimeout and WakeRequestTimeout are sensible for local sleep/wake endpoints. Note that because endpoints with timeout: 0 inherit these values, but a global timeout of 0 will in turn leave endpoints at 0, this effectively means “no HTTP client timeout” for sleep/wake operations.

If unbounded sleep/wake HTTP requests are undesirable, you may want to treat 0 as “use default 10” (or enforce a minimum) and only allow explicit large values rather than “no timeout”.
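The suggested "treat 0 as the default" behavior could look like this sketch (the constant mirrors the documented 10-second default; the function name is hypothetical):

```go
package main

import "fmt"

// defaultSleepWakeTimeout matches the documented default of 10 seconds.
const defaultSleepWakeTimeout = 10

// normalizeTimeout treats zero or negative global values as "use the default"
// rather than "no HTTP client timeout", closing the unbounded-request gap.
func normalizeTimeout(configured int) int {
	if configured <= 0 {
		return defaultSleepWakeTimeout
	}
	return configured
}

func main() {
	fmt.Println(normalizeTimeout(0))  // 10: zero falls back to the default
	fmt.Println(normalizeTimeout(20)) // 20: explicit values pass through
}
```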


371-377: Endpoint macro validation is strict and returns early on first issue

validateEndpointMacros ensures that any remaining ${...} in sleep/wake endpoint paths or bodies after substitution are treated as errors, with reserved macros (PORT, MODEL_ID) required to have been fully expanded. This matches the existing validation for other fields. Note that it reports only the first offending macro; if you want more user-friendly diagnostics, a future enhancement could aggregate all errors in a single pass.
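An aggregate-all-errors variant could be sketched with a regexp that collects every leftover macro in one pass (the pattern and function name are illustrative, not the PR's validateEndpointMacros):

```go
package main

import (
	"fmt"
	"regexp"
)

// macroPattern matches any unexpanded ${...} placeholder left after
// substitution.
var macroPattern = regexp.MustCompile(`\$\{[^}]+\}`)

// findUnexpandedMacros returns all leftover macros instead of stopping at
// the first, enabling friendlier aggregated diagnostics.
func findUnexpandedMacros(s string) []string {
	return macroPattern.FindAllString(s, -1)
}

func main() {
	fmt.Println(findUnexpandedMacros("/models/${MODEL_ID}/sleep?port=${PORT}")) // [${MODEL_ID} ${PORT}]
}
```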

proxy/config/config_test.go (1)

873-923: Multi-step wake config test encodes default wake timeout as a contract

TestConfig_SleepWakeMultiStepWake verifies a vLLM-style multi-step wake sequence and asserts that unspecified wake timeouts default to 10 seconds, effectively pinning the default defined in config.go. This is helpful, but be aware that changing the default in code will require updating this test as well.

If you anticipate changing default timeouts in the future, you might factor the default value into a shared constant used both in code and tests to avoid magic numbers.

proxy/process.go (2)

438-453: StopImmediately now respects current state but repeats an older race pattern

Using initState := p.CurrentState() and feeding that into swapState(initState, StateStopping) broadens StopImmediately to work from all valid origin states (Starting, Ready, SleepPending, Asleep, Waking). However, as with Stop(), the code waits for inFlightRequests to drain before transitioning into StateStopping, which leaves a small window in which new requests can start between Wait() returning and the state change; those requests will not be counted in the wait.

If you want stronger guarantees, consider flipping the order to transition to StateStopping first (so ProxyRequest rejects new work) and then Wait() for existing in-flight requests to drain before calling StopImmediately(). This would close the TOCTOU window for both stop and TTL-triggered unloads.
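The suggested ordering can be sketched as a gate flipped before draining (the fields below are illustrative, not the actual Process struct; note that a production version would also need to make the check-and-admit step in handle() atomic, e.g. under a mutex):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type proc struct {
	stopping atomic.Bool
	inFlight sync.WaitGroup
}

// handle admits a request only while the process is not stopping.
// (Sketch only: the Load/Add pair here is itself check-then-act and would
// need to be coupled under a lock in real code.)
func (p *proc) handle() bool {
	if p.stopping.Load() {
		return false // rejected: process is stopping
	}
	p.inFlight.Add(1)
	defer p.inFlight.Done()
	return true
}

// stop flips the gate before draining, so no new request can start after
// Wait() returns — the ordering the review recommends.
func (p *proc) stop() {
	p.stopping.Store(true) // reject new work first
	p.inFlight.Wait()      // then drain in-flight requests
}

func main() {
	p := &proc{}
	fmt.Println(p.handle()) // true
	p.stop()
	fmt.Println(p.handle()) // false
}
```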


634-651: buildFullURL composes sleep/wake/health URLs robustly, but assumes a valid proxy base

buildFullURL correctly uses url.Parse and ResolveReference to combine the proxy base with per-endpoint paths, handling absolute vs relative endpoint paths. If config.Proxy is empty or malformed, though, all sleep/wake/health requests will fail at runtime with parse errors rather than being rejected at config load time.

You may want to strengthen config validation (in config.go) to reject models that define sleep/wake endpoints without a valid, non-empty proxy URL, so misconfiguration is caught early.
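A self-contained sketch of the url.Parse + ResolveReference approach the review describes (the signature is an assumption, not the exact helper in proxy/process.go):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildFullURL resolves an endpoint path against the proxy base URL,
// handling both absolute and relative endpoint paths.
func buildFullURL(proxyBase, endpoint string) (string, error) {
	base, err := url.Parse(proxyBase)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	// ResolveReference follows RFC 3986 reference resolution.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	u, err := buildFullURL("http://localhost:9090", "/sleep?level=2")
	fmt.Println(u, err) // http://localhost:9090/sleep?level=2 <nil>
}
```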

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86e9b93 and e81ef6f.

📒 Files selected for processing (18)
  • .github/workflows/go-ci-windows.yml (0 hunks)
  • README.md (1 hunks)
  • cmd/simple-responder/simple-responder.go (2 hunks)
  • config-schema.json (2 hunks)
  • config.example.yaml (2 hunks)
  • docs/configuration.md (3 hunks)
  • proxy/config/config.go (7 hunks)
  • proxy/config/config_posix_test.go (1 hunks)
  • proxy/config/config_test.go (1 hunks)
  • proxy/config/config_windows_test.go (1 hunks)
  • proxy/config/model_config.go (3 hunks)
  • proxy/process.go (15 hunks)
  • proxy/process_test.go (3 hunks)
  • proxy/processgroup.go (2 hunks)
  • proxy/proxymanager_api.go (5 hunks)
  • ui/src/contexts/APIProvider.tsx (4 hunks)
  • ui/src/index.css (3 hunks)
  • ui/src/pages/Models.tsx (3 hunks)
💤 Files with no reviewable changes (1)
  • .github/workflows/go-ci-windows.yml
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-10-29T05:26:34.964Z
Learnt from: mostlygeek
Repo: mostlygeek/llama-swap PR: 371
File: proxy/process.go:0-0
Timestamp: 2025-10-29T05:26:34.964Z
Learning: In proxy/process.go, the loading message "llama-swap loading model: {name}" intentionally uses p.ID (Process.ID) rather than the realModelName from the request context. This is the correct design choice.

Applied to files:

  • proxy/proxymanager_api.go
  • proxy/process_test.go
  • proxy/processgroup.go
  • proxy/process.go
📚 Learning: 2025-10-07T05:41:52.728Z
Learnt from: mostlygeek
Repo: mostlygeek/llama-swap PR: 337
File: proxy/config/config.go:268-291
Timestamp: 2025-10-07T05:41:52.728Z
Learning: In the mostlygeek/llama-swap repository's proxy/config/config.go, macro-in-macro substitution requires that referenced macros be declared earlier in the YAML file (declaration order matters). A macro can only reference macros that appear before it in the configuration. The YAML order is preserved by the UnmarshalYAML implementation for MacroList, and the LIFO substitution approach is intentional.

Applied to files:

  • proxy/config/config.go
📚 Learning: 2025-08-29T05:03:26.638Z
Learnt from: mostlygeek
Repo: mostlygeek/llama-swap PR: 278
File: proxy/process.go:368-379
Timestamp: 2025-08-29T05:03:26.638Z
Learning: For health check implementations in the llama-swap codebase, prefer simple and explicit HTTP client configurations over performance optimizations like preserving DefaultTransport settings, as health checks only need to verify HTTP 200 responses and clarity is more important than HTTP/2 support or connection pooling.

Applied to files:

  • proxy/process.go
🧬 Code graph analysis (6)
proxy/proxymanager_api.go (2)
proxy/process.go (3)
  • StateSleepPending (37-37)
  • StateAsleep (38-38)
  • StateWaking (39-39)
ui/src/contexts/APIProvider.tsx (1)
  • Model (7-14)
proxy/config/config.go (1)
proxy/config/model_config.go (2)
  • ModelConfig (19-54)
  • HTTPEndpoint (12-17)
proxy/config/config_test.go (1)
proxy/config/config.go (1)
  • LoadConfigFromReader (170-453)
ui/src/pages/Models.tsx (1)
ui/src/contexts/APIProvider.tsx (1)
  • useAPI (291-297)
proxy/process_test.go (2)
proxy/process.go (8)
  • StateReady (30-30)
  • StateSleepPending (37-37)
  • StateAsleep (38-38)
  • StateStopping (31-31)
  • StateStopped (28-28)
  • StateWaking (39-39)
  • ErrInvalidStateTransition (167-167)
  • NewProcess (99-143)
proxy/config/model_config.go (1)
  • HTTPEndpoint (12-17)
proxy/process.go (1)
proxy/config/model_config.go (1)
  • HTTPEndpoint (12-17)
🪛 LanguageTool
docs/configuration.md

[uncategorized] ~81-~81: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: ...me | | hooks | event driven functionality | | `...

(EN_COMPOUND_ADJECTIVE_INTERNAL)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (42)
README.md (2)

32-32: Documenting /models/sleep/:model_id looks consistent

The new route description matches the intended behavior and clearly calls out the sleep/wake configuration requirement. No changes needed.


39-39: Feature bullet matches new sleep/wake capability

The “Fast model switching with sleep/wake support” text is accurate and gives a good mental model (offloading vs full restart). Reads well as-is.

ui/src/index.css (1)

136-138: New status and button utility styles align with the sleep/wake states

The added .status-badge and the status--sleepPending / status--waking / status--asleep / status--shutdown modifiers line up with the extended ModelStatus union, and .btn--action gives a consistent min width for the action buttons. Styling is cohesive with the existing status and button patterns.

Also applies to: 153-165, 179-181

ui/src/contexts/APIProvider.tsx (2)

4-4: API types cleanly model new sleep-related state and capabilities

Extending ModelStatus with sleep-related values and adding sleepEnabled to Model, plus exposing sleepModel on APIProviderType, keeps the type surface in sync with the backend behavior and the Models UI. The typing is coherent and should make misuse of unsupported states harder.

Also applies to: 7-14, 16-29


256-268: sleepModel wiring is consistent with existing load/unload helpers

The sleepModel helper mirrors the patterns used by unloadAllModels/loadModel: POST to the new /api/models/sleep/{model} endpoint, throw on non-OK, log before rethrowing, and is correctly threaded into the memoized context value and deps. Looks good and ready for the UI consumer.

Also applies to: 277-285

ui/src/pages/Models.tsx (1)

149-151: Status column uses the new badge styles consistently

Adding the “Actions” header and switching the state cell to status-badge text-center status status--${model.state} ties neatly into the new CSS variants. All defined ModelStatus values now have a visual treatment, with unknowns still falling back to the base .status styling.

Also applies to: 200-202

config.example.yaml (2)

24-35: Global sleep/wake timeout defaults look consistent

The new sleepRequestTimeout / wakeRequestTimeout settings and comments are clear and line up with the schema semantics (global default, per-endpoint override, used during swap). No issues from a config/UX perspective.


258-335: vLLM sleep mode examples are well-scoped and match the config surface

The vllm-sleep-level1 and vllm-sleep-level2 examples correctly illustrate:

  • Use of sleepEndpoints / wakeEndpoints
  • Per-endpoint HTTP method/body/timeout
  • Macro usage and level 1 vs level 2 behavior

These are good, concrete templates for users to copy without surprises.

config-schema.json (2)

51-62: Root sleep/wake timeout schema matches intended behavior

sleepRequestTimeout / wakeRequestTimeout are correctly modeled as integers with minimum: 1, default: 10, and descriptions that explain the global vs per-endpoint behavior. This aligns with the usage shown in YAML examples and tests.


230-297: sleepEndpoints/wakeEndpoints schema is precise and validator-friendly

The new sleepEndpoints / wakeEndpoints definitions:

  • Require endpoint
  • Constrain method to a small enum with sensible default
  • Use timeout >= 0 with 0 clearly documented as “use global timeout”
  • Disallow additionalProperties

This should give good feedback from schema-aware editors while matching the Go HTTPEndpoint type and config behavior.

docs/configuration.md (2)

124-135: Global sleep/wake timeout docs are clear and match behavior

The description of sleepRequestTimeout / wakeRequestTimeout here matches the config schema and example YAML: global defaults, per-endpoint override via timeout, and usage during swap/wake. Good level of detail for users.


321-398: vLLM sleep mode documentation aligns with config examples

The Level 1 and Level 2 vLLM sleep mode examples in the docs correctly reflect:

  • The same YAML structure as config.example.yaml
  • Level-specific endpoints and multi-step wake sequence
  • Optional per-endpoint timeout overrides

Nice balance of explanation and copy‑pasteable snippets.

proxy/config/config_windows_test.go (1)

204-207: Windows config defaults extended correctly for sleep/wake

Adding SleepRequestTimeout: 10 and WakeRequestTimeout: 10 alongside existing HealthCheckTimeout and MetricsMaxInMemory keeps Windows expectations in sync with Posix and the new global defaults. Looks good.

proxy/config/config_posix_test.go (1)

215-218: Posix config defaults updated to include sleep/wake timeouts

Including SleepRequestTimeout: 10 and WakeRequestTimeout: 10 in the expected Config struct keeps this test aligned with the new global defaults and the Windows test. No functional concerns.

proxy/processgroup.go (1)

68-70: Switching from Stop() to MakeIdle() on last-used process is reasonable

Using lastProcess.MakeIdle() when swapping models keeps the group logic the same while delegating finer-grained lifecycle control to Process (which now understands sleep/wake). No issues spotted in the locking or flow here.

proxy/proxymanager_api.go (2)

15-22: Model.SleepEnabled + sleep-related states are wired through cleanly

The addition of SleepEnabled and the extra state mappings (sleepPending, asleep, waking) fit naturally into getModelStatus:

  • sleepEnabled is computed directly from process.isSleepEnabled(), so the UI can avoid offering sleep where unsupported.
  • Existing models default to SleepEnabled: false without breaking compatibility.
  • State string values mirror the new ProcessState constants.

This is a straightforward, backward-compatible extension of the status API.

Also applies to: 52-84


243-263: New /api/models/sleep/:model_id endpoint mirrors unload flow and looks correct

The apiSleepSingleModelHandler:

  • Resolves aliases via RealModelName, just like unload.
  • Validates the process group and propagates errors from SleepProcess.
  • Returns "OK" on success.

Given that callers can already inspect sleepEnabled from the model list, this endpoint behavior is consistent and predictable.

proxy/config/model_config.go (1)

11-17: Sleep/wake endpoint config validation is solid and matches documented semantics

The additions here are well-structured:

  • HTTPEndpoint cleanly models the YAML structure used in examples and schema.
  • SleepEndpoints / WakeEndpoints on ModelConfig give per-model control without impacting existing configs (empty by default).
  • UnmarshalYAML:
    • Preserves existing defaults.
    • Enforces the “both-or-none” rule for sleep/wake endpoints, which guards against partially configured models.
    • Validates each endpoint with clear, indexed error messages.
  • validateEndpoint:
    • Requires a non-empty endpoint path.
    • Defaults Method to POST, uppercases it, and restricts to a small safe set.
    • Ensures Timeout is non-negative, aligning with 0 as “use global timeout”.

Overall this is a robust, backward-compatible extension of the model config surface.

Also applies to: 30-33, 56-109, 111-135
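The validation rules above can be sketched as follows. This is an illustrative reconstruction from the review's description, not the PR's actual code; the `HTTPEndpoint` field names and the `validateEndpoint` signature are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// HTTPEndpoint mirrors the YAML structure described in the review
// (field names assumed from the config examples).
type HTTPEndpoint struct {
	Endpoint string
	Method   string
	Body     string
	Timeout  int
}

// validateEndpoint applies the documented rules: non-empty path,
// method defaulted to POST and uppercased, a small allowed method
// set, and a non-negative timeout (0 means "use global timeout").
func validateEndpoint(ep *HTTPEndpoint, idx int) error {
	if ep.Endpoint == "" {
		return fmt.Errorf("endpoint[%d]: endpoint path must not be empty", idx)
	}
	if ep.Method == "" {
		ep.Method = "POST"
	}
	ep.Method = strings.ToUpper(ep.Method)
	switch ep.Method {
	case "GET", "POST", "PUT", "DELETE":
		// allowed
	default:
		return fmt.Errorf("endpoint[%d]: unsupported method %q", idx, ep.Method)
	}
	if ep.Timeout < 0 {
		return fmt.Errorf("endpoint[%d]: timeout must be >= 0", idx)
	}
	return nil
}

func main() {
	ep := HTTPEndpoint{Endpoint: "/sleep?level=2", Method: "post"}
	if err := validateEndpoint(&ep, 0); err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(ep.Method) // normalized to POST
}
```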

proxy/process_test.go (4)

100-112: Assertion string now correctly matches makeReady error path

Updating the expected substring to "unable to makeReady process" aligns this test with the new ProxyRequest error message when makeReady() fails, so the test remains meaningful with the refactored startup/wake logic.


241-275: Sleep/wake state transition table looks consistent with isValidTransition

The added cases for SleepPending, Asleep, and Waking match the transition rules in isValidTransition and exercise both valid and invalid edges (e.g., Ready→SleepPending, SleepPending→Asleep, Asleep→Waking, and invalid Ready→Asleep/Asleep→Ready), which gives good coverage of the new lifecycle states.


612-639: Multi-step wake test nicely validates ordered endpoint execution

TestProcess_MultiStepWakeSequence does a good job verifying that a multi-step wake sequence succeeds end-to-end and that the process ends up back in StateReady after executing all three configured endpoints; this is a solid coverage point for the multi-endpoint wake feature.


672-704: Wake-failure fallback behavior is tested clearly

TestProcess_WakeFailureFallsBackToStart correctly asserts that a failing wake endpoint produces an error mentioning “wake” and leaves the process in StateStopped, matching the fallback-to-stop behavior in wake(). This gives good coverage for the error path.

proxy/config/config.go (4)

269-302: Macro expansion into sleep/wake endpoints is consistent with existing fields

Extending the LIFO macro substitution pass into SleepEndpoints and WakeEndpoints (for both Endpoint and Body) mirrors the behavior for Cmd, Proxy, and Filters.StripParams. This makes endpoint configuration fully macro-aware, including MODEL_ID and other custom macros, and is aligned with the existing substitution rules.


304-344: PORT macro substitution into sleep/wake endpoints matches core behavior

Substituting ${PORT} into SleepEndpoints and WakeEndpoints in the same pass that handles Cmd, CmdStop, and Proxy ensures endpoint URLs and bodies stay in sync with the allocated port. This avoids subtle mismatches between inference endpoints and sleep/wake URLs when macros are used.


400-411: Defaulting per-endpoint timeouts from global config works as intended

The loops that fill SleepEndpoints[i].Timeout and WakeEndpoints[i].Timeout when they are 0 correctly implement “inherit from global timeout unless explicitly overridden,” which is exactly what the new tests assert. This centralizes timeout behavior and keeps endpoint YAML concise.
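The inheritance rule is simple enough to sketch in a few lines. The struct and function names below are assumptions for illustration, not the PR's actual identifiers; only the zero-means-inherit behavior comes from the review.

```go
package main

import "fmt"

// HTTPEndpoint is a minimal stand-in for the config type
// discussed in the review (field names assumed).
type HTTPEndpoint struct {
	Endpoint string
	Timeout  int
}

// applyDefaultTimeouts fills a zero per-endpoint timeout from the
// global default, leaving explicit overrides untouched.
func applyDefaultTimeouts(endpoints []HTTPEndpoint, globalTimeout int) {
	for i := range endpoints {
		if endpoints[i].Timeout == 0 {
			endpoints[i].Timeout = globalTimeout
		}
	}
}

func main() {
	eps := []HTTPEndpoint{
		{Endpoint: "/sleep?level=2"},             // inherits global
		{Endpoint: "/wake_up", Timeout: 15},      // explicit override
	}
	applyDefaultTimeouts(eps, 20)
	fmt.Println(eps[0].Timeout, eps[1].Timeout) // 20 15
}
```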


615-630: validateEndpointMacros silently skips empty fields; behavior seems fine

validateEndpointMacros checks both Endpoint and Body strings per entry and ignores them when empty, which is appropriate for optional bodies. The function also correctly distinguishes between reserved macros not being substituted and unknown macros. The current behavior is sound; no changes needed.

proxy/config/config_test.go (6)

765-808: Basic sleep/wake config test maps well to new global defaults

TestConfig_SleepWakeBasicConfiguration validates that global sleepRequestTimeout and wakeRequestTimeout are read correctly and applied to per-endpoint timeouts when not overridden. This directly tests the defaulting logic in LoadConfigFromReader and is a good sanity check.


805-836: Macro substitution test gives strong coverage for endpoints and bodies

TestConfig_SleepWakeMacroSubstitution confirms that both global (SLEEP_LEVEL) and implicit (MODEL_ID, PORT) macros are correctly expanded into sleepEndpoints and wakeEndpoints, including within JSON bodies. This tightly couples the test to the intended substitution semantics and should catch regressions in macro handling for endpoints.
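The substitution the test exercises can be illustrated with a simplified pass. The real code in config.go applies macros in LIFO order within its existing substitution machinery; this sketch just shows the effect on endpoint URLs and JSON bodies, with `expandMacros` being a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// expandMacros replaces ${NAME} references in order; a simplified
// model of the macro pass described above, not the actual code.
func expandMacros(s string, macros [][2]string) string {
	for _, m := range macros {
		s = strings.ReplaceAll(s, "${"+m[0]+"}", m[1])
	}
	return s
}

func main() {
	macros := [][2]string{
		{"PORT", "8080"},
		{"MODEL_ID", "vllm-model"},
	}
	// Macros expand both into endpoint URLs and into JSON bodies.
	fmt.Println(expandMacros("http://127.0.0.1:${PORT}/sleep?level=2", macros))
	fmt.Println(expandMacros(`{"method": "reload_weights", "model": "${MODEL_ID}"}`, macros))
}
```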


838-871: Timeout override test correctly captures inheritance vs override behavior

TestConfig_SleepWakeTimeoutOverrides clearly asserts that explicit timeout values on endpoints override global settings while endpoints without timeout inherit from wakeRequestTimeout. This matches the logic in config.go and is a good behavioral spec.


925-945: Unknown macro test in endpoints correctly exercises new validation

TestConfig_SleepWakeUnknownMacroInEndpoints ensures that leftover ${UNKNOWN_MACRO} references in sleepEndpoints produce a configuration error mentioning the macro name, model ID, and field (sleepEndpoints). This directly verifies validateEndpointMacros and should prevent misconfigured sleep/wake URLs from slipping through.


947-970: Default timeouts test confirms config.go’s 10s default behavior

TestConfig_SleepWakeDefaultTimeouts verifies that when no global or per-endpoint timeouts are specified, both sleep and wake endpoints receive a default of 10 seconds. This matches the defaults in LoadConfigFromReader and strengthens confidence in the implicit behavior.


972-1008: Model-level macro precedence test nicely mirrors existing macro behavior

TestConfig_SleepWakeModelLevelMacros confirms that model-level macros override global macros when substituting into sleepEndpoints, just like they do for other fields. This is an important coverage point to ensure consistency across all macro-using config fields.

proxy/process.go (10)

27-43: New sleep/wake states and httpDialTimeout introduce clear lifecycle phases

Adding StateSleepPending, StateAsleep, and StateWaking plus a dedicated httpDialTimeout constant cleanly models the sleep/wake lifecycle and decouples TCP connection timeout from per-endpoint request timeouts. The chosen 500ms dial timeout is reasonable for local inference servers.


80-88: Additional WaitGroups for Sleep/Wake operations follow the existing start pattern

Introducing waitSleeping and waitWaking mirrors waitStarting and enables deduplication of concurrent Sleep() / wake() calls. This structure is appropriate for coordinating concurrent callers around transitional states.


170-202: WaitGroup increments inside swapState are subtle but generally sound

Incrementing waitStarting, waitSleeping, and waitWaking inside swapState when entering transitional states ensures the counter is incremented before other goroutines can observe the new state and call Wait(), which is key to avoiding races in the “already starting/sleeping/waking” fast paths. Reuse of these WaitGroups across cycles follows the same pattern as existing start logic and appears correct.
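The ordering guarantee described above can be modeled in miniature: because the WaitGroup is incremented under the same lock that publishes the transitional state, any goroutine that observes that state can safely `Wait()`. This is a simplified sketch with assumed names, not the Process implementation.

```go
package main

import (
	"fmt"
	"sync"
)

type proc struct {
	mu           sync.Mutex
	state        string
	waitSleeping sync.WaitGroup
}

// swapState performs a compare-and-swap on the state; the WaitGroup
// is incremented before the new state becomes visible to others.
func (p *proc) swapState(from, to string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.state != from {
		return false
	}
	if to == "sleepPending" {
		p.waitSleeping.Add(1) // incremented under the same lock
	}
	p.state = to
	return true
}

// sleep either performs the transition itself or, if another caller
// already started it, blocks until that caller finishes.
func (p *proc) sleep() {
	if !p.swapState("ready", "sleepPending") {
		p.waitSleeping.Wait() // dedupe concurrent sleep calls
		return
	}
	// ... send sleep requests here ...
	p.swapState("sleepPending", "asleep")
	p.waitSleeping.Done()
}

func main() {
	p := &proc{state: "ready"}
	p.sleep()
	fmt.Println(p.state) // asleep
}
```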


205-225: isValidTransition correctly encodes new sleep/wake transitions

The updated isValidTransition allows:

  • Ready→SleepPending,
  • SleepPending→Asleep/Stopping/Stopped,
  • Asleep→Waking/Stopping,
  • Waking→Ready/Stopping/Stopped,

while still enforcing no transitions from StateShutdown. This matches the intended lifecycle and is consistent with the new tests in TestProcess_SwapState.
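The edge list above lends itself to a table-driven check. The following is a reconstruction from the transitions listed in the review, with state names and structure assumed, not the actual isValidTransition implementation.

```go
package main

import "fmt"

type ProcessState string

const (
	StateReady        ProcessState = "ready"
	StateSleepPending ProcessState = "sleepPending"
	StateAsleep       ProcessState = "asleep"
	StateWaking       ProcessState = "waking"
	StateStopping     ProcessState = "stopping"
	StateStopped      ProcessState = "stopped"
	StateShutdown     ProcessState = "shutdown"
)

// validTransitions encodes only the sleep/wake edges listed above;
// the real table also covers start/stop edges.
var validTransitions = map[ProcessState][]ProcessState{
	StateReady:        {StateSleepPending, StateStopping},
	StateSleepPending: {StateAsleep, StateStopping, StateStopped},
	StateAsleep:       {StateWaking, StateStopping},
	StateWaking:       {StateReady, StateStopping, StateStopped},
	StateShutdown:     {}, // terminal: no transitions out
}

func isValidTransition(from, to ProcessState) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isValidTransition(StateReady, StateSleepPending)) // true
	fmt.Println(isValidTransition(StateReady, StateAsleep))       // false
}
```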

394-424: TTL monitoring correctly accounts for new sleep/wake states

startUnloadMonitoring now treats StateReady, StateSleepPending, StateAsleep, and StateWaking as valid “alive” states and continues TTL checks across sleep/wake transitions, while still skipping when there are in-flight requests. This keeps the “UnloadAfter” behavior consistent even when models are put to sleep.


469-511: sendSleepRequests/sendWakeRequests sequencing and logging are straightforward

Both helpers iterate through configured endpoints in order, log each step with method/URL/timeout, and abort on the first failure, which is then surfaced back to Sleep()/wake() and ultimately to callers. This strongly matches the intended “multi-step vLLM wake sequence” semantics.


518-571: Sleep implementation is mostly solid but relies on Ready→SleepPending swap

Sleep():

  • Verifies sleep is configured,
  • Handles concurrent callers by waiting on waitSleeping when already StateSleepPending,
  • Waits for in-flight requests, and
  • Transitions Ready→SleepPending→Asleep, falling back to StopImmediately() on HTTP failure or invalid state transition.

The only nuance is that it always calls swapState(StateReady, StateSleepPending), so a concurrent transition out of Ready (e.g., to Stopping) will cause a mismatch and early return with a warning, which is acceptable but worth keeping in mind when orchestrating higher-level flows.


573-610: wake() concurrency handling pairs well with waitWaking

wake() uses swapState(StateAsleep, StateWaking) to initiate the transition and, on expected-state mismatches, will:

  • Block on waitWaking if another wake is already in progress, returning success if the final state is StateReady, or
  • Error out if the process is in any other state.

On success, it transitions Waking→Ready and logs duration; on failure of HTTP steps it falls back to StopImmediately() and returns an error. Combined with the recommended makeReady fix, this provides a robust, deduplicated wake path.


699-710: checkHealthEndpoint reuses the generic HTTP helper cleanly

Wrapping health checks as a synthetic HTTPEndpoint with a fixed 5-second timeout and feeding it through sendHTTPRequest keeps the HTTP logic centralized and enforces the same connection/timeout behavior as sleep/wake endpoints. This is a nice simplification compared to hand-rolling separate health-check code.


722-777: ProxyRequest now blocks SleepPending and correctly delegates readiness to makeReady

The updated ProxyRequest:

  • Rejects requests when the process is Shutdown, Stopping, or SleepPending,
  • Applies concurrency limits and in-flight tracking, and
  • Uses makeReady() (with SSE loading streams when enabled) whenever the state isn’t Ready.

This is a good fit for the new sleep/wake model and keeps the “start vs wake” decision fully encapsulated inside makeReady().


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
proxy/process.go (1)

653-697: Refactor to improve HTTP client efficiency and handle response bodies properly

Two improvements recommended:

  1. Response body handling: The response body is never read (only closed at line 690), which can prevent HTTP/1.1 connection reuse. Reading and discarding the body is best practice.

  2. Client creation overhead: Creating a new http.Client for each endpoint call (line 663) prevents connection pooling. While acceptable for infrequent sleep/wake operations, consider reusing a client or using a package-level client for better performance in multi-step sleep/wake sequences.

Apply this diff to properly handle response bodies:

```diff
 	resp, err := client.Do(req)
 	if err != nil {
 		return err
 	}
 	defer resp.Body.Close()
+
+	// Read and discard body to enable connection reuse
+	_, _ = io.Copy(io.Discard, resp.Body)
 
 	if resp.StatusCode != http.StatusOK {
 		return fmt.Errorf("status code: %d", resp.StatusCode)
 	}
```

(Note: the added io.Copy call also requires the io package in the file's imports.)

For the client reuse, consider creating a reusable client in NewProcess() and storing it in the Process struct, since the learnings indicate preferring simplicity for health checks. However, for sleep/wake operations that may involve multiple sequential calls, connection reuse could provide better performance.
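One way to realize the client-reuse suggestion is a single shared client whose transport carries the dial timeout, while per-request deadlines still come from each endpoint's timeout via context. This is a sketch of the suggestion, not the PR's code; `sleepWakeClient` and the 500ms value (the review's httpDialTimeout) are illustrative.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// A shared client for sleep/wake calls: the Transport (and its
// connection pool) is reused across all endpoint requests in a
// multi-step sequence, unlike a per-call http.Client.
var sleepWakeClient = &http.Client{
	Transport: &http.Transport{
		DialContext: (&net.Dialer{
			Timeout: 500 * time.Millisecond, // TCP dial timeout only
		}).DialContext,
	},
	// No Client.Timeout here: per-endpoint timeouts would be applied
	// with context.WithTimeout on each request instead.
}

func main() {
	fmt.Println("client ready:", sleepWakeClient.Transport != nil)
}
```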

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e81ef6f and 78d33ff.

📒 Files selected for processing (2)
  • proxy/config/config.go (8 hunks)
  • proxy/process.go (15 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • proxy/config/config.go
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-29T05:03:26.638Z
Learnt from: mostlygeek
Repo: mostlygeek/llama-swap PR: 278
File: proxy/process.go:368-379
Timestamp: 2025-08-29T05:03:26.638Z
Learning: For health check implementations in the llama-swap codebase, prefer simple and explicit HTTP client configurations over performance optimizations like preserving DefaultTransport settings, as health checks only need to verify HTTP 200 responses and clarity is more important than HTTP/2 support or connection pooling.

Applied to files:

  • proxy/process.go
📚 Learning: 2025-10-29T05:26:34.964Z
Learnt from: mostlygeek
Repo: mostlygeek/llama-swap PR: 371
File: proxy/process.go:0-0
Timestamp: 2025-10-29T05:26:34.964Z
Learning: In proxy/process.go, the loading message "llama-swap loading model: {name}" intentionally uses p.ID (Process.ID) rather than the realModelName from the request context. This is the correct design choice.

Applied to files:

  • proxy/process.go
🧬 Code graph analysis (1)
proxy/process.go (1)
proxy/config/model_config.go (1)
  • HTTPEndpoint (12-17)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: run-tests
🔇 Additional comments (4)
proxy/process.go (4)

469-511: LGTM! Sequential endpoint execution is well-implemented

The sleep and wake request handlers correctly execute endpoints in sequence with clear logging and error reporting. The step-by-step approach (lines 477-484, 499-506) ensures multi-step sleep/wake operations complete in order, which is essential for operations like level-2 sleep that require multiple API calls.


518-571: LGTM! Sleep() correctly handles concurrent calls and state transitions

The implementation properly:

  • Blocks concurrent Sleep() calls using waitSleeping (lines 528-537)
  • Waits for in-flight requests before sleeping (line 546)
  • Falls back to StopImmediately() if sleep requests fail (line 559)
  • Transitions through StateSleepPending → StateAsleep with proper error handling

573-610: LGTM! wake() implementation follows established patterns

The method correctly handles:

  • Concurrent wake() calls by waiting on waitWaking (lines 580-585)
  • State transition from StateAsleep → StateWaking → StateReady
  • Error recovery by falling back to StopImmediately() (line 600)

The implementation pattern matches start() method's concurrent-call handling (lines 285-291), ensuring consistency.


634-651: LGTM! Clean refactoring of health check logic

The refactoring successfully:

  • Extracts URL building logic into buildFullURL() using proper url.Parse and ResolveReference
  • Unifies HTTP request handling through sendHTTPRequest() by creating an HTTPEndpoint config (lines 702-707)
  • Sets a reasonable 5-second timeout for health checks (line 705)

This abstraction eliminates code duplication and makes the health check logic consistent with sleep/wake endpoint handling.

Also applies to: 699-710
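The url.Parse / ResolveReference approach the refactor uses can be sketched as follows; the `buildFullURL` name comes from the review, but the signature and base URL here are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildFullURL resolves an endpoint path (possibly with a query
// string) against the process's base URL, letting net/url handle
// joining instead of string concatenation.
func buildFullURL(base, endpoint string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(endpoint)
	if err != nil {
		return "", err
	}
	return baseURL.ResolveReference(ref).String(), nil
}

func main() {
	u, err := buildFullURL("http://127.0.0.1:8080", "/sleep?level=2")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u) // http://127.0.0.1:8080/sleep?level=2
}
```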

@mostlygeek
Owner

If I understand vllm sleep correctly, it would require the vllm process continue running. API calls tell it to sleep the model using one of two stages.

Most people use llama-swap with llama-server. When llama-server also has support for the sleep api/functionality that’s when it’s time to consider a feature like this.

@napmany
Author

napmany commented Nov 19, 2025

If I understand vllm sleep correctly, it would require the vllm process continue running. API calls tell it to sleep the model using one of two stages.

Most people use llama-swap with llama-server. When llama-server also has support for the sleep api/functionality that’s when it’s time to consider a feature like this.

Hey @mostlygeek,
I'd like to address the "too early" concern and explain why I think this is actually the right time to merge this feature:
1. vLLM sleep mode is a production-ready feature
vLLM's sleep mode was announced on their blog and is being actively used. This isn't experimental - it's a production feature that reduces swap times from ~30-90 seconds to ~1-5 seconds. llama-swap users running vLLM would benefit immediately.
2. Market Opportunity: Attract vLLM Users
There's currently a lack of lightweight solutions that support vLLM's sleep mode without requiring k8s. Most enterprise solutions are heavy and overkill for single-node or small-scale deployments. llama-swap is perfectly positioned to fill this gap.
This feature could significantly expand llama-swap's user base and establish it as the go-to solution for vLLM model swapping.
3. Future-Proof, Generic Design
I intentionally designed this to be provider-agnostic and to avoid tight coupling with vLLM.
The implementation uses generic HTTP endpoint configurations - just method, URL, body, and timeout. When llama.cpp or other servers add sleep APIs, users can configure them with little or no llama-swap code change.
4. Zero Impact on Existing Users
This implementation is completely opt-in, with no breaking changes to existing configs.
5. Significant Performance Benefit
For users running multiple vLLM models, this eliminates the biggest pain point: swap latency. Model switching becomes ~30x faster, which dramatically improves user experience during model switches.

What would it take to get this merged? Are there specific concerns I can address, or changes you'd like to see?

@mostlygeek
Owner

@napmany

I’m not going to merge this into llama-swap for now. Please maintain it in your fork until/if the sleep API becomes more widely supported.

@mostlygeek mostlygeek closed this Nov 20, 2025