Closed

Changes from all commits (46 commits)
- 3989c17 add timeout feature (overcuriousity, Jan 30, 2026)
- c34372c implement first draft of new feature (overcuriousity, Jan 30, 2026)
- 29ef364 proxy/config: fix RPC endpoint parsing on Windows (overcuriousity, Jan 30, 2026)
- ac074d1 fix unit test (overcuriousity, Jan 30, 2026)
- c8f2761 rework web interface (overcuriousity, Jan 30, 2026)
- 6f023c7 fix error assumption healthy (overcuriousity, Jan 30, 2026)
- c17df42 proxy: make RPC health checks independent of process state (overcuriousity, Jan 30, 2026)
- 4987daf WIP: web config changes (overcuriousity, Jan 30, 2026)
- e6f9f9a proxy: fix requestTimeout feature to actually terminate requests (overcuriousity, Jan 31, 2026)
- 0e86bbc docs: add requestTimeout to README features list (overcuriousity, Jan 31, 2026)
- 97976a6 Merge pull request #10 from overcuriousity/feat--web-config (overcuriousity, Jan 31, 2026)
- fc33fdf Merge pull request #11 from overcuriousity/feat--timeout (overcuriousity, Jan 31, 2026)
- 88f02d7 Merge branch 'new-features' into feat--conditional-rpc-healthcheck (overcuriousity, Jan 31, 2026)
- 7187493 Merge pull request #12 from overcuriousity/feat--conditional-rpc-heal… (overcuriousity, Jan 31, 2026)
- fe96ae4 proxy: improve RPC health check reliability and fix security issues (Jan 31, 2026)
- 79332e3 ui-svelte: improve Config editor dark mode styling (Jan 31, 2026)
- 26d7c89 proxy: fix stopCommand hang on startup timeout (Jan 31, 2026)
- 5f31c89 proxy/config: fix RPC endpoint parsing for Windows quoted args (Jan 31, 2026)
- 7ca1977 remove test config file (Jan 31, 2026)
- 9ab8bd8 Merge branch 'mostlygeek:main' into feat--web-config (overcuriousity, Jan 31, 2026)
- e762485 Merge branch 'mostlygeek:main' into feat--conditional-rpc-healthcheck (overcuriousity, Jan 31, 2026)
- 15a6aa7 Merge branch 'mostlygeek:main' into new-features (overcuriousity, Jan 31, 2026)
- 6c14013 proxy: fix data race and startup interrupt hang (Jan 31, 2026)
- 59db9f0 ui-svelte: fix Config editor compartment collision and error handling (Jan 31, 2026)
- 4e14a0d proxy: fix race conditions in Stop and test assertions (Jan 31, 2026)
- a502ebd proxy: fix Windows timeout command conflict (Jan 31, 2026)
- b733ee4 proxy: fix request timeout context handling (Jan 31, 2026)
- 60f599b Merge branch 'new-features' into feat--web-config (overcuriousity, Jan 31, 2026)
- 04a8886 Merge pull request #15 from overcuriousity/feat--web-config (overcuriousity, Jan 31, 2026)
- a7aa251 Merge pull request #14 from overcuriousity/feat--timeout (overcuriousity, Jan 31, 2026)
- f4fd37f Merge pull request #13 from overcuriousity/feat--conditional-rpc-heal… (overcuriousity, Jan 31, 2026)
- 960e78d Merge branch 'mostlygeek:main' into feat--timeout (overcuriousity, Jan 31, 2026)
- febbe97 proxy: ignore I/O timeout in RPC health checks (Jan 31, 2026)
- 79cf3df Merge pull request #16 from overcuriousity/feat--conditional-rpc-heal… (overcuriousity, Jan 31, 2026)
- 8e62ce1 ui-svelte: fix Config editor cursor jumping on input (Jan 31, 2026)
- ceeebbc Merge pull request #17 from overcuriousity/feat--web-config (overcuriousity, Jan 31, 2026)
- 7d68a64 Merge branch 'mostlygeek:main' into feat--timeout (overcuriousity, Feb 1, 2026)
- 256e576 Merge pull request #21 from mostlygeek/main (overcuriousity, Feb 3, 2026)
- 5c9069d Merge pull request #24 from overcuriousity/feat--timeout (overcuriousity, Feb 3, 2026)
- cbaa55d Merge PR #23: feat--conditional-rpc-healthcheck (Feb 3, 2026)
- f5e13d2 test: fix NewProcess calls to include context parameter (Feb 3, 2026)
- f7b80b4 Merge remote-tracking branch 'origin/feat--web-config' (overcuriousity, Feb 3, 2026)
- 842f966 fix: correct node_modules path in Makefile (ui-svelte not ui) (overcuriousity, Feb 3, 2026)
- 65761bd fix: make node_modules depend on package.json to trigger npm install … (overcuriousity, Feb 3, 2026)
- 427fe4e Merge upstream/main into new-features (overcuriousity, Mar 5, 2026)
- e295d15 Merge new-features into main (overcuriousity, Mar 5, 2026)
7 changes: 4 additions & 3 deletions Makefile
@@ -35,11 +35,12 @@ test: proxy/ui_dist/placeholder.txt
test-all: proxy/ui_dist/placeholder.txt
go test -race -count=1 ./proxy/...

ui/node_modules:
ui-svelte/node_modules: ui-svelte/package.json ui-svelte/package-lock.json
cd ui-svelte && npm install
touch ui-svelte/node_modules

# build react UI
ui: ui/node_modules
# build svelte UI
ui: ui-svelte/node_modules
cd ui-svelte && npm run build

# Build OSX binary
5 changes: 4 additions & 1 deletion README.md
@@ -42,9 +42,11 @@ Built in Go for performance and simplicity, llama-swap has zero dependencies and
- ✅ API Key support - define keys to restrict access to API endpoints
- ✅ Customizable
- Run multiple models at once with `Groups` ([#107](https://github.com/mostlygeek/llama-swap/issues/107))
- Automatic unloading of models after timeout by setting a `ttl`
- Automatic unloading of models after idle timeout by setting a `ttl`
- Request timeout protection with `requestTimeout` to prevent runaway inference
- Reliable Docker and Podman support using `cmd` and `cmdStop` together
- Preload models on startup with `hooks` ([#235](https://github.com/mostlygeek/llama-swap/pull/235))
- RPC health checking for distributed inference - conditionally expose models based on RPC server availability

### Web UI

@@ -189,6 +191,7 @@ Almost all configuration settings are optional and can be added one step at a ti
- `useModelName` to override model names sent to upstream servers
- `${PORT}` automatic port variables for dynamic port assignment
- `filters` rewrite parts of requests before sending to the upstream server
- `rpcHealthCheck` monitor RPC server health for distributed inference models

See the [configuration documentation](docs/configuration.md) for all options.

11 changes: 11 additions & 0 deletions config-schema.json
@@ -237,10 +237,21 @@
"type": "boolean",
"description": "Overrides the global sendLoadingState for this model. Omitting this property will use the global setting."
},
"requestTimeout": {
"type": "integer",
"minimum": 0,
"default": 0,
"description": "Maximum time in seconds for a single request to complete before forcefully killing the model process. This prevents runaway inference processes from blocking the GPU indefinitely. 0 disables timeout (default). When exceeded, the process is terminated and must be restarted for the next request."
},
"unlisted": {
"type": "boolean",
"default": false,
"description": "If true the model will not show up in /v1/models responses. It can still be used as normal in API requests."
},
"rpcHealthCheck": {
"type": "boolean",
"default": false,
"description": "Enable TCP health checks for RPC endpoints specified in cmd. When enabled, parses --rpc host:port[,host:port,...] from cmd and performs health checks every 30 seconds. Models with unhealthy RPC endpoints are filtered from /v1/models and return 503 on inference requests."
Copilot AI, Apr 1, 2026:

Schema description says RPC health checks run "every 30 seconds", but the current implementation introduced in proxy/process.go uses a 10-second ticker. Please update the schema description (or the code) so they match.

Suggested change:
- "description": "Enable TCP health checks for RPC endpoints specified in cmd. When enabled, parses --rpc host:port[,host:port,...] from cmd and performs health checks every 30 seconds. Models with unhealthy RPC endpoints are filtered from /v1/models and return 503 on inference requests."
+ "description": "Enable TCP health checks for RPC endpoints specified in cmd. When enabled, parses --rpc host:port[,host:port,...] from cmd and performs health checks every 10 seconds. Models with unhealthy RPC endpoints are filtered from /v1/models and return 503 on inference requests."
}
}
}
28 changes: 28 additions & 0 deletions config.example.yaml
@@ -280,6 +280,16 @@ models:
# - recommended to be omitted and the default used
concurrencyLimit: 0

# requestTimeout: maximum time in seconds for a single request to complete
# - optional, default: 0 (no timeout)
# - useful for preventing runaway inference processes that never complete
# - when exceeded, the model process is forcefully stopped
# - protects against GPU overheating and blocking from stuck processes
# - the process must be restarted for the next request
# - set to 0 to disable timeout
# - recommended for models that may have infinite loops or excessive generation
requestTimeout: 0 # disabled by default, set to e.g., 300 for 5 minutes

# sendLoadingState: overrides the global sendLoadingState setting for this model
# - optional, default: undefined (use global setting)
sendLoadingState: false
@@ -293,6 +303,24 @@
unlisted: true
cmd: llama-server --port ${PORT} -m Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 0

# RPC health check example for distributed inference:
"qwen-distributed":
# rpcHealthCheck: enable TCP health checks for RPC endpoints
# - optional, default: false
# - when enabled, parses --rpc host:port[,host:port,...] from cmd
# - performs TCP connectivity checks every 30 seconds
Copilot AI, Apr 1, 2026:

This comment says RPC connectivity checks run "every 30 seconds", but the implementation introduced in proxy/process.go runs the ticker every 10 seconds. Please align the example documentation with the actual behavior (or make the interval configurable).

Suggested change:
- # - performs TCP connectivity checks every 30 seconds
+ # - performs TCP connectivity checks every 10 seconds
# - model is only listed in /v1/models when ALL RPC endpoints are healthy
# - inference requests to unhealthy models return HTTP 503
# - useful for distributed inference with llama.cpp's rpc-server
rpcHealthCheck: true
cmd: |
llama-server --port ${PORT}
--rpc 192.168.1.10:50051,192.168.1.11:50051
-m Qwen2.5-32B-Instruct-Q4_K_M.gguf
-ngl 99
name: "Qwen 32B (Distributed)"
description: "Large model using distributed RPC inference"

# Docker example:
# container runtimes like Docker and Podman can be used reliably with
# a combination of cmd, cmdStop, and ${MODEL_ID}
14 changes: 14 additions & 0 deletions config_embed.go
@@ -0,0 +1,14 @@
package main

import (
"bytes"
_ "embed"
)

//go:embed config.example.yaml
var configExampleYAML []byte

// GetConfigExampleYAML returns the embedded example config file
func GetConfigExampleYAML() []byte {
return bytes.Clone(configExampleYAML)
}
31 changes: 21 additions & 10 deletions docs/configuration.md
@@ -72,16 +72,17 @@ models:

llama-swap supports many more features to customize how you want to manage your environment.

| Feature | Description |
| --------- | ---------------------------------------------- |
| `ttl` | automatic unloading of models after a timeout |
| `macros` | reusable snippets to use in configurations |
| `groups` | run multiple models at a time |
| `hooks` | event driven functionality |
| `env` | define environment variables per model |
| `aliases` | serve a model with different names |
| `filters` | modify requests before sending to the upstream |
| `...` | And many more tweaks |
| Feature | Description |
| ----------------- | ------------------------------------------------------- |
| `ttl` | automatic unloading of models after a timeout |
| `macros` | reusable snippets to use in configurations |
| `groups` | run multiple models at a time |
| `hooks` | event driven functionality |
| `env` | define environment variables per model |
| `aliases` | serve a model with different names |
| `filters` | modify requests before sending to the upstream |
| `rpcHealthCheck` | monitor RPC server health for distributed inference |
| `...` | And many more tweaks |

## Full Configuration Example

@@ -319,6 +320,16 @@ models:
# - recommended to be omitted and the default used
concurrencyLimit: 0

# requestTimeout: maximum time in seconds for a single request to complete
# - optional, default: 0 (no timeout)
# - useful for preventing runaway inference processes that never complete
# - when exceeded, the model process is forcefully stopped
# - protects against GPU overheating and blocking from stuck processes
# - the process must be restarted for the next request
# - set to 0 to disable timeout
# - recommended for models that may have infinite loops or excessive generation
requestTimeout: 300 # 5 minutes

# sendLoadingState: overrides the global sendLoadingState setting for this model
# - optional, default: undefined (use global setting)
sendLoadingState: false
18 changes: 12 additions & 6 deletions llama-swap.go
@@ -97,6 +97,8 @@ func main() {
currentPM.Shutdown()
newPM := proxy.New(conf)
newPM.SetVersion(date, commit, version)
newPM.SetConfigPath(*configPath)
newPM.SetConfigExample(GetConfigExampleYAML())
srv.Handler = newPM
fmt.Println("Configuration Reloaded")

@@ -114,20 +116,24 @@
}
newPM := proxy.New(conf)
newPM.SetVersion(date, commit, version)
newPM.SetConfigPath(*configPath)
newPM.SetConfigExample(GetConfigExampleYAML())
srv.Handler = newPM
}
}

// load the initial proxy manager
reloadProxyManager()
debouncedReload := debounce(time.Second, reloadProxyManager)
if *watchConfig {
defer event.On(func(e proxy.ConfigFileChangedEvent) {
if e.ReloadingState == proxy.ReloadingStateStart {
debouncedReload()
}
})()

// Always listen for API-triggered config changes
defer event.On(func(e proxy.ConfigFileChangedEvent) {
if e.ReloadingState == proxy.ReloadingStateStart {
debouncedReload()
}
})()

if *watchConfig {
fmt.Println("Watching Configuration for changes")
go func() {
absConfigPath, err := filepath.Abs(*configPath)
65 changes: 65 additions & 0 deletions proxy/config/config.go
@@ -3,6 +3,7 @@ package config
import (
"fmt"
"io"
"net"
"net/url"
"os"
"regexp"
@@ -596,6 +597,70 @@ func SanitizeCommand(cmdStr string) ([]string, error) {
return args, nil
}

// ParseRPCEndpoints extracts RPC endpoints from command string
// Handles: --rpc host:port,host2:port2 or --rpc=host:port or -rpc host:port
func ParseRPCEndpoints(cmdStr string) ([]string, error) {
args, err := SanitizeCommand(cmdStr)
if err != nil {
return nil, err
}

var endpoints []string
for i, arg := range args {
if arg == "--rpc" || arg == "-rpc" {
// Collect all non-flag arguments after --rpc
// This handles Windows where shlex splits single-quoted strings with spaces
var parts []string
for j := i + 1; j < len(args) && !strings.HasPrefix(args[j], "-"); j++ {
parts = append(parts, args[j])
}
if len(parts) > 0 {
// Join parts with space and parse as a single endpoint list
endpoints = parseEndpointList(strings.Join(parts, " "))
}
} else if strings.HasPrefix(arg, "--rpc=") {
endpoints = parseEndpointList(strings.TrimPrefix(arg, "--rpc="))
} else if strings.HasPrefix(arg, "-rpc=") {
endpoints = parseEndpointList(strings.TrimPrefix(arg, "-rpc="))
}
}

// Validate each endpoint
for _, ep := range endpoints {
if _, _, err := net.SplitHostPort(ep); err != nil {
return nil, fmt.Errorf("invalid RPC endpoint %q: %w", ep, err)
}
}

return endpoints, nil
}

func parseEndpointList(s string) []string {
s = strings.TrimSpace(s)

// Strip surrounding quotes (both single and double) from the whole string
// if they match. This handles cases like: "host:port,host2:port2"
if len(s) >= 2 {
if (s[0] == '\'' && s[len(s)-1] == '\'') || (s[0] == '"' && s[len(s)-1] == '"') {
s = s[1 : len(s)-1]
}
}

parts := strings.Split(s, ",")
var result []string
for _, p := range parts {
p = strings.TrimSpace(p)
// Strip any remaining leading/trailing quotes from individual parts
// This handles Windows where shlex doesn't handle single quotes and
// may split 'host:port, host2:port' into "'host:port," and "host2:port'"
p = strings.Trim(p, "'\"")
if p != "" {
result = append(result, p)
}
}
return result
}

func StripComments(cmdStr string) string {
var cleanedLines []string
for _, line := range strings.Split(cmdStr, "\n") {
105 changes: 105 additions & 0 deletions proxy/config/config_test.go
@@ -1438,3 +1438,108 @@ models:
})

}

func TestParseRPCEndpoints_ValidFormats(t *testing.T) {
tests := []struct {
name string
cmd string
expected []string
}{
{
name: "single endpoint with --rpc",
cmd: "llama-server --rpc localhost:50051 -ngl 99",
expected: []string{"localhost:50051"},
},
{
name: "single endpoint with --rpc=",
cmd: "llama-server --rpc=192.168.1.100:50051 -ngl 99",
expected: []string{"192.168.1.100:50051"},
},
{
name: "single endpoint with -rpc",
cmd: "llama-server -rpc localhost:50051 -ngl 99",
expected: []string{"localhost:50051"},
},
{
name: "single endpoint with -rpc=",
cmd: "llama-server -rpc=localhost:50051 -ngl 99",
expected: []string{"localhost:50051"},
},
{
name: "multiple endpoints comma-separated",
cmd: "llama-server --rpc 192.168.1.10:50051,192.168.1.11:50051 -ngl 99",
expected: []string{"192.168.1.10:50051", "192.168.1.11:50051"},
},
{
name: "multiple endpoints with spaces trimmed",
cmd: "llama-server --rpc '192.168.1.10:50051, 192.168.1.11:50051' -ngl 99",
expected: []string{"192.168.1.10:50051", "192.168.1.11:50051"},
},
{
name: "IPv6 endpoint",
cmd: "llama-server --rpc [::1]:50051 -ngl 99",
expected: []string{"[::1]:50051"},
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
endpoints, err := ParseRPCEndpoints(tt.cmd)
assert.NoError(t, err)
assert.Equal(t, tt.expected, endpoints)
})
}
}

func TestParseRPCEndpoints_NoRPCFlag(t *testing.T) {
cmd := "llama-server -ngl 99 -m model.gguf"
endpoints, err := ParseRPCEndpoints(cmd)
assert.NoError(t, err)
assert.Empty(t, endpoints)
}

func TestParseRPCEndpoints_InvalidFormats(t *testing.T) {
tests := []struct {
name string
cmd string
wantErr string
}{
{
name: "missing port",
cmd: "llama-server --rpc localhost -ngl 99",
wantErr: "invalid RPC endpoint",
},
{
name: "invalid host:port format",
cmd: "llama-server --rpc not-a-valid-endpoint -ngl 99",
wantErr: "invalid RPC endpoint",
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
_, err := ParseRPCEndpoints(tt.cmd)
assert.Error(t, err)
assert.Contains(t, err.Error(), tt.wantErr)
})
}
}

func TestParseRPCEndpoints_EmptyEndpointsFiltered(t *testing.T) {
// Empty strings after commas are filtered out
cmd := "llama-server --rpc 'localhost:50051,,' -ngl 99"
endpoints, err := ParseRPCEndpoints(cmd)
assert.NoError(t, err)
assert.Equal(t, []string{"localhost:50051"}, endpoints)
}

func TestParseRPCEndpoints_MultilineCommand(t *testing.T) {
cmd := `llama-server \
--rpc localhost:50051 \
-ngl 99 \
-m model.gguf`

endpoints, err := ParseRPCEndpoints(cmd)
assert.NoError(t, err)
assert.Equal(t, []string{"localhost:50051"}, endpoints)
}