Merged
Changes from all commits
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug-report.md
@@ -7,7 +7,7 @@ assignees: ''

---
> [!IMPORTANT]
- > If you have questions about llama-swap please post in the Q&A in Discussions. Use bug reports when you've found a defect and wish to discuss a fix.
+ > If you have questions about llmsnap please post in the Q&A in Discussions. Use bug reports when you've found a defect and wish to discuss a fix.

**Describe the bug**
A clear and concise description of what the bug is.
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -69,7 +69,7 @@ jobs:
uses: peter-evans/repository-dispatch@v2
with:
token: ${{ secrets.TAP_REPO_PAT }}
- repository: mostlygeek/homebrew-llama-swap
+ repository: napmany/homebrew-llmsnap
@coderabbitai (bot) commented on Nov 21, 2025:

⚠️ Potential issue | 🔴 Critical

❓ Verification inconclusive

Verify the new Homebrew tap repository exists and is accessible.

The repository reference has been updated from mostlygeek/homebrew-llama-swap to napmany/homebrew-llmsnap. Ensure that:

  1. The new repository exists at napmany/homebrew-llmsnap
  2. The TAP_REPO_PAT secret has appropriate permissions for the new repository
  3. The repository is properly configured to receive repository dispatch events

Run the following script to verify the repository exists:


🏁 Script executed:

```shell
#!/bin/bash
# Description: Verify the new Homebrew tap repository exists

# Check if the repository exists and is accessible
gh repo view napmany/homebrew-llmsnap --json name,owner,visibility
```

Length of output: 97


The new Homebrew tap repository does not exist and must be created or the reference must be corrected.

The repository napmany/homebrew-llmsnap referenced at line 72 cannot be found on GitHub. The release workflow will fail when attempting to deploy to this non-existent repository. Either:

  1. Create the repository at napmany/homebrew-llmsnap with proper access configuration before merging this change, or
  2. Update the repository reference to point to an existing Homebrew tap repository that is properly configured to receive releases
🤖 Prompt for AI Agents

```
.github/workflows/release.yml around line 72: the workflow references a
non-existent Homebrew tap repo "napmany/homebrew-llmsnap", which will cause
releases to fail; either create the GitHub repository "napmany/homebrew-llmsnap"
with appropriate write access for the workflow runner, or change the repository
field to an existing, correctly configured Homebrew tap (e.g.,
org/user/existing-tap) and ensure the workflow has permission (PAT or
actions/checkout/write) to push there before merging.
```

@napmany (Owner, Author) replied:

napmany/homebrew-llmsnap repo is on the way


@napmany, great! Thanks for confirming that you're setting up the napmany/homebrew-llmsnap repository.

Just a reminder: once the repository is created, please make sure that the TAP_REPO_PAT secret in this repository has the necessary permissions to dispatch events to the new Homebrew tap repository. The workflow will need write access to trigger the tap update successfully.

event-type: new-release
client-payload: |
{
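For context, this `repository-dispatch` sender only works if the tap repository has a workflow listening for the same event type. A minimal sketch of what napmany/homebrew-llmsnap would need (the file name and steps are assumptions, not part of this PR):

```yaml
# .github/workflows/on-release.yml in the tap repo (hypothetical)
on:
  repository_dispatch:
    types: [new-release]   # must match the sender's event-type

jobs:
  update-formula:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # the sender's client-payload is available under github.event.client_payload
      - run: echo 'payload=${{ toJson(github.event.client_payload) }}'
```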
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,3 +5,5 @@ dist/
.vscode
.DS_Store
.dev/
+ config*.yaml
+ !config.example.yaml
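The new rules ignore every `config*.yaml` while re-including the example file via the `!` negation. A quick sanity check in a scratch repository (file names are illustrative):

```shell
# Create a throwaway repo and reproduce the two new .gitignore rules
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
printf 'config*.yaml\n!config.example.yaml\n' > .gitignore
touch config.yaml config.local.yaml config.example.yaml

git check-ignore config.yaml config.local.yaml   # both are ignored
git check-ignore config.example.yaml || echo "config.example.yaml stays tracked"
```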
4 changes: 2 additions & 2 deletions CLAUDE.md
@@ -1,8 +1,8 @@
- # Project: llama-swap
+ # Project: llmsnap

## Project Description:

- llama-swap is a light weight, transparent proxy server that provides automatic model swapping to llama.cpp's server.
+ llmsnap is a light weight, transparent proxy server that provides automatic model swapping to vllm, llama.cpp and other OpenAI compatible servers.

## Tech stack

2 changes: 1 addition & 1 deletion Makefile
@@ -1,5 +1,5 @@
# Define variables for the application
- APP_NAME = llama-swap
+ APP_NAME = llmsnap
BUILD_DIR = build

# Get the current Git hash
91 changes: 39 additions & 52 deletions README.md
@@ -1,13 +1,14 @@
- ![llama-swap header image](header2.png)
- ![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/mostlygeek/llama-swap/total)
- ![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/mostlygeek/llama-swap/go-ci.yml)
- ![GitHub Repo stars](https://img.shields.io/github/stars/mostlygeek/llama-swap)
+ <!-- TODO: Header image needs redesign with llmsnap branding -->
+ ![llmsnap header image](header.jpeg)
+ ![GitHub Downloads (all assets, all releases)](https://img.shields.io/github/downloads/napmany/llmsnap/total)
+ ![GitHub Actions Workflow Status](https://img.shields.io/github/actions/workflow/status/napmany/llmsnap/go-ci.yml)
+ ![GitHub Repo stars](https://img.shields.io/github/stars/napmany/llmsnap)

- # llama-swap
+ # llmsnap

- Run multiple LLM models on your machine and hot-swap between them as needed. llama-swap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.
+ Run multiple LLM models on your machine and hot-swap between them as needed. llmsnap works with any OpenAI API-compatible server, giving you the flexibility to switch models without restarting your applications.

- Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.
+ Built in Go for performance and simplicity, llmsnap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.

## Features:

@@ -25,7 +26,7 @@ Built in Go for performance and simplicity, llmsnap has zero dependencies and
- `v1/rerank`, `v1/reranking`, `/rerank`
- `/infill` - for code infilling
- `/completion` - for completion endpoint
- - ✅ llama-swap API
+ - ✅ llmsnap API
- `/ui` - web UI
- `/upstream/:model_id` - direct access to upstream server ([demo](https://github.com/mostlygeek/llama-swap/pull/31))
- `/models/unload` - manually unload running models ([#58](https://github.com/mostlygeek/llama-swap/issues/58))
@@ -42,7 +43,7 @@ Built in Go for performance and simplicity, llmsnap has zero dependencies and

### Web UI

- llama-swap includes a real time web interface for monitoring logs and controlling models:
+ llmsnap includes a real time web interface for monitoring logs and controlling models:

<img width="1164" height="745" alt="image" src="https://github.com/user-attachments/assets/bacf3f9d-819f-430b-9ed2-1bfaa8d54579" />

Expand All @@ -53,26 +54,25 @@ The Activity Page shows recent requests:

## Installation

- llama-swap can be installed in multiple ways
+ llmsnap can be installed in multiple ways

1. Docker
2. Homebrew (OSX and Linux)
- 3. WinGet
- 4. From release binaries
- 5. From source
+ 3. From release binaries
+ 4. From source

- ### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
+ ### Docker Install ([download images](https://github.com/napmany/llmsnap/pkgs/container/llmsnap))

- Nightly container images with llama-swap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc).
+ Nightly container images with llmsnap and llama-server are built for multiple platforms (cuda, vulkan, intel, etc).

```shell
- $ docker pull ghcr.io/mostlygeek/llama-swap:cuda
+ $ docker pull ghcr.io/napmany/llmsnap:cuda

# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/app/config.yaml \
- ghcr.io/mostlygeek/llama-swap:cuda
+ ghcr.io/napmany/llmsnap:cuda
```

<details>
@@ -82,14 +82,14 @@ more examples

```shell
# pull latest images per platform
- docker pull ghcr.io/mostlygeek/llama-swap:cpu
- docker pull ghcr.io/mostlygeek/llama-swap:cuda
- docker pull ghcr.io/mostlygeek/llama-swap:vulkan
- docker pull ghcr.io/mostlygeek/llama-swap:intel
- docker pull ghcr.io/mostlygeek/llama-swap:musa
+ docker pull ghcr.io/napmany/llmsnap:cpu
+ docker pull ghcr.io/napmany/llmsnap:cuda
+ docker pull ghcr.io/napmany/llmsnap:vulkan
+ docker pull ghcr.io/napmany/llmsnap:intel
+ docker pull ghcr.io/napmany/llmsnap:musa

- # tagged llama-swap, platform and llama-server version images
- docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
+ # tagged llmsnap, platform and llama-server version images
+ docker pull ghcr.io/napmany/llmsnap:v166-cuda-b6795

```

@@ -98,34 +98,21 @@ docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
### Homebrew Install (macOS/Linux)

```shell
- brew tap mostlygeek/llama-swap
- brew install llama-swap
- llama-swap --config path/to/config.yaml --listen localhost:8080
- ```
-
- ### WinGet Install (Windows)
-
- > [!NOTE]
- > WinGet is maintained by community contributor [Dvd-Znf](https://github.com/Dvd-Znf) ([#327](https://github.com/mostlygeek/llama-swap/issues/327)). It is not an official part of llama-swap.
-
- ```shell
- # install
- C:\> winget install llama-swap
-
- # upgrade
- C:\> winget upgrade llama-swap
+ brew tap napmany/llmsnap
+ brew install llmsnap
+ llmsnap --config path/to/config.yaml --listen localhost:8080
```

### Pre-built Binaries

- Binaries are available on the [release](https://github.com/mostlygeek/llama-swap/releases) page for Linux, Mac, Windows and FreeBSD.
+ Binaries are available on the [release](https://github.com/napmany/llmsnap/releases) page for Linux, Mac, Windows and FreeBSD.

### Building from source

1. Building requires Go and Node.js (for UI).
- 1. `git clone https://github.com/mostlygeek/llama-swap.git`
+ 1. `git clone https://github.com/napmany/llmsnap.git`
1. `make clean all`
- 1. look in the `build/` subdirectory for the llama-swap binary
+ 1. look in the `build/` subdirectory for the llmsnap binary

## Configuration

@@ -161,35 +148,35 @@ Almost all configuration settings are optional and can be added one step at a ti

See the [configuration documentation](docs/configuration.md) for all options.

- ## How does llama-swap work?
+ ## How does llmsnap work?

- When a request is made to an OpenAI compatible endpoint, llama-swap will extract the `model` value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to handle the request correctly.
+ When a request is made to an OpenAI compatible endpoint, llmsnap will extract the `model` value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to handle the request correctly.

- In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, the `groups` feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
+ In the most basic configuration llmsnap handles one model at a time. For more advanced use cases, the `groups` feature allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
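The swap-vs-group behavior described here follows directly from the config file: each `models` entry maps a model name to the command that serves it, and a `groups` entry can opt members out of swapping. A minimal sketch (model names, paths, and the group name are illustrative, not from this PR):

```yaml
models:
  "llama":
    cmd: llama-server --port ${PORT} -m /models/llama-3.1-8b.gguf
  "qwen":
    cmd: llama-server --port ${PORT} -m /models/qwen2.5-7b.gguf

# optional: keep both models resident instead of swapping between them
groups:
  pair:
    swap: false
    members: ["llama", "qwen"]
```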

## Reverse Proxy Configuration (nginx)

- If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks Server‑Sent Events (SSE) and streaming chat completion. ([#236](https://github.com/mostlygeek/llama-swap/issues/236))
+ If you deploy llmsnap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks Server‑Sent Events (SSE) and streaming chat completion. ([#236](https://github.com/mostlygeek/llama-swap/issues/236))

Recommended nginx configuration snippets:

```nginx
# SSE for UI events/logs
location /api/events {
- proxy_pass http://your-llama-swap-backend;
+ proxy_pass http://your-llmsnap-backend;
proxy_buffering off;
proxy_cache off;
}

# Streaming chat completions (stream=true)
location /v1/chat/completions {
- proxy_pass http://your-llama-swap-backend;
+ proxy_pass http://your-llmsnap-backend;
proxy_buffering off;
proxy_cache off;
}
```

- As a safeguard, llama-swap also sets `X-Accel-Buffering: no` on SSE responses. However, explicitly disabling `proxy_buffering` at your reverse proxy is still recommended for reliable streaming behavior.
+ As a safeguard, llmsnap also sets `X-Accel-Buffering: no` on SSE responses. However, explicitly disabling `proxy_buffering` at your reverse proxy is still recommended for reliable streaming behavior.

## Monitoring Logs on the CLI

@@ -215,7 +202,7 @@ curl -Ns 'http://host/logs/stream?no-history'

## Do I need to use llama.cpp's server (llama-server)?

- Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.
+ Any OpenAI compatible server would work.

For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation as well as responding correctly to `SIGTERM` signals for proper shutdown.

@@ -224,4 +211,4 @@ For Python based inference servers like vllm or tabbyAPI it is recommended to ru
> [!NOTE]
> ⭐️ Star this project to help others discover it!

- [![Star History Chart](https://api.star-history.com/svg?repos=mostlygeek/llama-swap&type=Date)](https://www.star-history.com/#mostlygeek/llama-swap&Date)
+ [![Star History Chart](https://api.star-history.com/svg?repos=napmany/llmsnap&type=Date)](https://www.star-history.com/#napmany/llmsnap&Date)
22 changes: 11 additions & 11 deletions ai-plans/issue-264-add-metadata.md
@@ -85,8 +85,8 @@ The metadata will be schemaless, allowing users to define any key-value pairs th

**Required Changes:**

- - Add metadata to each model record under the key `llamaswap_meta`
- - Only include `llamaswap_meta` if metadata is non-empty
+ - Add metadata to each model record under the key `llmsnap_meta`
+ - Only include `llmsnap_meta` if metadata is non-empty
- Preserve all types when marshaling to JSON
- Maintain existing sorting by model ID

@@ -100,10 +100,10 @@ The metadata will be schemaless, allowing users to define any key-value pairs th
"id": "llama",
"object": "model",
"created": 1234567890,
- "owned_by": "llama-swap",
+ "owned_by": "llmsnap",
"name": "llama 3.1 8B",
"description": "A small but capable model",
- "llamaswap_meta": {
+ "llmsnap_meta": {
"port": 10001,
"temperature": 0.7,
"note": "The llama is running on port 10001 temp=0.7, context=16384",
@@ -180,8 +180,8 @@ The metadata will be schemaless, allowing users to define any key-value pairs th

**Test Cases:**

- - Model with metadata → verify `llamaswap_meta` key appears
- - Model without metadata → verify `llamaswap_meta` key is absent
+ - Model with metadata → verify `llmsnap_meta` key appears
+ - Model without metadata → verify `llmsnap_meta` key is absent
- Verify all types are correctly marshaled to JSON
- Verify nested structures are preserved
- Verify macro substitution has occurred before serialization
@@ -230,8 +230,8 @@ The metadata will be schemaless, allowing users to define any key-value pairs th
### API Response Changes

- [x] Modify `listModelsHandler()` in [proxy/proxymanager.go:350](proxy/proxymanager.go#L350)
- - [x] Add `llamaswap_meta` field to model records when metadata exists
- - [x] Ensure empty metadata results in omitted `llamaswap_meta` key
+ - [x] Add `llmsnap_meta` field to model records when metadata exists
+ - [x] Ensure empty metadata results in omitted `llmsnap_meta` key
- [x] Verify JSON marshaling preserves all types correctly

### Testing - Config Package
Expand All @@ -257,7 +257,7 @@ The metadata will be schemaless, allowing users to define any key-value pairs th
- [x] Update `TestProxyManager_ListModelsHandler` in [proxy/proxymanager_test.go](proxy/proxymanager_test.go)
- [x] Add test case for model with metadata
- [x] Add test case for model without metadata
- - [x] Verify `llamaswap_meta` key presence/absence
+ - [x] Verify `llmsnap_meta` key presence/absence
- [x] Verify type preservation in JSON output
- [x] Verify macro substitution has occurred

Expand All @@ -274,10 +274,10 @@ None identified. The plan references the correct existing example in [config.exa

### Design Decisions

- 1. **Why `llamaswap_meta` instead of merging into record?**
+ 1. **Why `llmsnap_meta` instead of merging into record?**

- Avoids potential collisions with OpenAI API standard fields
- - Makes it clear this is llama-swap specific metadata
+ - Makes it clear this is llmsnap specific metadata
- Easier for clients to distinguish standard vs. custom fields

2. **Why support nested structures?**
2 changes: 1 addition & 1 deletion cmd/misc/benchmark-chatcompletion/main.go
@@ -1,7 +1,7 @@
package main

// created for issue: #252 https://github.com/mostlygeek/llama-swap/issues/252
- // this simple benchmark tool sends a lot of small chat completion requests to llama-swap
+ // this simple benchmark tool sends a lot of small chat completion requests to llmsnap
// to make sure all the requests are accounted for.
//
// requests can be sent in parallel, and the tool will report the results.
2 changes: 1 addition & 1 deletion cmd/simple-responder/simple-responder.go
@@ -123,7 +123,7 @@ func main() {
})

// for issue #62 to check model name strips profile slug
- // has to be one of the openAI API endpoints that llama-swap proxies
+ // has to be one of the openAI API endpoints that llmsnap proxies
// curl http://localhost:8080/v1/audio/speech -d '{"model":"profile:TheExpectedModel"}'
r.POST("/v1/audio/speech", func(c *gin.Context) {
body, err := io.ReadAll(c.Request.Body)
4 changes: 2 additions & 2 deletions cmd/wol-proxy/README.md
@@ -1,8 +1,8 @@
# wol-proxy

- wol-proxy automatically wakes up a suspended llama-swap server using Wake-on-LAN when requests are received.
+ wol-proxy automatically wakes up a suspended llmsnap server using Wake-on-LAN when requests are received.

- When a request arrives and llama-swap is unavailable, wol-proxy sends a WOL packet and holds the request until the server becomes available. If the server doesn't respond within the timeout period (default: 60 seconds), the request is dropped.
+ When a request arrives and llmsnap is unavailable, wol-proxy sends a WOL packet and holds the request until the server becomes available. If the server doesn't respond within the timeout period (default: 60 seconds), the request is dropped.

This utility helps conserve energy by allowing GPU-heavy servers to remain suspended when idle, as they can consume hundreds of watts even when not actively processing requests.

8 changes: 4 additions & 4 deletions config-schema.json
@@ -1,8 +1,8 @@
{
"$schema": "https://json-schema.org/draft-07/schema#",
- "$id": "llama-swap-config-schema.json",
- "title": "llama-swap configuration",
- "description": "Configuration file for llama-swap",
+ "$id": "llmsnap-config-schema.json",
+ "title": "llmsnap configuration",
+ "description": "Configuration file for llmsnap",
"type": "object",
"required": [
"models"
@@ -164,7 +164,7 @@
"type": "string",
"default": "http://localhost:${PORT}",
"format": "uri",
- "description": "URL where llama-swap routes API requests. If custom port is used in cmd, this must be set."
+ "description": "URL where llmsnap routes API requests. If custom port is used in cmd, this must be set."
},
"aliases": {
"type": "array",