Commit 8d319da

improve README organization (i think...)
1 parent be7c502 commit 8d319da

1 file changed: README.md (+64, -60)
@@ -4,7 +4,7 @@

 llama-swap is a lightweight, transparent proxy server that provides automatic model swapping to llama.cpp's server.

-Written in golang, it is very easy to install (single binary with no dependencies) and configure (single yaml file).
+Written in golang, it is very easy to install (single binary with no dependencies) and configure (single yaml file). To get started, download a pre-built binary or use the provided docker images.

 ## Features:

@@ -26,69 +26,12 @@ Written in golang, it is very easy to install (single binary with no dependancie
 - ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, etc)
 - ✅ Direct access to upstream HTTP server via `/upstream/:model_id` ([demo](https://github.com/mostlygeek/llama-swap/pull/31))

-## Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
-
-Docker is the quickest way to try out llama-swap:
-
-```
-# use CPU inference
-$ docker run -it --rm -p 9292:8080 ghcr.io/mostlygeek/llama-swap:cpu
-
-
-# qwen2.5 0.5B
-$ curl -s http://localhost:9292/v1/chat/completions \
--H "Content-Type: application/json" \
--H "Authorization: Bearer no-key" \
--d '{"model":"qwen2.5","messages": [{"role": "user","content": "tell me a joke"}]}' | \
-jq -r '.choices[0].message.content'
-
-
-# SmolLM2 135M
-$ curl -s http://localhost:9292/v1/chat/completions \
--H "Content-Type: application/json" \
--H "Authorization: Bearer no-key" \
--d '{"model":"smollm2","messages": [{"role": "user","content": "tell me a joke"}]}' | \
-jq -r '.choices[0].message.content'
-```
-
-Docker images are [published nightly](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap) that include the latest llama-swap and llama-server:
-
-- `ghcr.io/mostlygeek/llama-swap:cpu`
-- `ghcr.io/mostlygeek/llama-swap:cuda`
-- `ghcr.io/mostlygeek/llama-swap:intel`
-- `ghcr.io/mostlygeek/llama-swap:vulkan`
-- ROCm disabled until fixed in llama.cpp container
-- musa disabled until requested.
-
-Specific versions are also available and are tagged with the llama-swap, architecture and llama.cpp versions. For example: `ghcr.io/mostlygeek/llama-swap:v89-cuda-b4716`
-
-Beyond the demo you will likely want to run the containers with your downloaded models and custom configuration.
-
-```
-$ docker run -it --rm --runtime nvidia -p 9292:8080 \
--v /path/to/models:/models \
--v /path/to/custom/config.yaml:/app/config.yaml \
-ghcr.io/mostlygeek/llama-swap:cuda
-```
-
-## Bare metal Install ([download](https://github.com/mostlygeek/llama-swap/releases))
-
-Pre-built binaries are available for Linux, FreeBSD and Darwin (OSX). These are automatically published and are likely a few hours ahead of the docker releases. The baremetal install works with any OpenAI compatible server, not just llama-server.
-
-You can also build llama-swap yourself from source with `make clean all`.
-
 ## How does llama-swap work?

 When a request is made to an OpenAI compatible endpoint, llama-swap will extract the `model` value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to the correct one to serve the request.

 In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, the `profiles` feature can load multiple models at the same time. You have complete control over how your system resources are used.

-## Do I need to use llama.cpp's server (llama-server)?
-
-Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.
-
-For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation as well as responding correctly to `SIGTERM` signals to shut down.
-
 ## config.yaml

 llama-swap's configuration is purposefully simple.
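
The `/upstream/:model_id` route listed in the hunk above gives direct access to whichever upstream server is serving that model. A minimal sketch, assuming the CPU docker demo from this diff is running on port 9292 and that the remainder of the path is forwarded to the upstream llama-server (whose `/health` endpoint is used here):

```
# assumes the docker demo above is listening on port 9292, and that the path
# after the model id is passed through to the upstream llama-server
$ curl -s http://localhost:9292/upstream/qwen2.5/health
```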
@@ -110,7 +53,8 @@ models:
 --port 9999
 ```

-But can grow to specific use cases:
+<details>
+<summary>But also very powerful ...</summary>

 ```yaml
 # Seconds to wait for llama.cpp to load and be ready to serve requests
@@ -188,7 +132,61 @@ profiles:
 - [Speculative Decoding](examples/speculative-decoding/README.md) - using a small draft model can increase inference speeds from 20% to 40%. This example includes configurations for Qwen2.5-Coder-32B (2.5x increase) and Llama-3.1-70B (1.4x increase) in the best cases.
 - [Optimizing Code Generation](examples/benchmark-snakegame/README.md) - find the optimal settings for your machine. This example demonstrates defining multiple configurations and testing which one is fastest.

-### Installation
+</details>
+
+## Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
+
+Docker is the quickest way to try out llama-swap:
+
+```
+# use CPU inference
+$ docker run -it --rm -p 9292:8080 ghcr.io/mostlygeek/llama-swap:cpu
+
+
+# qwen2.5 0.5B
+$ curl -s http://localhost:9292/v1/chat/completions \
+-H "Content-Type: application/json" \
+-H "Authorization: Bearer no-key" \
+-d '{"model":"qwen2.5","messages": [{"role": "user","content": "tell me a joke"}]}' | \
+jq -r '.choices[0].message.content'
+
+
+# SmolLM2 135M
+$ curl -s http://localhost:9292/v1/chat/completions \
+-H "Content-Type: application/json" \
+-H "Authorization: Bearer no-key" \
+-d '{"model":"smollm2","messages": [{"role": "user","content": "tell me a joke"}]}' | \
+jq -r '.choices[0].message.content'
+```
+
+<details>
+<summary>Docker images are nightly ...</summary>
+
+They include:
+
+- `ghcr.io/mostlygeek/llama-swap:cpu`
+- `ghcr.io/mostlygeek/llama-swap:cuda`
+- `ghcr.io/mostlygeek/llama-swap:intel`
+- `ghcr.io/mostlygeek/llama-swap:vulkan`
+- ROCm disabled until fixed in llama.cpp container
+- musa disabled until requested.
+
+Specific versions are also available and are tagged with the llama-swap, architecture and llama.cpp versions. For example: `ghcr.io/mostlygeek/llama-swap:v89-cuda-b4716`
+
+Beyond the demo you will likely want to run the containers with your downloaded models and custom configuration.
+
+```
+$ docker run -it --rm --runtime nvidia -p 9292:8080 \
+-v /path/to/models:/models \
+-v /path/to/custom/config.yaml:/app/config.yaml \
+ghcr.io/mostlygeek/llama-swap:cuda
+```
+
+</details>
+
+## Bare metal Install ([download](https://github.com/mostlygeek/llama-swap/releases))
+
+Pre-built binaries are available for Linux, FreeBSD and Darwin (OSX). These are automatically published and are likely a few hours ahead of the docker releases. The baremetal install works with any OpenAI compatible server, not just llama-server.

 1. Create a configuration file, see [config.example.yaml](config.example.yaml)
 1. Download a [release](https://github.com/mostlygeek/llama-swap/releases) appropriate for your OS and architecture.
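
Following the bare metal steps above, a minimal run sketch; the archive name is a placeholder and the `--config` flag name is an assumption rather than something taken from this diff:

```
# placeholder archive name; use the release asset matching your OS and architecture
$ tar -xzf llama-swap_linux_amd64.tar.gz
# point llama-swap at your configuration file (flag name assumed)
$ ./llama-swap --config path/to/config.yaml
```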
@@ -222,6 +220,12 @@ curl -Ns http://host/logs/stream | grep 'eval time'
 curl -Ns 'http://host/logs/stream?no-history'
 ```

+## Do I need to use llama.cpp's server (llama-server)?
+
+Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.
+
+For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation as well as responding correctly to `SIGTERM` signals to shut down.
+
 ## Systemd Unit Files

 Use this unit file to start llama-swap on boot. This is only tested on Ubuntu.
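
For the podman/docker recommendation added in the hunk above, a sketch of how a Python based upstream such as vllm might be containerized; the image tag, port and model id are illustrative assumptions, not part of this commit:

```
# vllm's OpenAI compatible server in a container gives clean isolation and
# proper SIGTERM handling when llama-swap stops or swaps it
$ docker run --rm --runtime nvidia -p 8000:8000 \
vllm/vllm-openai:latest --model Qwen/Qwen2.5-0.5B-Instruct
```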
