Efficient, easy-to-use platform for inference and serving of local LLMs, including an OpenAI-compatible API server.
- OpenAI-compatible API server for serving LLMs.
- Highly extensible trait-based system that allows rapid implementation of new module pipelines.
- Streaming support in generation.
- Efficient management of key-value cache with PagedAttention.
- Continuous batching.
Currently, candle-vllm supports chat serving for the following models.
Model ID | Model Type | Supported | Speed (A100, BF16) |
---|---|---|---|
#1 | LLAMA/LLAMA2/LLaMa3 | ✅ | 71 tks/s (7B) |
#2 | Mistral | TBD | TBD |
#3 | Phi (v1, v1.5, v2) | TBD | TBD |
#4 | Phi-3 (3.8B, 7B) | ✅ | 99 tks/s (3.8B) |
#5 | Yi | TBD | TBD |
#6 | StableLM | TBD | TBD |
#7 | BigCode/StarCode | TBD | TBD |
#8 | ChatGLM | TBD | TBD |
#9 | QWen | TBD | TBD |
#10 | Google Gemma | TBD | TBD |
#11 | Blip-large (Multimodal) | TBD | TBD |
#12 | Moondream-2 (Multimodal LLM) | TBD | TBD |
See this folder for some examples.
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt install libssl-dev
sudo apt install pkg-config
git clone [email protected]:EricLBuehler/candle-vllm.git
cd candle-vllm
cargo run --release -- --port 2000 --weight-path /home/llama2_7b/ llama7b --repeat-last-n 64
Install ChatUI and its dependencies:
git clone [email protected]:guoqingbao/candle-vllm-demo.git
cd candle-vllm-demo
apt install npm #install npm if needed
npm install n -g #update node js if needed
n stable #update node js if needed
npm i -g pnpm #install pnpm manager
pnpm install #install ChatUI dependencies
Launch the ChatUI:
pnpm run dev # run the ChatUI
curl -X POST "http://127.0.0.1:2000/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"model": "llama7b",
"messages": [
{"role": "user", "content": "Explain how to best learn Rust."}
],
"temperature": 0.7,
"max_tokens": 128,
"stop": {"Single":"</s>"}
}'
Sample response:
{"id":"cmpl-53092967-c9cf-40e0-ae26-d7ac786d59e8","choices":[{"message":{"content":" Learning any programming language requires a combination of theory, practice, and dedication. Here are some steps and resources to help you learn Rust effectively:\n\n1. Start with the basics:\n\t* Understand the syntax and basic structure of Rust programs.\n\t* Learn about variables, data types, loops, and control structures.\n\t* Familiarize yourself with Rust's ownership system and borrowing mechanism.\n2. Read the Rust book:\n\t* The Rust book is an official resource that provides a comprehensive introduction to the language.\n\t* It covers topics such","role":"[INST]"},"finish_reason":"length","index":0,"logprobs":null}],"created":1718784498,"model":"llama7b","object":"chat.completion","usage":{"completion_tokens":129,"prompt_tokens":29,"total_tokens":158}}
In your terminal, install the openai Python package by running pip install openai. I use version 1.3.5.
Then, create a new Python file and write the following code:
import openai

# Point the client at the local candle-vllm server (no real API key is needed).
openai.api_key = "EMPTY"
openai.base_url = "http://localhost:2000/v1/"

completion = openai.chat.completions.create(
    model="llama7b",
    messages=[
        {
            "role": "user",
            "content": "Explain how to best learn Rust.",
        },
    ],
    max_tokens=64,
)
print(completion.choices[0].message.content)
After the candle-vllm service is running, run the Python script and enjoy efficient inference with an OpenAI-compatible API server!
For general configuration help, run cargo run -- --help.
For model-specific help, run cargo run -- --port 1234 <MODEL NAME> --help.
For local model weights, run cargo run --release -- --port 2000 --weight-path /home/llama2_7b/ llama7b --repeat-last-n 64, changing the path as needed.
For kvcache configuration, set kvcache_mem_cpu and kvcache_mem_gpu; by default, 4GB of CPU memory and 4GB of GPU memory are reserved for the kvcache.
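As a minimal sketch, these limits would be passed on the command line roughly as follows; the flag names --kvcache-mem-cpu and --kvcache-mem-gpu and the MB units are assumptions here, so confirm the exact names with cargo run -- --help:

```shell
# Hypothetical flag names and units; verify with `cargo run -- --help`.
# 4096 MB = 4GB, matching the documented defaults.
cargo run --release -- --port 2000 \
  --weight-path /home/llama2_7b/ \
  --kvcache-mem-cpu 4096 \
  --kvcache-mem-gpu 4096 \
  llama7b --repeat-last-n 64
```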
For chat history settings, set record_conversation to true to let candle-vllm remember chat history. By default, candle-vllm does not record chat history; instead, the client sends both the messages and the contextual history to candle-vllm. If record_conversation is set to true, the client sends only new chat messages to candle-vllm, and candle-vllm is responsible for recording the previous chat messages. However, this approach requires per-session chat recording, which is not yet implemented, so the default record_conversation=false is recommended.
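For illustration, under the default record_conversation=false the client resends the full conversation in the messages array on every turn, with only the last user message being new (the message contents below are made up for the example):

```shell
# The client carries the history; candle-vllm sees the whole context each turn.
curl -X POST "http://127.0.0.1:2000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
           "model": "llama7b",
           "messages": [
             {"role": "user", "content": "Explain how to best learn Rust."},
             {"role": "assistant", "content": "Start with the official Rust book..."},
             {"role": "user", "content": "Which chapters cover ownership?"}
           ],
           "max_tokens": 128
         }'
```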
For chat streaming, the stream flag in the chat request needs to be set to true.
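For example, a streaming variant of the earlier curl request adds the standard OpenAI "stream": true field; the response then arrives incrementally in chunks rather than as a single JSON body (curl's -N disables output buffering so chunks print as they arrive):

```shell
curl -N -X POST "http://127.0.0.1:2000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -d '{
           "model": "llama7b",
           "messages": [
             {"role": "user", "content": "Explain how to best learn Rust."}
           ],
           "temperature": 0.7,
           "max_tokens": 128,
           "stream": true
         }'
```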
Installing candle-vllm is as simple as the steps shown above. If you have any problems, please create an issue.
The following features are planned to be implemented, but contributions are especially welcome:
- Sampling methods:
  - Beam search (huggingface/candle#1319)
- More pipelines (from candle-transformers)
- Python implementation: vllm-project
- vllm paper