ChatFlameBackend is a backend for chat applications built on the Candle ML framework, with a focus on the Mistral model.
```bash
cargo build --release
```

Run the server:

```bash
cargo run --release
```

Run one of the models:

```bash
cargo run --release -- --model phi-v2 --prompt 'write me fibonacci in rust'
```
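As an illustration of how the `--model` and `--prompt` flags above could be declared, here is a minimal sketch using clap's derive API. It is not the project's actual argument parser; the flag set and types are assumptions based only on the command shown above.

```rust
use clap::Parser;

/// Hypothetical CLI arguments; names mirror the flags used above.
#[derive(Parser, Debug)]
struct Args {
    /// Which model to run, e.g. "phi-v2".
    #[arg(long)]
    model: Option<String>,

    /// Optional prompt to complete instead of just starting the server.
    #[arg(long)]
    prompt: Option<String>,
}

fn main() {
    let args = Args::parse();
    println!("model = {:?}, prompt = {:?}", args.model, args.prompt);
}
```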
Alternatively, build and run with Docker Compose:

```bash
docker-compose up --build
```
Visit http://localhost:8080/swagger-ui for the Swagger UI.
Run the tests:

```bash
cargo test
```
You can also query the API directly with curl:

```bash
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Your text prompt here"}'
```

Or use the stream endpoint:

```bash
curl -X POST -H "Content-Type: application/json" -d '{"inputs": "Your input text"}' http://localhost:8080/generate_stream
```
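For a Rust client, the same endpoints can be called with reqwest. The following is a minimal sketch, not part of the project: it assumes reqwest with the `json` and `stream` features, tokio as the async runtime, and that the stream endpoint emits plain text chunks; adjust the parsing to the actual response format.

```rust
// Hypothetical client sketch for the two endpoints shown above.
use futures_util::StreamExt;
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::Client::new();

    // Plain generation: one JSON request, one response body.
    let response = client
        .post("http://localhost:8080/generate")
        .json(&json!({ "inputs": "Your text prompt here" }))
        .send()
        .await?;
    println!("{}", response.text().await?);

    // Streaming generation: print chunks as they arrive.
    // The exact chunk format (SSE vs. raw text) depends on the server.
    let mut stream = client
        .post("http://localhost:8080/generate_stream")
        .json(&json!({ "inputs": "Your input text" }))
        .send()
        .await?
        .bytes_stream();
    while let Some(chunk) = stream.next().await {
        print!("{}", String::from_utf8_lossy(&chunk?));
    }
    Ok(())
}
```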
Detailed documentation on how to use the Python client can be found on Hugging Face.
```bash
virtualenv .venv
source .venv/bin/activate
pip install huggingface-hub
python test.py
```
The backend is written in Rust. Models are loaded with the Candle framework and served over an HTTP endpoint with axum; utoipa generates the OpenAPI documentation that backs the Swagger UI.
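To illustrate how such a route can be wired up, here is a minimal sketch (not the actual implementation) of an axum handler documented with utoipa, assuming axum 0.7 and utoipa-swagger-ui with its `axum` feature. The `GenerateRequest`/`GenerateResponse` types and the handler body are placeholders.

```rust
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};
use utoipa::{OpenApi, ToSchema};
use utoipa_swagger_ui::SwaggerUi;

// Placeholder request/response types; the real schema may differ.
#[derive(Deserialize, ToSchema)]
struct GenerateRequest {
    inputs: String,
}

#[derive(Serialize, ToSchema)]
struct GenerateResponse {
    generated_text: String,
}

// Text generation endpoint, documented in the OpenAPI spec via utoipa.
#[utoipa::path(
    post,
    path = "/generate",
    request_body = GenerateRequest,
    responses((status = 200, description = "Generated text", body = GenerateResponse))
)]
async fn generate(Json(req): Json<GenerateRequest>) -> Json<GenerateResponse> {
    // The real handler would run the selected model here.
    Json(GenerateResponse {
        generated_text: format!("echo: {}", req.inputs),
    })
}

#[derive(OpenApi)]
#[openapi(paths(generate), components(schemas(GenerateRequest, GenerateResponse)))]
struct ApiDoc;

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/generate", post(generate))
        .merge(SwaggerUi::new("/swagger-ui").url("/api-doc/openapi.json", ApiDoc::openapi()));

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```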
The following table shows the performance of the models on different systems:
| Model | System | Tokens per second |
|---|---|---|
| 7b-open-chat-3.5 | AMD 7900X3D (12 Core), 64GB | 9.4 |
| 7b-open-chat-3.5 | AMD 5600G (8 Core VM), 16GB | 2.8 |
| 13b (llama2 13b) | AMD 7900X3D (12 Core), 64GB | 5.2 |
| phi-2 | AMD 7900X3D (12 Core), 64GB | 20.6 |
| phi-2 | AMD 5600G (8 Core VM), 16GB | 5.3 |
| phi-2 | Apple M2 (10 Core), 16GB | 24.0 |
Performance depends heavily on the memory bandwidth of the system: the Phi-2 model reaches 20.6 tokens/s on an AMD 7900X3D with 64GB of DDR5-4800 memory, and overclocking the memory to DDR5-5600 raises this to 21.8 tokens/s.
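For reference, tokens per second is simply the number of generated tokens divided by the elapsed wall-clock time. A small, self-contained sketch of that measurement (not the project's benchmarking code) could look like this:

```rust
use std::time::Instant;

/// Report tokens per second for a generation run.
/// `generate` is any closure that produces tokens and returns how many it emitted.
fn tokens_per_second<F: FnOnce() -> usize>(generate: F) -> f64 {
    let start = Instant::now();
    let tokens = generate();
    tokens as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    // Dummy workload standing in for model inference.
    let tps = tokens_per_second(|| {
        std::thread::sleep(std::time::Duration::from_millis(100));
        42
    });
    println!("{tps:.1} tokens/s");
}
```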
- implement api for https://huggingface.github.io/text-generation-inference/#/
- model configuration
- generate stream
- docker image and docker-compose
- add tests
- add documentation
- fix stop token