Add llama.cpp as CPU-accelerated inference service #168
Conversation
I would like to have the volumes/models/ folder, which we use for local models.
We will have two types of models in the future: *.safetensors and *.gguf. The first is the customised model and can be used for CPT; gguf is for inference on CPU. We should add both under the volumes/models/ folder.
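To make the two-format plan concrete, here is a minimal sketch of dispatching on the model file extension; the backend names ("transformers", "llamacpp") are hypothetical, not the actual implementation:

```python
from pathlib import Path

def pick_model_backend(model_file: str) -> str:
    """Return which runtime should load a model, based on its extension.

    *.safetensors -> customised models used for CPT
    *.gguf        -> llama.cpp for CPU inference
    """
    suffix = Path(model_file).suffix
    if suffix == ".safetensors":
        return "transformers"  # hypothetical name for the CPT path
    if suffix == ".gguf":
        return "llamacpp"  # CPU inference through the llama.cpp service
    raise ValueError(f"Unsupported model format: {suffix}")

print(pick_model_backend("custom-model.safetensors"))  # transformers
print(pick_model_backend("gpt2-minimal.gguf"))         # llamacpp
```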
The chat API currently uses llama.cpp. However, I have commented out code such as the model download and scoring logic; we need to discuss these features and make them work well with the new inference method. Below is the test log. I used the dev container development environment on AWS; the model I am using is gpt2-minimal, with the prompt below.
Response
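For context, a minimal sketch of how the chat API could forward a prompt to the llama.cpp server container; the host/port and the use of the requests library are assumptions, though /completion is the endpoint exposed by llama.cpp's examples/server:

```python
import requests

# Assumed address of the llamacpp container on the compose network.
LLAMACPP_URL = "http://llamacpp:8080"

def chat(prompt: str, n_predict: int = 128) -> str:
    """Forward one prompt to the llama.cpp server and return the generated text."""
    resp = requests.post(
        f"{LLAMACPP_URL}/completion",
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=60,
    )
    resp.raise_for_status()
    # The server responds with JSON; the generated text is in "content".
    return resp.json()["content"]

if __name__ == "__main__":
    print(chat("Hello, who are you?"))
```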
Currently, the RAG feature is unusable. I think we need an efficient encoder and decoder that support both CPU and GPU, and I'm working on it. @Micost Is it OK if we merge this one first? (We do not release a version, we only merge the PR.) Currently, the deployment process does not download any model.
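On the encoder point, a minimal sketch of device-agnostic embedding with sentence-transformers; the library choice and model name are assumptions here, not a decision made in this PR:

```python
import torch
from sentence_transformers import SentenceTransformer

# Pick the GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# all-MiniLM-L6-v2 is a small, commonly used encoder; the actual model is undecided.
encoder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

embeddings = encoder.encode(["What is RAG?", "Retrieval-augmented generation."])
print(embeddings.shape)  # (2, 384)
```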
LGTM
Description
This PR:
fixes #153
fixes #162
and depends on SkywardAI/containers#32
Notes for Reviewers
I'm adding a customised llama.cpp image to the backend as the inference engine. It has some advantages:
A new API, version, was also added in this PR. In the future, I plan to bind it to or replace it with Rust, but for now it provides us with a smooth inference experience on CPU.
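A minimal sketch of what the new version API might look like, assuming a FastAPI backend; the route path and response shape here are assumptions:

```python
from fastapi import FastAPI

app = FastAPI()

# Hypothetical constant; in practice this would come from package metadata.
BACKEND_VERSION = "0.1.0"

@app.get("/version")
async def version() -> dict:
    """Report the backend version so clients can check compatibility."""
    return {"version": BACKEND_VERSION}
```

Serving this with uvicorn and issuing GET /version would then return the JSON payload.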
Test commands
For more parameters: https://github.com/SkywardAI/llama.cpp/tree/master/examples/server
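As a stand-in for the commands, a hedged smoke-test script, assuming the server listens on localhost:8080; /health and /completion are endpoints documented in the llama.cpp server examples linked above:

```python
import requests

BASE = "http://localhost:8080"  # assumed host/port of the llamacpp service

# 1. Check that the server is up and the model is loaded.
health = requests.get(f"{BASE}/health", timeout=5)
print("health:", health.json())

# 2. Run a short completion as a smoke test.
resp = requests.post(
    f"{BASE}/completion",
    json={"prompt": "The capital of Australia is", "n_predict": 16},
    timeout=60,
)
print("completion:", resp.json()["content"])
```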
Inference on 8 CPUs with a reasonable speed
Model address:
Question
Shall we use a volume? I don't see any pre-defined Docker volume. I'd like to add a small model (110 MB) to the repo; it can be tracked with git-lfs. After executing the make up command, docker compose will load the models from local storage. What is your opinion? @Micost
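To make the proposal concrete, a sketch of how the backend could resolve the model path from the mounted folder; the env var name and default file are hypothetical:

```python
import os
from pathlib import Path

# MODEL_PATH is a hypothetical env var; the default points into the proposed
# volumes/models/ folder that docker compose would mount from the repo.
model_path = Path(os.environ.get("MODEL_PATH", "volumes/models/gpt2-minimal.gguf"))

if not model_path.is_file():
    raise FileNotFoundError(
        f"Expected a gguf model at {model_path}; "
        "did `make up` mount the volumes/models/ folder?"
    )
print(f"Loading model from {model_path}")
```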