Add llama.cpp as CPU-accelerated inference service #168
Conversation
I would like to have the volumes/models/ folder, which we use for local models.
We will have two types of models in the future: *.safetensors and *.gguf. The first is the customised model and can be used for CPT; gguf is for inference on CPU. We should add both under the volumes/models/ folder.
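To make the two-format plan concrete, here is a minimal sketch of dispatching on the model file extension; the backend names ("transformers", "llamacpp") are hypothetical, not the actual implementation:

```python
from pathlib import Path

def pick_model_backend(model_file: str) -> str:
    """Return which runtime should load a model, based on its extension.

    *.safetensors -> customised models used for CPT
    *.gguf        -> llama.cpp for CPU inference
    """
    suffix = Path(model_file).suffix
    if suffix == ".safetensors":
        return "transformers"  # hypothetical name for the CPT path
    if suffix == ".gguf":
        return "llamacpp"  # CPU inference through the llama.cpp service
    raise ValueError(f"Unsupported model format: {suffix}")

print(pick_model_backend("custom-model.safetensors"))  # transformers
print(pick_model_backend("gpt2-minimal.gguf"))         # llamacpp
```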
The chat API currently uses llama.cpp. However, I have commented out code such as the model download and scoring logic; we need to discuss these features and make them work well with the new inference method. Below is the test log. I used the dev container development environment on AWS; the model I am using is gpt2-minimal, with the prompt below.
Response
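For context, a minimal sketch of how the chat API could forward a prompt to the llama.cpp server container; the host/port and the use of the requests library are assumptions, though /completion is the endpoint exposed by llama.cpp's examples/server:

```python
import requests

# Assumed address of the llamacpp container on the compose network.
LLAMACPP_URL = "http://llamacpp:8080"

def chat(prompt: str, n_predict: int = 128) -> str:
    """Forward one prompt to the llama.cpp server and return the generated text."""
    resp = requests.post(
        f"{LLAMACPP_URL}/completion",
        json={"prompt": prompt, "n_predict": n_predict},
        timeout=60,
    )
    resp.raise_for_status()
    # The server responds with JSON; the generated text is in "content".
    return resp.json()["content"]

if __name__ == "__main__":
    print(chat("Hello, who are you?"))
```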
Currently, the RAG feature is unusable. I think we need an efficient encoder and decoder that support both CPU and GPU, and I'm working on it. @Micost Is it OK if we merge this one first? (We do not release a version, we only merge the PR.) Currently, the deployment process does not download any model.
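On the encoder point, a minimal sketch of device-agnostic embedding with sentence-transformers; the library choice and model name are assumptions here, not a decision made in this PR:

```python
import torch
from sentence_transformers import SentenceTransformer

# Pick the GPU when available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# all-MiniLM-L6-v2 is a small, commonly used encoder; the actual model is undecided.
encoder = SentenceTransformer("all-MiniLM-L6-v2", device=device)

embeddings = encoder.encode(["What is RAG?", "Retrieval-augmented generation."])
print(embeddings.shape)  # (2, 384)
```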
LGTM
Description
This PR:
fixes #153
fixes #162
and depends on SkywardAI/containers#32
Notes for Reviewers
I'm adding a customised llama.cpp image to the backend as the inference engine. It has some advantages:
A new API, version, was also added in this PR. In the future, I plan to bind it to or replace it with Rust, but for now it provides us with a smooth inference experience on CPU.
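A minimal sketch of what the new version API might look like, assuming a FastAPI backend; the route path and response shape here are assumptions:

```python
from fastapi import FastAPI

app = FastAPI()

# Hypothetical constant; in practice this would come from package metadata.
BACKEND_VERSION = "0.1.0"

@app.get("/version")
async def version() -> dict:
    """Report the backend version so clients can check compatibility."""
    return {"version": BACKEND_VERSION}
```

Serving this with uvicorn and issuing GET /version would then return the JSON payload.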
Test commands
For more parameters: https://github.com/SkywardAI/llama.cpp/tree/master/examples/server
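As a stand-in for the commands, a hedged smoke-test script, assuming the server listens on localhost:8080; /health and /completion are endpoints documented in the llama.cpp server examples linked above:

```python
import requests

BASE = "http://localhost:8080"  # assumed host/port of the llamacpp service

# 1. Check that the server is up and the model is loaded.
health = requests.get(f"{BASE}/health", timeout=5)
print("health:", health.json())

# 2. Run a short completion as a smoke test.
resp = requests.post(
    f"{BASE}/completion",
    json={"prompt": "The capital of Australia is", "n_predict": 16},
    timeout=60,
)
print("completion:", resp.json()["content"])
```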
Inference on 8 CPUs with a reasonable speed
Model address:
Question
Shall we use a volume? I don't see any pre-defined Docker volume. I'd like to add a small model (110 MB) to the repo; it can be tracked with git-lfs. After executing the make up command, docker compose will load the models from local storage. What is your opinion? @Micost
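To make the proposal concrete, a sketch of how the backend could resolve the model path from the mounted folder; the env var name and default file are hypothetical:

```python
import os
from pathlib import Path

# MODEL_PATH is a hypothetical env var; the default points into the proposed
# volumes/models/ folder that docker compose would mount from the repo.
model_path = Path(os.environ.get("MODEL_PATH", "volumes/models/gpt2-minimal.gguf"))

if not model_path.is_file():
    raise FileNotFoundError(
        f"Expected a gguf model at {model_path}; "
        "did `make up` mount the volumes/models/ folder?"
    )
print(f"Loading model from {model_path}")
```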