
Add llamacpp as cpu accelerate service #168

Merged
merged 8 commits into main from feat/llamacpp
Jun 29, 2024

Conversation

Aisuko
Member

@Aisuko Aisuko commented Jun 26, 2024

Description

This PR

fixed: #153
fixed: #162

It also depends on SkywardAI/containers#32.

Notes for Reviewers

I'm adding a customised llama.cpp image to the backend as the inference engine. It has some advantages:

  • FP16 CPU/GPU acceleration
  • arm64 support

A new API version is also added in this PR.

In the future, I plan to wrap or replace it with Rust, but for now it gives us a smooth inference experience on CPU.

Test commands

For more parameters: https://github.com/SkywardAI/llama.cpp/tree/master/examples/server

docker run -p 8080:8080 -v ./models:/models gclub/llama.cpp:server--b1-7a6db5c -m models/gpt2-117m-Q4_K_M-v2.gguf -c 512 --port 8080 --host 0.0.0.0

curl --request POST     --url http://0.0.0.0:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

Inference on 8 CPU cores runs at a reasonable speed:

~/workspace/chat-backend/models/gguf$ curl --request POST     --url http://0.0.0.0:8080/completion     --header "Content-Type: application/json"     --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
{"content":"\n\n2) Create a simple-to-use template and a content template\n\nIf you get a little flaky and forget about the more difficult stuff, you're correct. Creating a template is much more difficult.\n\nTo succeed, you need to perform the hardest (and the best) work. Because of the inefficiency of a template, a more difficult and less convenient setup may be in your interest. These are the tasks that we can work on instead of doing.\n\nSetting up a site is like the two-part \"solution\" of building a blog. Each of these is difficult and quite expensive.","id_slot":0,"stop":true,"model":"models/gpt2-117m-Q4_K_M-v2.gguf","tokens_predicted":128,"tokens_evaluated":11,"generation_settings":{"n_ctx":512,"n_predict":-1,"model":"models/gpt2-117m-Q4_K_M-v2.gguf","seed":4294967295,"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"penalty_prompt_tokens":[],"use_penalty_prompt_tokens":false,"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typical_p","top_p","min_p","temperature"]},"prompt":"Building a website can be done in 10 simple steps:","truncated":false,"stopped_eos":false,"stopped_word":false,"stopped_limit":true,"stopping_word":"","tokens_cached":138,"timings":{"prompt_n":11,"prompt_ms":15.658,"prompt_per_token_ms":1.4234545454545453,"prompt_per_second":702.5162856048026,"predicted_n":128,"predicted_ms":383.604,"predicted_per_token_ms":2.99690625,"predicted_per_second":333.6774381914683}}

Models address:

Question

Shall we use a volume? I don't see any pre-defined Docker volume. I'd like to add a small model (about 110 MB) to the repo; it can be tracked with git lfs. Then, after running the make up command, docker compose will load the models from local storage. What is your opinion? @Micost

Signed commits

  • Yes, I signed my commits.

@Micost
Collaborator

Micost commented Jun 26, 2024

I would like to use the volumes/models/ folder, which we already use for local models.
Please take a look at docker-compose.yaml.
Here are the lines where we mount the models:
volumes:
- ./backend/:/app/
- ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/models:/models

@Aisuko
Member Author

Aisuko commented Jun 26, 2024

We will have two types of models in the future: *.safetensors and *.gguf. The first will be the customised model, which can be used for CPT; the gguf model is for inference on CPU.

We should add them to the folder under volumes, as sketched below.
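Here is a minimal sketch of how the backend could group model files by extension from the mounted /models volume; the folder layout and helper name are assumptions for illustration, not the actual backend code.

# Hypothetical sketch: group model files by extension from the mounted /models volume.
# The directory layout and function name are assumptions, not the final implementation.
from pathlib import Path

MODELS_DIR = Path("/models")  # mounted from volumes/models in docker-compose

def find_models(models_dir: Path = MODELS_DIR) -> dict[str, list[Path]]:
    """Group model files: *.gguf for llama.cpp CPU inference, *.safetensors for CPT."""
    return {
        "gguf": sorted(models_dir.glob("*.gguf")),
        "safetensors": sorted(models_dir.glob("*.safetensors")),
    }

if __name__ == "__main__":
    for kind, files in find_models().items():
        print(kind, [f.name for f in files])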

@Aisuko
Member Author

Aisuko commented Jun 27, 2024

The chat API currently uses llama.cpp. However, I have commented out code such as the model download and scoring logic. We need to discuss these features and make them work well with the new inference method.

Below is the test log. I used the dev container development environment on AWS; the model is gpt2-minimal.

Prompt

{
  "accountID": 0,
  "sessionId": 0,
  "message": "Building a website can be done in 10 simple steps:"
}

Response

2024-06-27 05:29:38.539 | INFO     | src.repository.rag.chat:inference:199 - inference answer value:{'content': '\n\n1) Request the .html on the web\n\n2) Add an HTML element and edit the "Content-Type" for it.\n\n3) Type in a new "Content-Type" that is the same as the "Content-Image" you downloaded from the server.\n\n4) Within a few seconds, the "Content-Type" that you clicked on is the same as the "Content-Image" that is on your desktop or a Google Sheet.\n\nIn the future you can recreate the content and edit the content using this template.\n\nI hope this has helped, and if not', 'id_slot': 0, 'stop': True, 'model': 'models/gpt2-minimal-Q4_K_M-v2.gguf', 'tokens_predicted': 128, 'tokens_evaluated': 11, 'generation_settings': {'n_ctx': 512, 'n_predict': -1, 'model': 'models/gpt2-minimal-Q4_K_M-v2.gguf', 'seed': 4294967295, 'temperature': 0.800000011920929, 'dynatemp_range': 0.0, 'dynatemp_exponent': 1.0, 'top_k': 40, 'top_p': 0.949999988079071, 'min_p': 0.05000000074505806, 'tfs_z': 1.0, 'typical_p': 1.0, 'repeat_last_n': 64, 'repeat_penalty': 1.0, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'penalty_prompt_tokens': [], 'use_penalty_prompt_tokens': False, 'mirostat': 0, 'mirostat_tau': 5.0, 'mirostat_eta': 0.10000000149011612, 'penalize_nl': False, 'stop': [], 'n_keep': 0, 'n_discard': 0, 'ignore_eos': False, 'stream': False, 'logit_bias': [], 'n_probs': 0, 'min_keep': 0, 'grammar': '', 'samplers': ['top_k', 'tfs_z', 'typical_p', 'top_p', 'min_p', 'temperature']}, 'prompt': 'Building a website can be done in 10 simple steps:', 'truncated': False, 'stopped_eos': False, 'stopped_word': False, 'stopped_limit': True, 'stopping_word': '', 'tokens_cached': 138, 'timings': {'prompt_n': 11, 'prompt_ms': 15.344, 'prompt_per_token_ms': 1.3949090909090909, 'prompt_per_second': 716.8925964546403, 'predicted_n': 128, 'predicted_ms': 444.658, 'predicted_per_token_ms': 3.473890625, 'predicted_per_second': 287.8616824615772}}
INFO:     127.0.0.1:55970 - "POST /api/chat HTTP/1.1" 201 Created
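To reproduce this locally, a request along the following lines should work. The /api/chat path and payload fields come from the log above; the backend host and port (uvicorn's default 127.0.0.1:8000) are an assumption about the local dev setup.

# Sketch of reproducing the chat API test above.
# The /api/chat path and payload fields are taken from the log; the host/port
# (uvicorn's default 127.0.0.1:8000) is an assumption about the local setup.
import requests

CHAT_URL = "http://127.0.0.1:8000/api/chat"  # assumed local backend address

payload = {
    "accountID": 0,
    "sessionId": 0,
    "message": "Building a website can be done in 10 simple steps:",
}

resp = requests.post(CHAT_URL, json=payload, timeout=60)
print(resp.status_code)  # expect 201 Created, as in the log above
print(resp.json())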

@Aisuko
Member Author

Aisuko commented Jun 28, 2024

[Screenshot: 2024-06-28 at 11:12:33 AM]

Currently, the RAG feature is unusable. I think we need an efficient encoder and decoder that support both CPU and GPU, and I'm working on that.

@Micost Is it OK if we merge this one first? (We do not release a version, only merge the PR.) Currently, the deployment process does not download any model.
That way we can focus on the CRM features with @cbh778899; it will be easier for him to deploy the backend to his local environment.

@Aisuko Aisuko requested review from Micost and cbh778899 June 28, 2024 01:57
@Aisuko Aisuko modified the milestone: v0.1.11(backend) Jun 28, 2024
@Aisuko Aisuko requested a review from jinronga June 28, 2024 02:00
@Aisuko Aisuko marked this pull request as ready for review June 28, 2024 06:01
Collaborator

@Micost Micost left a comment

LGTM

Review comment on backend/src/api/routes/version.py (resolved).
@Aisuko Aisuko merged commit f7d699d into main Jun 29, 2024
6 checks passed
@Aisuko Aisuko deleted the feat/llamacpp branch June 29, 2024 05:39