server: cache prompt to host memory #954
Conversation
Commits:
- change similarity calculation and prompt save conditions
- Remove unneeded token limit
- rename variable
- Separate prompt save and load logic
- change default values
- change log
- remove truncate prompt logic
ikawrakow left a comment:
As a general comment: the server compile time is slowly getting out of hand, so we should think about breaking the server up into pieces. You could kick this off by putting the prompt cache in separate .h and .cpp files (and not having the implementation in the .h file).
But this can also be done later. I'll leave it up to you.
```cpp
//
// other common utils
//
static float get_slot_similarity(size_t lcp, size_t prompt_length, size_t cache_length) {
```
So, basically the Dice coefficient
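For context, here is a minimal sketch of what such a helper could compute, assuming the formula stated in the PR description below (2 × longest common token prefix divided by the sum of prompt and cache lengths); the body actually merged in the diff may differ.

```cpp
#include <cstddef>

// Sketch only: Dice-style similarity between an incoming prompt and a slot's
// cached prompt, where `lcp` is the length of their longest common token prefix.
static float get_slot_similarity(size_t lcp, size_t prompt_length, size_t cache_length) {
    if (prompt_length + cache_length == 0) {
        return 0.0f; // guard added for this sketch; not necessarily present in the PR
    }
    return 2.0f * float(lcp) / float(prompt_length + cache_length);
}
```

The score is 1.0 when the two token sequences are identical and falls toward 0 as the shared prefix shrinks, which is exactly the Dice-coefficient shape noted above.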
Yeah, best to do the refactoring of the entire server code in another PR.

Ready to merge?

Should be OK. Just pushed a new commit. Tested fine for me.
This PR is a port of ggml-org/llama.cpp#16391 with a few changes. With this PR, there is no need to use -np N for multiple conversations. When a user starts a new conversation, the old conversation's KV cache is saved to RAM and can be retrieved later. This greatly reduces prompt processing time when switching between conversations, and you can have as many conversations as your RAM allows.
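As a rough mental model only (the entry layout and names below are assumptions, not the data structures actually used in this PR), each saved conversation can be pictured as its prompt tokens plus the serialized KV-cache state for those tokens, held in host memory until the conversation is resumed:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t; // matches the token type used by llama.cpp

// Hypothetical host-memory cache entry for one saved conversation.
struct prompt_cache_entry {
    std::vector<llama_token> tokens;  // prompt tokens of the saved conversation
    std::vector<uint8_t>     kv_data; // serialized KV-cache state for those tokens

    size_t size_bytes() const {
        return kv_data.size() + tokens.size() * sizeof(llama_token);
    }
};
```

The --cache-ram budget described below would then cap the total size of all such entries; how entries are evicted once the budget is exceeded is up to the implementation.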
-cram, --cache-ram N: size of RAM used to cache prompts (in MiB). -1: no limit, 0: disabled, 8192: 8192 MiB.

New args to give more control over the prompt cache behavior:
-crs, --cache-ram-similarity N: if the percentage of tokens that would be evicted from the cache is below this value, the cached prompt is saved to RAM. Default is 0.5. Matches mainline.
--cache-ram-n-min N: the prompt must be greater than this value to be saved to RAM, to avoid saving too many small conversations. Default is 0. Matches mainline.

Other changes:
Use 2 * longest common token prefix / (prompt token count + cache token count) as the similarity to select the slot and the cached prompt. This formula can be changed later.
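To make the selection step concrete, the sketch below applies that formula to pick the best-matching cached prompt for a new request. The helper names (common_prefix_len, pick_best_cached_prompt) are invented for the example and do not come from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t; // matches the token type used by llama.cpp

// Length of the longest common token prefix of two token sequences.
static size_t common_prefix_len(const std::vector<llama_token> & a,
                                const std::vector<llama_token> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// Pick the cached prompt with the highest similarity to the new prompt,
// using similarity = 2 * lcp / (prompt size + cache size).
// Returns -1 if no cached prompt shares a prefix with the new prompt.
static int pick_best_cached_prompt(const std::vector<llama_token> & prompt,
                                   const std::vector<std::vector<llama_token>> & cached) {
    int   best     = -1;
    float best_sim = 0.0f;
    for (size_t i = 0; i < cached.size(); i++) {
        const size_t lcp = common_prefix_len(prompt, cached[i]);
        const size_t den = prompt.size() + cached[i].size();
        const float  sim = den == 0 ? 0.0f : 2.0f * float(lcp) / float(den);
        if (sim > best_sim) {
            best_sim = sim;
            best     = int(i);
        }
    }
    return best;
}
```

The same score can serve both for choosing a server slot and for choosing which cached conversation to restore, as described above.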