
Conversation

@firecoperana
Collaborator

This PR is a port of ggml-org/llama.cpp#16391 with a few changes. With this PR there is no need to use -np N for multiple conversations. When the user starts a new conversation, the old conversation's KV cache is saved in RAM and can be retrieved later. This greatly reduces prompt-processing time when switching between conversations, and you can keep as many conversations cached as your RAM allows.
-cram, --cache-ram N: amount of RAM (in MiB) used to cache prompts. -1: no limit; 0: disabled; e.g. 8192: 8192 MiB.
New args give finer control over the prompt-cache behavior:
-crs, --cache-ram-similarity N: if the fraction of cached tokens that would be evicted is below this value, the cached prompt is saved to RAM. Default is 0.5. Matches mainline.
--cache-ram-n-min N: the prompt must be longer than this value (in tokens) to be saved to RAM, which avoids caching lots of small conversations. Default is 0. Matches mainline.
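For example, a hypothetical invocation (binary name, model path, and the specific values are placeholders, not recommendations):

```sh
# Cache up to 8192 MiB of prompts in RAM; save a prompt only if fewer than 50%
# of its cached tokens would be evicted and it is longer than 64 tokens.
./llama-server -m model.gguf \
  --cache-ram 8192 \
  --cache-ram-similarity 0.5 \
  --cache-ram-n-min 64
```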
Other changes:

  1. Mainline limits the combined token count of all checkpoints to the context size. I'm not sure why that limit exists, and I don't think it's needed here, so the cache-ram token limit from mainline is removed.
  2. Mainline uses different logic to choose the best slot and the best cached prompt, with some arbitrary thresholds, which does not look right to me. The fix here is to use the same calculation, called slot similarity, for both: 2 * (longest common token prefix) / (prompt token count + cached token count); see the sketch after this list. This formula can be changed later.
  3. Change the slot-prompt-similarity default from 0.5 to 0.1 because of "Eval bug: gpt-oss model reprocesses the entire prompt from the beginning" (ggml-org/llama.cpp#15894).
  4. Remove the prompt-truncation code that ran when the prompt exceeded the context size.
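A minimal sketch of that slot-similarity calculation (the actual implementation is in the PR diff; the function name and parameters follow the excerpt quoted in the review comment below, and the zero-length guard is an addition for illustration):

```cpp
#include <cstddef>

// Dice-style similarity between the incoming prompt and a cached prompt/slot:
//   2 * lcp / (prompt_length + cache_length)
// where lcp is the length of the longest common token prefix.
static float get_slot_similarity(size_t lcp, size_t prompt_length, size_t cache_length) {
    if (prompt_length + cache_length == 0) {
        return 0.0f; // avoid division by zero for empty prompts
    }
    return 2.0f * float(lcp) / float(prompt_length + cache_length);
}
```

A score of 1.0 means the prompt and the cached tokens match exactly; the slot and the cached prompt with the highest score are selected.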

firecoperana added commits on November 13, 2025:

- change similarity calculation and prompt save conditions
- Remove unneeded token limit
- rename variable
- Separate prompt save and load logic
- change default values
- change log
- remove truncate prompt logic
@ikawrakow
Owner

As a general comment: the server compile time is slowly starting to get out of hand, so we should think about breaking it up into pieces. You could kick this off by having the prompt cache in separate .h and .cpp files (and not having the implementation be in the .h file).

But this can also be done later. Leave it up to you.

```cpp
//
// other common utils
//
static float get_slot_similarity(size_t lcp, size_t prompt_length, size_t cache_length) {
```
@ikawrakow
Owner

So, basically the Dice coefficient

@firecoperana
Collaborator Author

> As a general comment: the server compile time is slowly starting to get out of hand, so we should think about breaking it up into pieces. You could kick this off by having the prompt cache in separate .h and .cpp files (and not having the implementation be in the .h file).
>
> But this can also be done later. Leave it up to you.

Yeah, best to do the refactoring of the entire server code in another PR.

@ikawrakow
Owner

Ready to merge?

@firecoperana
Collaborator Author

Should be ok. Just pushed a new commit. Tested fine for me.
