
Conversation

@firecoperana
Collaborator

This PR is a port of ggml-org/llama.cpp#16391 with a few changes. With this PR there is no need to use -np N for multiple conversations. When the user starts a new conversation, the old conversation's KV cache is saved in RAM and can be retrieved later. This greatly reduces prompt-processing time when switching between conversations, and you can keep as many conversations cached as your RAM allows.
-cram, --cache-ram N: amount of RAM (in MiB) used to cache prompts. -1: no limit; 0: disabled; e.g. 8192: 8192 MiB.
New args give finer control over the prompt-cache behavior:
-crs, --cache-ram-similarity N: if the fraction of cached tokens that would be evicted is below this value, the cached prompt is saved to RAM. Default is 0.5. Matches mainline.
--cache-ram-n-min N: the prompt must be longer than this value (in tokens) to be saved to RAM, which avoids caching lots of small conversations. Default is 0. Matches mainline.
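For example, a hypothetical invocation (binary name, model path, and the specific values are placeholders, not recommendations):

```sh
# Cache up to 8192 MiB of prompts in RAM; save a prompt only if fewer than 50%
# of its cached tokens would be evicted and it is longer than 64 tokens.
./llama-server -m model.gguf \
  --cache-ram 8192 \
  --cache-ram-similarity 0.5 \
  --cache-ram-n-min 64
```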
Other changes:

  1. Mainline limits the combined token count of all checkpoints to the context size. I'm not sure why that limit exists, and I don't think it's needed here, so the cache-ram token limit from mainline is removed.
  2. Mainline uses different logic to choose the best slot and the best cached prompt, with some arbitrary thresholds, which does not look right to me. The fix here is to use the same calculation, called slot similarity, for both: 2 * (longest common token prefix) / (prompt token count + cached token count); see the sketch after this list. This formula can be changed later.
  3. Change the slot-prompt-similarity default from 0.5 to 0.1 because of "Eval bug: gpt-oss model reprocesses the entire prompt from the beginning" (ggml-org/llama.cpp#15894).
  4. Remove the prompt-truncation code that ran when the prompt exceeded the context size.
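A minimal sketch of that slot-similarity calculation (the actual implementation is in the PR diff; the function name and parameters follow the excerpt quoted in the review comment below, and the zero-length guard is an addition for illustration):

```cpp
#include <cstddef>

// Dice-style similarity between the incoming prompt and a cached prompt/slot:
//   2 * lcp / (prompt_length + cache_length)
// where lcp is the length of the longest common token prefix.
static float get_slot_similarity(size_t lcp, size_t prompt_length, size_t cache_length) {
    if (prompt_length + cache_length == 0) {
        return 0.0f; // avoid division by zero for empty prompts
    }
    return 2.0f * float(lcp) / float(prompt_length + cache_length);
}
```

A score of 1.0 means the prompt and the cached tokens match exactly; the slot and the cached prompt with the highest score are selected.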

firecoperana added commits on November 13, 2025:

- change similarity calculation and prompt save conditions
- Remove unneeded token limit
- rename variable
- Separate prompt save and load logic
- change default values
- change log
- remove truncate prompt logic
@ikawrakow
Owner

As a general comment: the server compile time is slowly starting to get out of hand, so we should think about breaking it up into pieces. You could kick this off by having the prompt cache in separate .h and .cpp files (and not having the implementation be in the .h file).

But this can also be done later. Leave it up to you.

```cpp
//
// other common utils
//
static float get_slot_similarity(size_t lcp, size_t prompt_length, size_t cache_length) {
```
@ikawrakow
Owner

So, basically the Dice coefficient

@firecoperana
Collaborator Author

> As a general comment: the server compile time is slowly starting to get out of hand, so we should think about breaking it up into pieces. You could kick this off by having the prompt cache in separate .h and .cpp files (and not having the implementation be in the .h file).
>
> But this can also be done later. Leave it up to you.

Yeah, best to do the refactoring of the entire server code in another PR.

@ikawrakow
Owner

Ready to merge?

@firecoperana
Collaborator Author

Should be ok. Just pushed a new commit. Tested fine for me.
