server : host-memory prompt caching #16391
Conversation
Force-pushed from 0787f03 to 5c0cec4.
Force-pushed from 5c0cec4 to 1440ec5.
Force-pushed from 9de8392 to cf7dd4b.
Force-pushed from 65e8991 to 264d2c3.
I've been testing this with Claude Code and Codex and haven't spotted any issues. After a few more rounds of testing today, planning to merge.
* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...
Could this PR cause any problems with RPC (or RPC with speculative decoding)? I've been testing, and while I can't do any more tests until next week now, it happened several times; the last time, I just cleared the context apart from the system message and then said "hi", and this seemed to trigger it to hang after prompt processing but before token generation.
Hi, since this was merged, is
Doubt it. I tested speculative decoding with Gemma models before merging and it was working as expected.
No,
I've been using this for some smaller, agent-like workflows where I'll have a few longer prompts and lots of shorter ones mixed in between, and thus far it's been working like a treat. This PR has made me a very happy dev and has saved me probably hours of prompt processing already.
Yeah, it's quite a game changer for agentic workflows and tools such as Claude Code and
For just a bit more information: I've got basically a MoE-maxxing server with a couple of 3090s and 768GB of 12-channel DDR5, so my local model of choice tends to be R1 0528 or, more recently, GLM-4.6. Both fit comfortably with mixed inference offloading, but I hover around 150-200 tk/s PP and 8-14 tk/s TG depending on context fill level. Since I can only fit so many layers in VRAM, I usually end up cranking the context up to 65536+ instead, and I've still got gigs of RAM left over, and I've been setting

So this capability to offload the already-processed prompt in a sort of pause/resume fashion means I don't have to compromise with, e.g.,

IMO, this is a really huge feature that'll open up a lot of new local workflows for llama.cpp users!
I wonder if what I thought were "hangs" were actually just the cache getting resent through the network from the RPC main host. I will try to retest next week.
I haven't really managed to get to the bottom of it, but I don't think it's related to this PR at least; it seems to be something related to the RPC
You can set
Yeah, I will try and see if I can figure it out.
Small follow-up after using this some more: I think that there should be a change in how
Specifically because I'm seeing the following in my logs:
As mentioned before, I've set

Not being too familiar with the llama.cpp codebase, I'm not sure if there's a way to do something similar to (pseudocode)

or the like, because realistically I'm only using ~19,000MiB of the 65,536MiB I've configured.
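A rough sketch of the kind of usage-aware cap being described, with hypothetical names throughout (none of this is existing llama.cpp code):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch only: cap the prompt-cache budget by what is actually in
// use plus some headroom, instead of always reserving the full configured
// --cache-ram amount. Names and structure are illustrative, not llama.cpp API.
struct cache_budget {
    int64_t configured_bytes; // e.g. the --cache-ram value converted to bytes
    int64_t used_bytes;       // bytes currently held by cached prompts
    int64_t headroom_bytes;   // growth allowance before the cache is trimmed
};

static int64_t effective_cache_limit(const cache_budget & b) {
    // never exceed the configured cap, but don't account for RAM that the
    // cache is not actually using yet
    return std::min(b.configured_bytes, b.used_bytes + b.headroom_bytes);
}
```

Something along those lines would let the ~19,000MiB actually in use, rather than the full configured 65,536MiB, drive the accounting.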
target #16440
rel #16117

Initial version of automatic memory offloading to host memory, using an extended logic for minimizing prompt reprocessing. The host-memory prompt cache acts as "extra slots" with which we can calculate prefix similarity and decide to hot-swap them into the `llama_context` if it would reduce the processing. The cache is stored in regular RAM.

The RAM size that is used for caching prompts has 2 limits:

- the `--cache-ram, -cram` CLI arg
- the context size (`--context-size`)

The server logs provide detailed prompt cache information each time the cache is updated.
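To make the hot-swap decision above concrete, here is a rough, self-contained sketch of picking a cached prompt by longest common prefix; the names are illustrative and this is not the actual server code:

```cpp
#include <cstddef>
#include <vector>

using tokens = std::vector<int>; // token IDs; the real code uses llama.cpp token types

// length of the common prefix between two token sequences
static size_t common_prefix(const tokens & a, const tokens & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// pick the cached prompt (an "extra slot") that shares the longest prefix with
// the new prompt; it is only worth restoring if it beats the prefix that the
// active slot already has in its context
static int best_cache_entry(const std::vector<tokens> & cache, const tokens & prompt, size_t n_active_prefix) {
    int    best   = -1;
    size_t best_n = n_active_prefix;
    for (size_t i = 0; i < cache.size(); i++) {
        const size_t n = common_prefix(cache[i], prompt);
        if (n > best_n) {
            best_n = n;
            best   = (int) i;
        }
    }
    return best; // -1 means: keep the current context, a swap would not help
}
```

An entry is only worth swapping into the `llama_context` when its shared prefix beats what the active slot already has, since otherwise the swap would not reduce processing.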
A small QoL improvement is that `update_slots()` now also logs the old and new prompt for each task around `n_past` (up to 10 tokens), so we can have a better understanding of what caused the particular choice of the `n_past` value for the new task.
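For illustration, a standalone sketch of printing such a window around `n_past`, with raw token IDs standing in for the detokenized text (not the server's actual logging code):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// print up to `radius` token IDs on each side of n_past, with a '|' marking
// where the reused prefix ends; purely illustrative, not the server's format
static void log_prompt_window(const char * tag, const std::vector<int> & toks, int n_past, int radius = 10) {
    const int lo = std::max(0, n_past - radius);
    const int hi = std::min((int) toks.size(), n_past + radius);

    std::printf("%s (n_past = %d):", tag, n_past);
    for (int i = lo; i < hi; i++) {
        if (i == n_past) {
            std::printf(" |");
        }
        std::printf(" %d", toks[i]);
    }
    std::printf("\n");
}
```

Calling it for both the old and the new prompt makes it easy to see why a particular `n_past` was chosen.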
Setting the `LLAMA_SERVER_SLOTS_DEBUG=1` env will make the `/slots` endpoint return a more detailed output containing the prompt and the generated text of the current or last task. This is useful for debugging purposes.
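A minimal sketch, assuming the flag is simply read from the environment (the real wiring in `tools/server/server.cpp` may differ):

```cpp
#include <cstdlib>

// sketch: decide whether /slots should include the verbose per-slot dump
// based on the LLAMA_SERVER_SLOTS_DEBUG environment variable
static bool slots_debug_enabled() {
    const char * v = std::getenv("LLAMA_SERVER_SLOTS_DEBUG");
    return v != nullptr && std::atoi(v) != 0;
}
```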
Note: mtmd workarounds are starting to cause some headaches. For example, `server_tokens` is not copyable, which complicates the cache logic and makes the prompt caching feature incompatible with mtmd.

Usage
Server refactor

- Replaced the `server_slot` members with a single `server_task`
- Removed `server_slot.n_predict`
- `slot.task` is now a `const ptr` to reflect that the task parameters should not change when it is passed to the slot (see the sketch below)
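A minimal sketch of the shape of that last change (illustrative only; the real `server_slot` carries much more state and the ownership details may differ):

```cpp
// sketch only: the slot keeps the whole task behind a pointer-to-const instead
// of copying individual fields, so its parameters cannot be modified via the slot
struct server_task {
    // request parameters live here (prompt, sampling settings, ...)
};

struct server_slot {
    const server_task * task = nullptr; // assumption: lifetime is managed elsewhere
};
```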
TODOs