
Conversation

ggerganov
Member

@ggerganov ggerganov commented Oct 2, 2025

target #16440
rel #16117

Initial version of automatic memory offloading to host memory, using extended logic for minimizing prompt reprocessing. The host-memory prompt cache acts as a set of "extra slots" against which we can compute prefix similarity and decide to hot-swap a cached prompt into the llama_context if doing so would reduce processing (a rough sketch of this selection idea follows the list below). The cache is stored in regular RAM.

The amount of RAM used for caching prompts is bounded by two limits:

  • Max size in bytes (controlled with new --cache-ram, -cram CLI arg)
  • Max number of cached tokens (by default, equal to --context-size)
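
Conceptually, the selection logic can be pictured with the hedged sketch below. This is illustrative only, not the actual server implementation: `cached_prompt`, `common_prefix_len`, and `pick_best` are made-up names.

```cpp
// Illustrative sketch of the prompt-cache selection idea -- NOT the actual
// llama.cpp server code. All names here are hypothetical.
#include <cstddef>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// a prompt whose context state has been offloaded to host RAM
struct cached_prompt {
    std::vector<llama_token> tokens;     // the cached prompt tokens
    size_t                   size_bytes; // size of the saved state in bytes
};

// length of the shared token prefix of two prompts
static size_t common_prefix_len(const std::vector<llama_token> & a, const std::vector<llama_token> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        n++;
    }
    return n;
}

// pick the cached prompt that shares the longest prefix with the new task's prompt;
// the server would hot-swap its state into the llama_context only if this beats
// the prefix that is already present in the active slot
static const cached_prompt * pick_best(const std::vector<cached_prompt> & cache,
                                       const std::vector<llama_token>   & task_prompt,
                                       size_t                             cur_prefix) {
    const cached_prompt * best        = nullptr;
    size_t                best_prefix = cur_prefix;
    for (const auto & entry : cache) {
        const size_t p = common_prefix_len(entry.tokens, task_prompt);
        if (p > best_prefix) {
            best_prefix = p;
            best        = &entry;
        }
    }
    return best;
}
```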

The server logs provide detailed prompt cache information each time the cache is updated:

[screenshot: prompt cache information in the server logs]
  • A small QoL improvement is that update_slots() now also logs the old and new prompt for each task around n_past (up to 10 tokens), so we can better understand what caused the particular choice of n_past for the new task.

  • Setting the LLAMA_SERVER_SLOTS_DEBUG=1 env var makes the /slots endpoint return more detailed output, including the prompt and the generated text of the current or last task. This is useful for debugging (a minimal sketch of the env-var gating is shown below).
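
A minimal sketch of reading such an environment switch, assuming the usual getenv pattern; this is not the actual server.cpp code:

```cpp
// Hypothetical illustration of the env-var gate -- not the actual server code.
#include <cstdlib>
#include <cstring>

static bool slots_debug_enabled() {
    const char * v = std::getenv("LLAMA_SERVER_SLOTS_DEBUG");
    // treat unset, empty, or "0" as disabled
    return v != nullptr && *v != '\0' && std::strcmp(v, "0") != 0;
}
```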

Note: the mtmd workarounds are starting to cause headaches. For example, server_tokens is not copyable, which complicates the cache logic and makes the prompt caching feature incompatible with mtmd.

Usage

# use 8192 MiB of host RAM for caching prompts
llama-server ... --cache-ram 8192

# use as much host RAM as is available (i.e. no limit)
llama-server ... -cram -1

# disable prompt caching in RAM
llama-server ... -cram 0

Server refactor

  • Replace server_slot members with a single server_task
  • Remove server_slot.n_predict
  • Remove prompt truncation logic (obsolete and not useful anymore)
  • slot.task is now a const pointer to reflect that the task parameters should not change after the task is passed to the slot
  • Bump default context checkpoints from 3 to 8

TODOs

  • Set memory limit for the host-memory cache from CLI
  • Clean-up implementation
  • Test with agentic workflows
  • Multi-slot tests
  • Fix progress report

@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch 2 times, most recently from 0787f03 to 5c0cec4 on October 3, 2025 18:49
@tommarques56

This comment was marked as spam.

@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 5c0cec4 to 1440ec5 on October 7, 2025 07:40
@ggerganov ggerganov changed the base branch from master to gg/server-checkpoints-improve October 7, 2025 07:41
@github-actions github-actions bot added the python python script changes label Oct 7, 2025
@ggerganov ggerganov mentioned this pull request Oct 7, 2025
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 9de8392 to cf7dd4b on October 7, 2025 15:09
@ggerganov
Member Author

Looking for some feedback on how this new logic performs in different use cases. I've been testing it with the llama.vscode agent and it significantly improves the experience, since we can now use a single server slot without thrashing the prompt cache.

The current implementation should work with any model (dense, MoE, SWA, SSM, etc.). I think the default settings should be good for most use cases, though we'll probably add some options to adjust cache limits if needed.

Pay attention to these new messages in the logs:

[screenshot: new prompt cache messages in the server logs]

Interested in testing agentic use cases, such as Claude Code and similar, where we have a single large context with various auxiliary calls (keyword extraction, summarization, etc.) interleaved. The expectation is that prompt reprocessing should be significantly reduced in such cases.

Base automatically changed from gg/server-checkpoints-improve to master October 8, 2025 07:57
@ggerganov ggerganov force-pushed the gg/prompt-cache-ext branch from 65e8991 to 264d2c3 on October 8, 2025 08:24
@ggerganov ggerganov marked this pull request as ready for review October 8, 2025 12:53
@ggerganov ggerganov requested a review from ngxson as a code owner October 8, 2025 12:53
@ggerganov
Member Author

I've been testing this with Claude Code and Codex and haven't spotted any issues. After a few more rounds of testing today, planning to merge.

ggerganov and others added 2 commits October 9, 2025 16:18
* server : add option to debug the slot contents

* Update tools/server/server.cpp

---------

Co-authored-by: Xuan-Son Nguyen <[email protected]>
@ggerganov ggerganov merged commit d00cbea into master Oct 9, 2025
70 of 71 checks passed
@ggerganov ggerganov deleted the gg/prompt-cache-ext branch October 9, 2025 15:54
anyshu pushed a commit to anyshu/llama.cpp that referenced this pull request Oct 10, 2025
* master: (113 commits)
  webui: updated the chat service to only include max_tokens in the req… (ggml-org#16489)
  cpu : optimize the ggml NORM operation (ggml-org#15953)
  server : host-memory prompt caching (ggml-org#16391)
  No markdown in cot (ggml-org#16483)
  model-conversion : add support for SentenceTransformers (ggml-org#16387)
  ci: add ARM64 Kleidiai build and test support (ggml-org#16462)
  CANN: Improve ACL graph matching (ggml-org#16166)
  kleidiai: kernel interface refactoring (ggml-org#16460)
  [SYCL] refactor soft_max, add soft_max_back (ggml-org#16472)
  model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (ggml-org#16367)
  refactor: centralize CoT parsing in backend for streaming mode (ggml-org#16394)
  Disable CUDA host buffers on integrated GPUs (ggml-org#16308)
  server : fix cancel pending task (ggml-org#16467)
  metal : mark FA blocks (ggml-org#16372)
  server : improve context checkpoint logic (ggml-org#16440)
  ggml webgpu: profiling, CI updates, reworking of command submission (ggml-org#16452)
  llama : support LiquidAI LFM2-MoE hybrid model (ggml-org#16464)
  server : add `/v1/health` endpoint (ggml-org#16461)
  webui : added download action (ggml-org#13552) (ggml-org#16282)
  presets : fix pooling param for embedding models (ggml-org#16455)
  ...
@jukofyork
Collaborator

jukofyork commented Oct 10, 2025

Could this PR cause any problems with RPC (or RPC with speculative decoding)?

I've been testing glm-4.5 using RPC with speculative decoding, and had it hang a few times right after clearing the context but with a common system message that seems to be getting saved.

I can't do any more tests until next week, but it happened several times. The last time, I just cleared the context apart from the system message and then said "hi", and this seemed to trigger the hang after prompt processing but before token generation.

@ddh0
Contributor

ddh0 commented Oct 12, 2025

Hi, since this was merged, is cache_reuse obsolete now?

@ggerganov
Member Author

> Could this PR cause any problems with RPC (or RPC with speculative decoding)?

Doubt it. I tested speculative decoding with Gemma models before merging and it was working as expected.

> Hi, since this was merged, is cache_reuse obsolete now?

No, cache_reuse is still useful. For example you still need it for the advanced FIM used in https://github.com/ggml-org/llama.vim.

@AesSedai

I've been using this for some smaller, agent-like workflows where I'll have a few longer prompts and lots of shorter ones mixed in between and thus far it's been working like a treat. This PR has made me a very happy dev and has saved me probably hours of prompt processing already.

@ggerganov
Member Author

Yeah, it's quite a game changer for agentic workflows and tools such as Claude Code and llama.vscode. Thanks for the feedback.

@AesSedai

For just a bit more information, I've got basically a MoE-maxxing server with a couple of 3090s and 768GB of 12-channel DDR5, so my local model of choice tends to be R1 0528 or, more recently, GLM-4.6. Both fit comfortably with mixed inference offloading, but I hover around 150-200 tk/s PP and 8-14 tk/s TG depending on context fill level. Since I can only fit so many layers in VRAM, I usually end up cranking the context up to 65536+ instead; I've still got gigs of RAM left over, so I've been setting --cache-ram 65536.

So this capability to offload the already-processed prompt in a sort of pause/resume fashion means I no longer have to compromise with, e.g., --parallel 2 to get an "agentic slot", and I can swap between tasks in Cline / Roo Code without thrashing the cache in its entirety on long contexts.

IMO, this is a really huge feature that'll open up a lot of new local workflows for llama.cpp users!

@jukofyork
Collaborator

> Could this PR cause any problems with RPC (or RPC with speculative decoding)?
>
> Doubt it. I tested speculative decoding with Gemma models before merging and it was working as expected.

I wonder if what I thought were "hangs" were actually just the cache getting resent through the network from the RPC main host - I will try and retest next week.

@jukofyork
Collaborator

> > Could this PR cause any problems with RPC (or RPC with speculative decoding)?
> >
> > Doubt it. I tested speculative decoding with Gemma models before merging and it was working as expected.
>
> I wonder if what I thought were "hangs" were actually just the cache getting resent through the network from the RPC main host - I will try and retest next week.

I haven't really managed to get to the bottom of it, but don't think it's related to this PR at least - it seems to be something related to the RPC --cache option, but also found glm-4.6 doesn't seem to like large batch sizes (possibly related to the recent CUDA backend changes to GQA flash-attention kernels).

@rgerganov
Collaborator

> I haven't really managed to get to the bottom of it, but don't think it's related to this PR at least - it seems to be something related to the RPC --cache option, but also found glm-4.6 doesn't seem to like large batch sizes (possibly related to the recent CUDA backend changes to GQA flash-attention kernels).

You can set GGML_RPC_DEBUG=1 and start rpc-server to see the commands being received and executed. If you find any issues, please file a bug.

@jukofyork
Collaborator

> > I haven't really managed to get to the bottom of it, but don't think it's related to this PR at least - it seems to be something related to the RPC --cache option, but also found glm-4.6 doesn't seem to like large batch sizes (possibly related to the recent CUDA backend changes to GQA flash-attention kernels).
>
> You can set GGML_RPC_DEBUG=1 and start rpc-server to see the commands being received and executed. If you find any issues, please file a bug.

Yeah, I will try and see if I can figure it out.

@AesSedai

Small follow-up after using this some more: I think there should be a change to how n_ctx is used here:

prompt_cache = std::make_unique<server_prompt_cache>(params_base.cache_ram_mib, n_ctx);

Specifically because I'm seeing the following in my logs:

106.34.711.475 W srv  get_availabl: updating prompt cache
106.34.711.736 W srv   prompt_save:  - saving prompt with length 7745, total state size = 2783.450 MiB
106.36.246.614 W srv          load:  - looking for better prompt, base f_keep = 0.001, sim = 0.001
106.36.246.637 W srv        update:  - cache token limit reached, removing oldest entry (size = 4769.420 MiB)
106.36.382.409 W srv        update:  - cache state: 10 prompts, 18965.563 MiB (limits: 65536.000 MiB, 65536 tokens)
106.36.382.422 W srv        update:    - prompt 0x46352d50:   13469 tokens, checkpoints:  0,  4840.578 MiB
106.36.382.422 W srv        update:    - prompt 0x4eef33d0:   14177 tokens, checkpoints:  0,  5095.024 MiB
106.36.382.423 W srv        update:    - prompt 0x46346d90:    2861 tokens, checkpoints:  0,  1028.207 MiB
106.36.382.423 W srv        update:    - prompt 0x4634f8a0:    1948 tokens, checkpoints:  0,   700.087 MiB
106.36.382.425 W srv        update:    - prompt 0x4eedacf0:    2784 tokens, checkpoints:  0,  1000.534 MiB
106.36.382.426 W srv        update:    - prompt 0x4faa65d0:    1814 tokens, checkpoints:  0,   651.929 MiB
106.36.382.426 W srv        update:    - prompt 0x46377f80:     791 tokens, checkpoints:  0,   284.277 MiB
106.36.382.427 W srv        update:    - prompt 0x46340a00:     562 tokens, checkpoints:  0,   201.977 MiB
106.36.382.428 W srv        update:    - prompt 0x4fcc79a0:    6621 tokens, checkpoints:  0,  2379.500 MiB
106.36.382.428 W srv        update:    - prompt 0x4eee2480:    7745 tokens, checkpoints:  0,  2783.450 MiB
106.36.382.432 W srv  get_availabl: prompt cache update took 1670.96 ms

As mentioned before, I've set --cache-ram 65536 and that's reflected correctly, but the context size I'm configuring (65536 tokens) leads to prompts being evicted from the cache before it actually uses all of the space I've allocated, because the sum of cached prompt tokens exceeds n_ctx.

Not being too familiar with the llama.cpp codebase, I'm not sure if there's a way to do something like the following (pseudocode):

cache_token_count = max(n_ctx, params_base.cache_ram_mib / size_per_token);
prompt_cache = std::make_unique<server_prompt_cache>(params_base.cache_ram_mib, cache_token_count);

or the like, because realistically I'm only using ~19,000MiB of the 65,536MiB I've configured.
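
A unit-consistent variant of that suggestion could look like the hedged sketch below. This is hypothetical: `cache_token_limit` and `state_size_per_token` are not part of llama.cpp, and the average per-token state size would have to come from the model/context.

```cpp
// Hypothetical helper -- not actual llama.cpp code. Derives a token limit for the
// prompt cache from the --cache-ram budget instead of hard-coding it to n_ctx.
#include <algorithm>
#include <cstddef>
#include <cstdint>

static size_t cache_token_limit(int64_t cache_ram_mib, size_t n_ctx, size_t state_size_per_token) {
    if (cache_ram_mib < 0) {
        return SIZE_MAX; // -cram -1: no byte limit, so no derived token limit either
    }
    if (state_size_per_token == 0) {
        return n_ctx; // fall back to the current behaviour
    }
    const size_t budget_bytes = (size_t) cache_ram_mib * 1024u * 1024u;
    // never go below the context size, but allow more cached tokens if the RAM budget permits
    return std::max(n_ctx, budget_bytes / state_size_per_token);
}
```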


Labels

examples, python (python script changes), server



6 participants