server: preserve context checkpoint coverage #22826
Instead of always removing the oldest context checkpoint, remove the one that appears most redundant based on the distance between its neighbors.
Hi @jacekpoplawski, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
The idea is OK, but it is still a "poor man's" solution. The optimal way to do the checkpoints is to leverage the changes in #21885 and take into account the structure of the conversation.
If I understand correctly, #21885 would tell us where the important positions are, and checkpoint removal should prefer keeping checkpoints around those positions. Or do you mean that this information should be used when creating checkpoints instead?
Yes, the information should be used for creating the checkpoints right before user inputs.
Upstream PR ggml-org/llama.cpp#22826 changes context-checkpoint eviction from FIFO (oldest first) to "evict the most redundant interior checkpoint", based on the n_tokens gap between its neighbours. This preserves coverage across the prompt history, so a slot resuming near the start of a long prompt does not have to re-process the full prefix.
Previously skipped because the upstream patch references a single .data field on common_prompt_checkpoint that does not exist in our struct (we carry data_tgt + data_dft for the dflash split). Bridged by swapping the upstream `cur.data.size()` for our existing `old.size()` helper at common/common.h:1042-1051 (sums tgt + dft), and renaming the loop variable to `old` to match upstream phrasing. Hot path edit lives in tools/server/server-context.cpp:2017-2051 (create_checkpoint).
Behaviour is identical when checkpoints.size() < 3 (still evicts begin()), so the change is a no-op for tiny ckpt budgets and only kicks in once at least 3 checkpoints exist, that is, once there is at least one interior candidate, matching upstream semantics.
Smoke test on eliza-1-0_8b-32k: model loads, generates coherent text, prompt 140 t/s / gen 43 t/s (no regression vs baseline).
Refs: ggml-org/llama.cpp#22826
Instead of always removing the oldest context checkpoint when the checkpoint limit is reached, remove the checkpoint that appears most redundant based on the distance between its neighbors.
Overview
This is my attempt to fix `forcing full prompt re-processing due to lack of cache data`.
This changes the checkpoint removal policy: when the limit is reached, it removes an interior checkpoint whose neighboring checkpoints are closest together.
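The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the `checkpoint` struct, its `n_tokens` field, and the `pick_eviction_index` helper are hypothetical names, the list is assumed to be sorted by position, and ties are assumed to go to the earliest candidate. The fallback to the oldest checkpoint when fewer than three exist follows the commit note quoted in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a context checkpoint; only the position matters here.
struct checkpoint {
    int64_t n_tokens; // prompt position covered by the checkpoint (assumed field name)
};

// Returns the index of the checkpoint to evict from a list sorted by n_tokens.
// Each interior checkpoint is scored by the distance between its two neighbours;
// the smallest distance marks the checkpoint whose removal loses the least coverage.
size_t pick_eviction_index(const std::vector<checkpoint> & cps) {
    assert(!cps.empty());
    if (cps.size() < 3) {
        return 0; // no interior candidate: evict the oldest, as FIFO would
    }
    size_t  best     = 1;
    int64_t best_gap = cps[2].n_tokens - cps[0].n_tokens;
    for (size_t i = 2; i + 1 < cps.size(); ++i) {
        const int64_t gap = cps[i + 1].n_tokens - cps[i - 1].n_tokens;
        if (gap < best_gap) { // strict comparison: ties keep the earliest candidate
            best_gap = gap;
            best     = i;
        }
    }
    return best;
}
```

Because only interior entries are candidates once three or more checkpoints exist, the first and last checkpoints always survive an eviction in this sketch, which is what keeps early prompt positions covered.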
Additional information
I use the following arguments:
`--ctx-checkpoints 24 --checkpoint-every-n-tokens 8192 --cache-ram 65536`
After just a few prompts in a pi coding agent, I see that the server needed a checkpoint around `n_past = 3579`, but all available checkpoints were much later, from `20479` to `32656`, causing full prompt re-processing.
The root cause seems to be that checkpoints are not only created at the `--checkpoint-every-n-tokens` interval. Additional checkpoints can be created near prompt/request boundaries, and with the previous FIFO removal policy these dense recent checkpoints can erase older checkpoints.
I first tried disabling the additional checkpoint creation, but that did not work well.
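The failure mode described above can be reproduced in a toy model that compares the two removal policies. Everything here is an assumption made for illustration: the `policy` enum, the `simulate` and `can_resume` helpers, and the rule that a checkpoint is usable when it lies at or before the resume position are simplifications, not the server's actual logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class policy { fifo, least_coverage_loss };

// Index to evict from a list of checkpoint positions sorted in ascending order.
size_t evict_index(const std::vector<int64_t> & pos, policy p) {
    assert(!pos.empty());
    if (p == policy::fifo || pos.size() < 3) {
        return 0; // oldest first
    }
    size_t best = 1;
    for (size_t i = 2; i + 1 < pos.size(); ++i) {
        // compare the neighbour-to-neighbour distance of candidate i with the best so far
        if (pos[i + 1] - pos[i - 1] < pos[best + 1] - pos[best - 1]) {
            best = i;
        }
    }
    return best;
}

// Adds checkpoints in creation order, evicting one whenever the limit is already reached.
std::vector<int64_t> simulate(const std::vector<int64_t> & created, size_t limit, policy p) {
    std::vector<int64_t> kept;
    for (int64_t n_tokens : created) {
        while (!kept.empty() && kept.size() >= limit) {
            kept.erase(kept.begin() + static_cast<std::ptrdiff_t>(evict_index(kept, p)));
        }
        kept.push_back(n_tokens);
    }
    return kept;
}

// Simplified usability check: some checkpoint lies at or before the resume position.
bool can_resume(const std::vector<int64_t> & kept, int64_t n_past) {
    for (int64_t n_tokens : kept) {
        if (n_tokens <= n_past) {
            return true;
        }
    }
    return false;
}
```

As a usage example, take the made-up creation order 3000, 8192, 16384, 20479, 21500, 22700, 24576, 26000, 27800, 29500, 32656 with a limit of 8. Under `policy::fifo` the kept list runs from 20479 to 32656, so `can_resume(kept, 3579)` is false. Under `policy::least_coverage_loss` the checkpoint at 3000 survives and the same call returns true.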
I tested this change with `--ctx-checkpoints 8` to trigger checkpoint removal sooner, and I could not reproduce the `forcing full prompt re-processing due to lack of cache data` message.
Requirements