server: preserve context checkpoint coverage #22826

Open
jacekpoplawski wants to merge 1 commit into ggml-org:master from jacekpoplawski:checkpoint-coverage

Conversation

@jacekpoplawski
Contributor

Instead of always removing the oldest context checkpoint when the checkpoint limit is reached, remove the checkpoint that appears most redundant based on the distance between its neighbors.

Overview

This is my attempt to fix the "forcing full prompt re-processing due to lack of cache data" case.

This changes the checkpoint removal policy: when the limit is reached, it removes an interior checkpoint whose neighboring checkpoints are closest together.
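For readers who want the policy spelled out, here is a minimal sketch of the selection step. It is not the code from this PR: the `checkpoint` struct and `pick_eviction_index` are hypothetical names, and it assumes the list is sorted by ascending position and that the first and last entries are never candidates. With fewer than three checkpoints it falls back to removing the oldest one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a server checkpoint; only the token position matters here.
struct checkpoint {
    int pos; // token position covered by this checkpoint
};

// Returns the index of the checkpoint to remove from a list sorted by ascending pos.
// An interior checkpoint is considered most redundant when its two neighbors are
// closest together, because removing it leaves the smallest hole in coverage.
std::size_t pick_eviction_index(const std::vector<checkpoint> & cps) {
    assert(!cps.empty());
    if (cps.size() < 3) {
        return 0; // no interior candidate: fall back to the oldest entry
    }
    std::size_t best     = 1;
    int         best_gap = cps[2].pos - cps[0].pos;
    for (std::size_t i = 2; i + 1 < cps.size(); ++i) {
        const int gap = cps[i + 1].pos - cps[i - 1].pos;
        if (gap < best_gap) {
            best_gap = gap;
            best     = i;
        }
    }
    return best;
}
```

For positions 0, 8192, 8300, 8400, 16384 the sketch picks the checkpoint at 8300, since its neighbors (8192 and 8400) are only 208 tokens apart.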

Additional information

I use the following arguments: --ctx-checkpoints 24 --checkpoint-every-n-tokens 8192 --cache-ram 65536

After just a few prompts in the pi coding agent, the server log shows:

slot launch_slot_: id  0 | task 6130 | processing task, is_child = 0
slot update_slots: id  0 | task 6130 | new prompt, n_ctx_slot = 200192, n_keep = 4096, task.n_tokens = 31544
slot update_slots: id  0 | task 6130 | n_past = 3579, slot.prompt.tokens.size() = 33081, seq_id = 0, pos_min = 33080, n_swa = 0
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32656, 32656] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32531, 32531] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32436, 32436] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32361, 32361] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [31837, 31837] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [31325, 31325] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30750, 30750] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30660, 30660] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30473, 30473] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30371, 30371] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30008, 30008] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [29496, 29496] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [29027, 29027] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [28942, 28942] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [28830, 28830] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [28278, 28278] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [27757, 27757] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [27188, 27188] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [26676, 26676] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [23367, 23367] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [23187, 23187] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [21341, 21341] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [20918, 20918] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [20479, 20479] against 3579...
slot update_slots: id  0 | task 6130 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

The server needed a checkpoint around n_past = 3579, but all 24 available checkpoints were much later (positions 20479 to 32656), which caused full prompt re-processing.
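The lookup behind those "Checking checkpoint" lines can be pictured roughly like this. This is a simplified sketch, not the server's actual condition: `checkpoint_range` and `find_usable_checkpoint` are hypothetical names, and the rule "usable if `pos_max <= n_past`" is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the [pos_min, pos_max] pair printed in the log.
struct checkpoint_range {
    int pos_min;
    int pos_max;
};

// Scans from newest to oldest (the order seen in the log) and returns the index of
// the first checkpoint that does not extend past n_past, or -1 if none qualifies.
// Assumption: a checkpoint is usable when pos_max <= n_past.
int find_usable_checkpoint(const std::vector<checkpoint_range> & cps, int n_past) {
    assert(n_past >= 0);
    for (std::size_t i = cps.size(); i-- > 0;) {
        if (cps[i].pos_max <= n_past) {
            return static_cast<int>(i);
        }
    }
    return -1; // nothing old enough: the caller has to re-process the whole prompt
}
```

With the oldest checkpoint at 20479 and n_past = 3579, the scan reaches the end of the list without a match, which corresponds to the fallback in the last log line.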

The root cause seems to be that checkpoints are not only created at the --checkpoint-every-n-tokens interval. Additional checkpoints can be created near prompt/request boundaries, and with the previous FIFO removal policy these dense recent checkpoints can erase older checkpoints.
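To make the difference concrete, here is a small simulation of both policies. It is only an illustration under stated assumptions: `policy` and `simulate` are hypothetical names, positions arrive in ascending order, and one checkpoint is removed just before an insert whenever the limit has been reached.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class policy { fifo, min_gap };

// Feeds checkpoint positions (ascending) into a list capped at `limit` entries and
// returns the positions that survive. Both policies here are simplified sketches.
std::vector<int> simulate(const std::vector<int> & positions, std::size_t limit, policy p) {
    assert(limit >= 1);
    std::vector<int> kept;
    for (int pos : positions) {
        if (kept.size() == limit) {
            std::size_t victim = 0; // FIFO: always drop the oldest entry
            if (p == policy::min_gap && kept.size() >= 3) {
                // Drop the interior entry whose neighbors are closest together.
                victim       = 1;
                int best_gap = kept[2] - kept[0];
                for (std::size_t i = 2; i + 1 < kept.size(); ++i) {
                    const int gap = kept[i + 1] - kept[i - 1];
                    if (gap < best_gap) {
                        best_gap = gap;
                        victim   = i;
                    }
                }
            }
            kept.erase(kept.begin() + static_cast<std::ptrdiff_t>(victim));
        }
        kept.push_back(pos);
    }
    return kept;
}
```

Feeding the positions 1000, 9000, 17000, 17100, 17200, 17300, 17400 with a limit of 4 leaves 17100 as the oldest survivor under FIFO, while the gap-based policy still holds 1000 and 9000. The dense cluster near the end only displaces its own members.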

I first tried disabling the additional checkpoint creation, but that did not work well.

I tested this change with --ctx-checkpoints 8 to trigger checkpoint removal sooner, and I could no longer reproduce the "forcing full prompt re-processing due to lack of cache data" message.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - initial research and final code polish

Instead of always removing the oldest context checkpoint,
remove the one that appears most redundant based on the distance between its neighbors.
@jacekpoplawski jacekpoplawski requested a review from a team as a code owner May 8, 2026 00:53
@ggml-gh-bot

ggml-gh-bot Bot commented May 8, 2026

Hi @jacekpoplawski, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@ggerganov
Member

The idea is OK, but it is still a "poor-man" solution. The most optimal way to do the checkpoints is to leverage the changes in #21885 and take into account the structure of the conversation.

@jacekpoplawski
Contributor Author

> The idea is OK, but it is still a "poor-man" solution. The most optimal way to do the checkpoints is to leverage the changes in #21885 and take into account the structure of the conversation.

If I understand correctly, #21885 would tell us where the important positions are, and checkpoint removal should prefer keeping checkpoints around those positions. Or do you mean that this information should be used when creating checkpoints instead?

@ggerganov
Member

> Or do you mean that this information should be used when creating checkpoints instead?

Yes, the information should be used for creating the checkpoints right before user inputs.

@jacekpoplawski
Contributor Author

> > Or do you mean that this information should be used when creating checkpoints instead?
>
> Yes, the information should be used for creating the checkpoints right before user inputs.

#22929

lalalune added a commit to elizaOS/llama.cpp that referenced this pull request May 15, 2026
Upstream PR ggml-org/llama.cpp#22826 changes context-checkpoint eviction
from FIFO (oldest first) to "evict the most redundant interior checkpoint"
based on the n_tokens gap between its neighbours. This preserves coverage
across the prompt history so a slot resuming near the start of a long
prompt does not have to re-process the full prefix.

Previously skipped because the upstream patch references a single
.data field on common_prompt_checkpoint that does not exist in our
struct (we carry data_tgt + data_dft for the dflash split). Bridged by
swapping the upstream `cur.data.size()` for our existing `old.size()`
helper at common/common.h:1042-1051 (sums tgt + dft), and renaming the
loop variable to `old` to match upstream phrasing.

Hot path edit lives in tools/server/server-context.cpp:2017-2051
(create_checkpoint). Behaviour is identical when checkpoints.size() < 3
(still evicts begin()), so the change is a no-op for tiny ckpt budgets
and only kicks in once at least 3 interior candidates exist — matching
upstream semantics.

Smoke test on eliza-1-0_8b-32k: model loads, generates coherent text,
prompt 140 t/s / gen 43 t/s (no regression vs baseline).

Refs: ggml-org/llama.cpp#22826
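
The bridging described in the commit message above can be pictured roughly as follows. This is a guess at the shape, not the fork's actual code: `checkpoint_split_sketch` is a hypothetical name, the byte-vector type is an assumption, and only the field names `data_tgt` and `data_dft` and the "sums tgt + dft" behaviour come from the message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reduced form of a checkpoint that stores two buffers instead of one.
struct checkpoint_split_sketch {
    std::vector<std::uint8_t> data_tgt; // target-model state (assumed to be raw bytes)
    std::vector<std::uint8_t> data_dft; // draft-model state (assumed to be raw bytes)

    // Stands in for a single `data.size()` call by summing both buffers.
    std::size_t size() const {
        return data_tgt.size() + data_dft.size();
    }
};
```

Code that only needs the total byte count can then call `size()` without knowing how the data is split.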