server: preserve context checkpoint coverage by jacekpoplawski · Pull Request #22826 · ggml-org/llama.cpp

jacekpoplawski · 2026-05-08T00:53:34Z

Instead of always removing the oldest context checkpoint when the checkpoint limit is reached, remove the checkpoint that appears most redundant based on the distance between its neighbors.

Overview

This is my attempt to fix the "forcing full prompt re-processing due to lack of cache data" issue.

This changes the checkpoint removal policy: when the limit is reached, it removes an interior checkpoint whose neighboring checkpoints are closest together.
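A minimal C++ sketch of the removal policy described above follows. The `checkpoint` struct and `pick_checkpoint_to_evict` function are hypothetical names, not the code in this PR. The sketch assumes checkpoints are kept sorted by ascending position. It also assumes that when there is no interior checkpoint (fewer than 3 entries), the oldest one is removed as before.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a context checkpoint: only its token position
// matters when choosing which one to evict.
struct checkpoint {
    int64_t pos;
};

// Returns the index of the checkpoint to evict. `cps` is assumed to be
// non-empty and sorted by ascending position.
std::size_t pick_checkpoint_to_evict(const std::vector<checkpoint> & cps) {
    // Fewer than 3 entries means no interior checkpoint exists,
    // so fall back to the old FIFO behaviour and evict the oldest.
    if (cps.size() < 3) {
        return 0;
    }
    std::size_t best     = 1;
    int64_t     best_gap = cps[2].pos - cps[0].pos;
    for (std::size_t i = 2; i + 1 < cps.size(); ++i) {
        // Removing cps[i] leaves a hole spanning its two neighbours;
        // the smallest such hole marks the most redundant checkpoint.
        const int64_t gap = cps[i + 1].pos - cps[i - 1].pos;
        if (gap < best_gap) {
            best     = i;
            best_gap = gap;
        }
    }
    return best;
}
```

For example, with checkpoints at 0, 8192, 16384, 16400, 16500 and 24576, removing 16400 leaves a gap of only 116 tokens (16384 to 16500), so that is the one evicted. The sparse early checkpoints survive.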

Additional information

I use the following arguments: --ctx-checkpoints 24 --checkpoint-every-n-tokens 8192 --cache-ram 65536

After just a few prompts in a pi coding agent, I see:

slot launch_slot_: id  0 | task 6130 | processing task, is_child = 0
slot update_slots: id  0 | task 6130 | new prompt, n_ctx_slot = 200192, n_keep = 4096, task.n_tokens = 31544
slot update_slots: id  0 | task 6130 | n_past = 3579, slot.prompt.tokens.size() = 33081, seq_id = 0, pos_min = 33080, n_swa = 0
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32656, 32656] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32531, 32531] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32436, 32436] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [32361, 32361] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [31837, 31837] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [31325, 31325] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30750, 30750] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30660, 30660] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30473, 30473] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30371, 30371] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [30008, 30008] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [29496, 29496] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [29027, 29027] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [28942, 28942] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [28830, 28830] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [28278, 28278] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [27757, 27757] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [27188, 27188] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [26676, 26676] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [23367, 23367] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [23187, 23187] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [21341, 21341] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [20918, 20918] against 3579...
slot update_slots: id  0 | task 6130 | Checking checkpoint with [20479, 20479] against 3579...
slot update_slots: id  0 | task 6130 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)

The server needed a checkpoint at or before n_past = 3579, but all available checkpoints were much later (from 20479 to 32656), which forced full prompt re-processing.
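The following simplified C++ sketch of the lookup makes the failure concrete. It assumes a checkpoint is usable only if its position is at or before n_past; the real server check may differ in detail. `find_usable_checkpoint` is a hypothetical helper. Positions are passed oldest to newest, whereas the log above prints them newest first.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Scans checkpoint positions (ordered oldest to newest) from the newest end
// and returns the first one at or before n_past. std::nullopt means no
// cached state is usable and the prompt has to be re-processed from scratch.
std::optional<int64_t> find_usable_checkpoint(const std::vector<int64_t> & positions, int64_t n_past) {
    for (auto it = positions.rbegin(); it != positions.rend(); ++it) {
        if (*it <= n_past) {
            return *it;
        }
    }
    return std::nullopt;
}
```

Given positions from the log (20479 through 32656) and n_past = 3579, the function finds nothing. This corresponds to the full re-processing shown above.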

The root cause seems to be that checkpoints are not only created at the --checkpoint-every-n-tokens interval. Additional checkpoints can be created near prompt/request boundaries, and under the previous FIFO removal policy these dense, recent checkpoints can evict the older ones.
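A toy simulation of the FIFO policy illustrates the effect. It assumes a checkpoint limit of 8, with three regular checkpoints at 8192-token intervals followed by a burst of closely spaced checkpoints near the end of the prompt. `simulate_fifo` is a hypothetical helper, not server code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Checkpoints are created in increasing position order. Whenever the
// limit is exceeded, the oldest one is dropped (FIFO eviction).
std::deque<int64_t> simulate_fifo(const std::vector<int64_t> & created, std::size_t limit) {
    std::deque<int64_t> kept;
    for (int64_t pos : created) {
        kept.push_back(pos);
        if (kept.size() > limit) {
            kept.pop_front(); // FIFO: the oldest checkpoint always goes first
        }
    }
    return kept;
}
```

Consider the positions 8192, 16384 and 24576, followed by 30000 to 30600 in steps of 100. The two earliest regular checkpoints are evicted, so the oldest surviving checkpoint is 24576, even though seven of the eight kept checkpoints fall within a 600-token window.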

I first tried disabling the additional checkpoint creation, but that did not work well.

I tested this change with --ctx-checkpoints 8 to trigger checkpoint removal sooner, and I could not reproduce the "forcing full prompt re-processing due to lack of cache data" message.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - initial research and final code polish


ggml-gh-bot · 2026-05-08T00:57:50Z

Hi @jacekpoplawski, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

ggerganov · 2026-05-08T03:47:58Z

The idea is OK, but it is still a "poor-man" solution. The most optimal way to do the checkpoints is to leverage the changes in #21885 and take into account the structure of the conversation.

jacekpoplawski · 2026-05-08T05:19:40Z

> The idea is OK, but it is still a "poor-man" solution. The most optimal way to do the checkpoints is to leverage the changes in #21885 and take into account the structure of the conversation.

If I understand correctly, #21885 would tell us where the important positions are, and checkpoint removal should prefer keeping checkpoints around those positions. Or do you mean that this information should be used when creating checkpoints instead?

ggerganov · 2026-05-08T06:11:31Z

> Or do you mean that this information should be used when creating checkpoints instead?

Yes, the information should be used for creating the checkpoints right before user inputs.

jacekpoplawski · 2026-05-11T01:36:41Z

> > Or do you mean that this information should be used when creating checkpoints instead?
>
> Yes, the information should be used for creating the checkpoints right before user inputs.

#22929

Upstream PR ggml-org/llama.cpp#22826 changes context-checkpoint eviction from FIFO (oldest first) to "evict the most redundant interior checkpoint", based on the n_tokens gap between its neighbours. This preserves coverage across the prompt history, so a slot resuming near the start of a long prompt does not have to re-process the full prefix.

The patch was previously skipped because it references a single .data field on common_prompt_checkpoint, which does not exist in our struct (we carry data_tgt + data_dft for the dflash split). It is bridged by swapping the upstream `cur.data.size()` for our existing `old.size()` helper at common/common.h:1042-1051 (which sums tgt + dft), and by renaming the loop variable to `old` to match upstream phrasing. The hot-path edit lives in tools/server/server-context.cpp:2017-2051 (create_checkpoint).

Behaviour is identical when checkpoints.size() < 3 (it still evicts begin()), so the change is a no-op for tiny checkpoint budgets. It only takes effect once at least 3 checkpoints exist, i.e. once there is at least one interior candidate. This matches upstream semantics.

Smoke test on eliza-1-0_8b-32k: the model loads and generates coherent text; prompt 140 t/s / gen 43 t/s (no regression vs baseline).

Refs: ggml-org/llama.cpp#22826
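Below is a hypothetical sketch of the bridging helper described above. Only the names common_prompt_checkpoint, data_tgt, data_dft and size() come from the text. The other fields and all types are assumptions for illustration, not the actual definition in common/common.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the fork's checkpoint struct: the saved state is
// split into a target-model buffer and a draft-model buffer.
struct common_prompt_checkpoint {
    int64_t pos_min  = 0; // assumed field
    int64_t pos_max  = 0; // assumed field
    int64_t n_tokens = 0; // assumed field

    std::vector<uint8_t> data_tgt; // target-model state
    std::vector<uint8_t> data_dft; // draft-model state

    // Total bytes held by this checkpoint. It stands in for upstream's
    // single data.size() call by summing both buffers.
    std::size_t size() const {
        return data_tgt.size() + data_dft.size();
    }
};
```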

server: preserve context checkpoint coverage

baae345

Instead of always removing the oldest context checkpoint, remove the one that appears most redundant based on the distance between its neighbors.

jacekpoplawski requested a review from a team as a code owner May 8, 2026 00:53

github-actions added the examples and server labels May 8, 2026

This was referenced May 8, 2026

Eval bug: Qwen 3.6 27B forcing full prompt re-processing due to lack of cache data #22746

Open

server: fix checkpoints creation #22929

Merged

jacekpoplawski mentioned this pull request May 14, 2026

Prompt cache is not reused for repeated identical chat/completions request with Qwen3.6-35B-A3B #23030

Open

jacekpoplawski mentioned this pull request Jun 2, 2026

server: enhance FIFO prompt cache eviction with second-chance algorithm #23666

Open


jacekpoplawski wants to merge 1 commit into ggml-org:master from jacekpoplawski:checkpoint-coverage


2 participants
