server: avoid unnecessary checkpoint restore when new tokens are present by Abioy · Pull Request #24110 · ggml-org/llama.cpp

Abioy · 2026-06-04T10:29:35Z

#23280 fixed a crash on hybrid attention models, but it may restore a checkpoint instead of reusing the current slot even when the input prompt fully matches the cached prefix and also contains new tokens (meaning there are still tokens to be evaluated).

Overview

Avoid unnecessary checkpoint restore when the request contains new tokens beyond the cached prefix.

The pos_min_thold calculation unconditionally subtracts 1 to guard against the edge case where n_past == n_tokens() (no tokens to evaluate for logits). However, when the request has new tokens to process, this -1 is overly conservative and may trigger a redundant checkpoint restore, involving unnecessary KV state deserialization and GPU memory writes.

This change conditionally applies the -1 only when n_past >= task.n_tokens() (no new tokens), skipping unnecessary checkpoint restoration when there is actual work to do.
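
The conditional guard described above can be sketched as follows. This is a minimal illustration, not the code from tools/server/server-context.cpp: the helper name `pos_min_thold_sketch` and the `n_window` parameter are assumptions introduced here, and only the idea of subtracting 1 when `n_past >= n_tokens` comes from the description.

```cpp
#include <algorithm>

// Hypothetical helper (not part of llama.cpp): computes the position threshold
// used to decide whether the cached state can be reused.
//   n_past        - number of tokens already cached for the slot
//   n_task_tokens - total number of tokens in the incoming request
//   n_window      - assumed attention window term (illustrative only)
static int pos_min_thold_sketch(int n_past, int n_task_tokens, int n_window) {
    // No new tokens: at least one token must be re-evaluated to obtain logits,
    // so the threshold is lowered by one extra position.
    const bool no_new_tokens = n_past >= n_task_tokens;
    const int  guard         = no_new_tokens ? 1 : 0;

    // Clamp at zero so the threshold never becomes negative.
    return std::max(0, n_past - n_window - guard);
}
```

With new tokens present, `pos_min_thold_sketch(100, 120, 1)` gives 99; with none, `pos_min_thold_sketch(100, 100, 1)` gives 98, which is the stricter value that was previously applied in both cases.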

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - AI was used to analyze the checkpoint restore logic.

The pos_min_thold calculation unconditionally subtracts 1 to ensure at least one token is evaluated for logits when no new tokens exist. However, when the request contains new tokens beyond the cached prefix, this -1 is overly conservative and may trigger an unnecessary checkpoint restore. Conditionally apply the -1 only when n_past >= task.n_tokens() (no new tokens), avoiding redundant KV state restoration when there is actual work to do.

Abioy · 2026-06-04T10:53:23Z

I think this PR might also fix #23589

ggerganov · 2026-06-04T11:22:08Z

Can you provide repro steps that demonstrate the issue on master?

ggerganov · 2026-06-04T12:14:29Z

Technically, this change should be correct. I just have no idea in what situation this can happen. So will wait for further feedback with a repro before proceeding.

Abioy · 2026-06-04T12:23:55Z

@ggerganov I run llama-server in router mode with Qwen3.6-27B-MTP and use OpenCode with it for coding tasks.

# router.ini, M36D is Qwen3.6-27B-MTP
[M36D]
LLAMA_ARG_JINJA = true
c = 262144
ctk=q8_0
ctv=q8_0
ctxcp = 24
ngl = 999
ub = 1024
b = 1024
parallel = 2
kv-unified = true
top-p = 0.95
top-k = 20
min-p = 0.0
temp = 0.6
presence-penalty = 0.0
no-mmproj = true
spec-type = draft-mtp
spec-draft-n-max = 2
ctkd=q8_0
# ctvd=q4_0
ctvd=q8_0
# cms = 5120
cms = 2048

What I see in the log:

// round n-k: create a checkpoint with 56853 tokens

// round n: total token 60004
[38491] 261.48.759.162 I slot      release: id  1 | task 28343 | stop processing: n_tokens = 60004, truncated = 0

// round n+1: the restore operation is not necessary
[38491] 261.49.276.216 I slot update_slots: id  1 | task 28419 | Checking checkpoint with [56853, 56853] against 60003...
[38491] 261.49.316.945 W slot update_slots: id  1 | task 28419 | restored context checkpoint (pos_min = 56853, pos_max = 56853, n_tokens = 56854, n_past = 56854, size = 149.626 MiB)
[38491] 262.12.577.142 I slot print_timing: id  1 | task 28419 | prompt eval time =    7853.10 ms /  3399 tokens (    2.31 ms per token,   432.82 tokens per second)
[38491] 262.12.577.151 I slot print_timing: id  1 | task 28419 |        eval time =   15447.88 ms /   358 tokens (   43.15 ms per token,    23.17 tokens per second)
[38491] 262.12.577.152 I slot print_timing: id  1 | task 28419 |       total time =   23300.97 ms /  3757 tokens
[38491] 262.12.577.154 I slot print_timing: id  1 | task 28419 |    graphs reused =      28287

// round n+2: same here
[38491] 262.12.580.009 I slot      release: id  1 | task 28419 | stop processing: n_tokens = 60610, truncated = 0
[38491] 262.27.271.642 I slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 1.000
[38491] 262.27.272.878 I slot update_slots: id  1 | task 28783 | Checking checkpoint with [56853, 56853] against 60609...
[38491] 262.43.046.440 I slot print_timing: id  1 | task 28783 | prompt eval time =    8345.48 ms /  3786 tokens (    2.20 ms per token,   453.66 tokens per second)
[38491] 262.43.046.451 I slot print_timing: id  1 | task 28783 |        eval time =    7428.10 ms /   172 tokens (   43.19 ms per token,    23.16 tokens per second)
[38491] 262.43.046.452 I slot print_timing: id  1 | task 28783 |       total time =   15773.58 ms /  3958 tokens
[38491] 262.43.046.454 I slot print_timing: id  1 | task 28783 |    graphs reused =      28456
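
To put a number on the cost in round n+1, the log values can be combined as in the sketch below. It assumes the whole 60004-token context of the previous task was a matching prefix of the new prompt (the log does not state this for round n+1); the struct and function names are hypothetical.

```cpp
// Hypothetical breakdown of the tokens evaluated after a checkpoint restore.
struct RestoreCost {
    int recomputed; // cached positions discarded by the restore and evaluated again
    int fresh;      // prompt tokens that were actually new in this request
};

//   n_cached      - tokens held by the slot when the previous task was released
//   n_restored    - n_past reported after the checkpoint restore
//   n_prompt_eval - token count reported on the "prompt eval time" line
static RestoreCost restore_cost(int n_cached, int n_restored, int n_prompt_eval) {
    const int recomputed = n_cached - n_restored;
    return { recomputed, n_prompt_eval - recomputed };
}
```

For round n+1, `restore_cost(60004, 56854, 3399)` gives 3150 recomputed tokens and 249 fresh ones. At the logged 2.31 ms per token, that is roughly 7.3 s spent re-evaluating context that was already in the slot, against about 0.6 s for the new tokens.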

Abioy · 2026-06-04T12:31:41Z

With this PR, it reuses the slot directly:

de-flandres · 2026-06-04T15:09:58Z

@Abioy do you think this fix will resolve #21831?

Abioy · 2026-06-04T15:33:52Z

> @Abioy do you think this fix will resolve #21831?

You mean "forcing full prompt re-processing" in qwen35? Some cases are caused by agent-related problems. Certain agents modify system messages, so the common prefix between the new input prompt and the cached prompt becomes too short (shorter than the reuse threshold). Such issues need to be resolved on the agent side.

If you are sure that the agent only appends new messages each turn, then this PR allows the request to reuse the slot directly, or to use the latest checkpoint (instead of an earlier checkpoint, as in #23589), which reduces prompt processing time. As in the screenshots I posted earlier, prompt eval time drops from 8000ms to 800ms.
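
The difference between an agent that appends messages and one that rewrites the system message can be illustrated with a longest-common-prefix check. This is a sketch only: the names are hypothetical, and the definitions of `sim` (prefix length over new prompt length) and `f_keep` (prefix length over cached length) are assumptions modelled on the `sim_best` and `f_keep` values printed in the log above.

```cpp
#include <cstddef>
#include <vector>

using tokens = std::vector<int>;

// Length of the longest common prefix of two token sequences.
static std::size_t common_prefix_len(const tokens & a, const tokens & b) {
    std::size_t i = 0;
    while (i < a.size() && i < b.size() && a[i] == b[i]) {
        ++i;
    }
    return i;
}

struct PrefixMatch {
    std::size_t lcp;    // shared prefix length in tokens
    float       sim;    // assumed: lcp / new prompt length
    float       f_keep; // assumed: lcp / cached prompt length
};

// Compare the cached prompt of a slot with an incoming prompt.
static PrefixMatch match_prefix(const tokens & cached, const tokens & prompt) {
    const std::size_t lcp = common_prefix_len(cached, prompt);
    const float sim    = prompt.empty() ? 0.0f : float(lcp) / float(prompt.size());
    const float f_keep = cached.empty() ? 0.0f : float(lcp) / float(cached.size());
    return { lcp, sim, f_keep };
}
```

With a cached prompt `{1, 2, 3, 4}`, an appended prompt `{1, 2, 3, 4, 5, 6}` keeps the whole cache (`lcp` of 4, `f_keep` of 1.0), while a prompt whose first token changed, `{9, 2, 3, 4, 5, 6}`, shares no prefix at all and cannot reuse anything.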

mdziekon · 2026-06-04T16:30:44Z

Just compiled b9509 (with this PR merged in), and I can confirm that the test case I've mentioned here is indeed now fixed. Not sure about the original report, but it seemed quite similar to my test case.

* origin/master: (57 commits)
  server : disable on-device spec checkpoints (ggml-org#24108)
  arg: fix double mtp downloads (ggml-org#24128)
  webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132)
  Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)
  ui: Fixed packages (ggml-org#24119)
  ui: added single line reasoning preview (ggml-org#23601)
  return filter to save memory (ggml-org#24125)
  convert: Fix Gemma 4 Unified conversion (ggml-org#24118)
  ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209)
  server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110)
  agents: refactor, include more guidelines (ggml-org#24111)
  webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065)
  build : use umbrella Headers directory for XCFramework module map (ggml-org#23974)
  server : add header to tools/server/server-http.h (ggml-org#24089)
  cmake: skip cvector-generator and export-lora when CPU backend is disabled (ggml-org#24053)
  fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org#24091)
  readme : add status badges (ggml-org#24104)
  tests : refactor test-save-load-state to accept token input (ggml-org#24073)
  metal : reduce rset heartbeat from 500ms -> 5ms (ggml-org#24074)
  ggml-webgpu: FlashAttention refactor + standardize quantization support (ggml-org#23834)
  ...

Abioy requested a review from a team as a code owner June 4, 2026 10:29

github-actions Bot added examples server labels Jun 4, 2026

ggerganov reviewed Jun 4, 2026

Comment thread tools/server/server-context.cpp

cont : add ref

24632d5

ggerganov approved these changes Jun 4, 2026

ggerganov merged commit 6f3a9f3 into ggml-org:master Jun 4, 2026
25 checks passed

de-flandres mentioned this pull request Jun 4, 2026

forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory) kyuz0/amd-strix-halo-toolboxes#99

Closed

Abioy mentioned this pull request Jun 4, 2026

Server forces full prompt re-processing on subsequent requests (SWA/recurrent memory error) #21831

Open

mdziekon mentioned this pull request Jun 4, 2026

Eval bug: KV cache drops ~4k tokens per turn on Qwen3.6-35B-A3B (since build b9235 ) #23589

Closed

ggerganov merged 2 commits into ggml-org:master from Abioy:syy_reduce_unnecessary_restore

4 participants