server: avoid unnecessary checkpoint restore when new tokens are present #24110

Merged
ggerganov merged 2 commits into ggml-org:master from Abioy:syy_reduce_unnecessary_restore
Jun 4, 2026

@Abioy
Contributor

@Abioy Abioy commented Jun 4, 2026

#23280 fixed a crash on hybrid attention models, but it may restore a checkpoint rather than reuse the current slot even when the input prompt fully matches the cached prefix and also contains new tokens (meaning there are still tokens to be evaluated).

Overview

Avoid unnecessary checkpoint restore when the request contains new tokens beyond the cached prefix.

The pos_min_thold calculation unconditionally subtracts 1 to guard against the edge case where n_past == n_tokens() (no tokens to evaluate for logits). However, when the request has new tokens to process, this -1 is overly conservative and may trigger a redundant checkpoint restore, involving unnecessary KV state deserialization and GPU memory writes.

This change conditionally applies the -1 only when n_past >= task.n_tokens() (no new tokens), skipping unnecessary checkpoint restoration when there is actual work to do.
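The conditional guard described above can be sketched roughly as follows. This is only an illustration, not the code from tools/server/server-context.cpp: the helper names `pos_min_thold` and `needs_restore`, the `n_window` parameter, and the assumption that a restore is attempted when the slot's `pos_min` exceeds the threshold are all hypothetical, and the numbers are illustrative rather than taken from a real run.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper (not the actual llama.cpp code): the largest pos_min
// for which the slot's current cache is still considered directly usable.
// n_window stands in for whatever window the real expression subtracts.
constexpr int32_t pos_min_thold(int32_t n_past, int32_t n_task_tokens, int32_t n_window) {
    // Apply the extra -1 only when the whole prompt is already cached, i.e.
    // there would otherwise be no token left to evaluate for logits.
    const int32_t guard = (n_past >= n_task_tokens) ? 1 : 0;
    return std::max<int32_t>(0, n_past - n_window - guard);
}

// Assumed comparison: a checkpoint restore is attempted when the cache's
// minimum position lies above the threshold.
constexpr bool needs_restore(int32_t pos_min, int32_t thold) {
    return pos_min > thold;
}

// No new tokens: the guard applies and the threshold drops by one.
static_assert(pos_min_thold(60004, 60004, 1) == 60002, "guard applied");
// New tokens present: the guard is skipped, so the threshold is one higher.
static_assert(pos_min_thold(60004, 60300, 1) == 60003, "guard skipped");
// With pos_min = 60003, only the guarded threshold forces a restore.
static_assert(needs_restore(60003, 60002), "restore with guard");
static_assert(!needs_restore(60003, 60003), "slot reused without guard");
```

Under these assumptions, the unconditional -1 is exactly what pushes a cache whose `pos_min` sits at the last position over the threshold, even though the new tokens already guarantee that something will be evaluated.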

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI was used to analyze the checkpoint restore logic.

The pos_min_thold calculation unconditionally subtracts 1 to ensure at
least one token is evaluated for logits when no new tokens exist.
However, when the request contains new tokens beyond the cached prefix,
this -1 is overly conservative and may trigger an unnecessary checkpoint
restore.

Conditionally apply the -1 only when n_past >= task.n_tokens() (no new
tokens), avoiding redundant KV state restoration when there is actual
work to do.
@Abioy Abioy requested a review from a team as a code owner June 4, 2026 10:29
@Abioy
Contributor Author

Abioy commented Jun 4, 2026

I think this PR might also fix #23589

@ggerganov
Member

Can you provide repro steps that demonstrate the issue on master?

@ggerganov
Member

Technically, this change should be correct. I just have no idea in what situation this can happen. So will wait for further feedback with a repro before proceeding.

@Abioy
Contributor Author

Abioy commented Jun 4, 2026

@ggerganov I run llama-server in router mode with Qwen3.6-27B-MTP and use OpenCode with it for coding tasks.

# router.ini, M36D is Qwen3.6-27B-MTP
[M36D]
LLAMA_ARG_JINJA = true
c = 262144
ctk=q8_0
ctv=q8_0
ctxcp = 24
ngl = 999
ub = 1024
b = 1024
parallel = 2
kv-unified = true
top-p = 0.95
top-k = 20
min-p = 0.0
temp = 0.6
presence-penalty = 0.0
no-mmproj = true
spec-type = draft-mtp
spec-draft-n-max = 2
ctkd=q8_0
# ctvd=q4_0
ctvd=q8_0
# cms = 5120
cms = 2048

What I see in the log:

// round n-k: create a checkpoint with 56853 tokens

// round n: total token 60004
[38491] 261.48.759.162 I slot      release: id  1 | task 28343 | stop processing: n_tokens = 60004, truncated = 0

// round n+1: the restore operation is not necessary
[38491] 261.49.276.216 I slot update_slots: id  1 | task 28419 | Checking checkpoint with [56853, 56853] against 60003...
[38491] 261.49.316.945 W slot update_slots: id  1 | task 28419 | restored context checkpoint (pos_min = 56853, pos_max = 56853, n_tokens = 56854, n_past = 56854, size = 149.626 MiB)
[38491] 262.12.577.142 I slot print_timing: id  1 | task 28419 | prompt eval time =    7853.10 ms /  3399 tokens (    2.31 ms per token,   432.82 tokens per second)
[38491] 262.12.577.151 I slot print_timing: id  1 | task 28419 |        eval time =   15447.88 ms /   358 tokens (   43.15 ms per token,    23.17 tokens per second)
[38491] 262.12.577.152 I slot print_timing: id  1 | task 28419 |       total time =   23300.97 ms /  3757 tokens
[38491] 262.12.577.154 I slot print_timing: id  1 | task 28419 |    graphs reused =      28287

// round n+2: same here
[38491] 262.12.580.009 I slot      release: id  1 | task 28419 | stop processing: n_tokens = 60610, truncated = 0
[38491] 262.27.271.642 I slot get_availabl: id  1 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 1.000
[38491] 262.27.272.878 I slot update_slots: id  1 | task 28783 | Checking checkpoint with [56853, 56853] against 60609...
[38491] 262.43.046.440 I slot print_timing: id  1 | task 28783 | prompt eval time =    8345.48 ms /  3786 tokens (    2.20 ms per token,   453.66 tokens per second)
[38491] 262.43.046.451 I slot print_timing: id  1 | task 28783 |        eval time =    7428.10 ms /   172 tokens (   43.19 ms per token,    23.16 tokens per second)
[38491] 262.43.046.452 I slot print_timing: id  1 | task 28783 |       total time =   15773.58 ms /  3958 tokens
[38491] 262.43.046.454 I slot print_timing: id  1 | task 28783 |    graphs reused =      28456
image

@Abioy
Contributor Author

Abioy commented Jun 4, 2026

And with this PR, it reuses the slot directly:

image

Comment thread tools/server/server-context.cpp
@ggerganov ggerganov merged commit 6f3a9f3 into ggml-org:master Jun 4, 2026
25 checks passed
@de-flandres

@Abioy do you think this fix will resolve #21831?

@Abioy
Contributor Author

Abioy commented Jun 4, 2026

@Abioy do you think this fix will resolve #21831?

You mean "forcing full prompt re-processing" in qwen35? Some cases are caused by agent-related problems: certain agents modify system messages, so the common prefix between the new input prompt and the cached prompt is too short (shorter than the reusable threshold). Such issues need to be resolved on the agent side.

If you are sure that the agent just appends a new message each turn, then this PR allows the request to directly reuse the slot, or to use the latest checkpoint (instead of an earlier checkpoint, as in #23589), and the prompt processing time can be reduced. As in the screenshots I posted earlier, prompt eval time drops from 8000ms to 800ms.
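To illustrate why an agent that rewrites the system message defeats slot reuse while an append-only agent does not, here is a minimal sketch of a common-prefix check. The names `common_prefix_len` and `prefix_similarity`, and the definition of similarity as prefix length divided by new prompt length, are assumptions for illustration and may not match how the server computes `sim_best`; only the 0.100 threshold appears in the log above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using tokens_t = std::vector<int32_t>;

// Length of the longest common prefix of two token sequences.
inline size_t common_prefix_len(const tokens_t & a, const tokens_t & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        ++n;
    }
    return n;
}

// Assumed similarity measure: the fraction of the new prompt that is already
// covered by the cached prompt's prefix (0 for an empty prompt).
inline float prefix_similarity(const tokens_t & cached, const tokens_t & prompt) {
    if (prompt.empty()) {
        return 0.0f;
    }
    return float(common_prefix_len(cached, prompt)) / float(prompt.size());
}

// Hypothetical reuse check against a threshold such as the 0.100 in the log.
inline bool can_reuse(const tokens_t & cached, const tokens_t & prompt, float thold) {
    return prefix_similarity(cached, prompt) > thold;
}
```

With a cached prompt `{1, 2, 3, 4}`, an append-only follow-up `{1, 2, 3, 4, 5, 6, 7, 8}` shares a prefix of length 4, while a follow-up whose first token changed, such as `{9, 2, 3, 4, 5}`, shares none and would fall below any positive threshold.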

@mdziekon

mdziekon commented Jun 4, 2026

Just compiled b9509 (with this PR merged in), and I can confirm that the test case I've mentioned here is indeed now fixed. Not sure about the original report, but it seemed quite similar to my test case.

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 4, 2026

4 participants