server: avoid unnecessary checkpoint restore when new tokens are present #24110
The pos_min_thold calculation unconditionally subtracts 1 to ensure at least one token is evaluated for logits when no new tokens exist. However, when the request contains new tokens beyond the cached prefix, this -1 is overly conservative and may trigger an unnecessary checkpoint restore. Conditionally apply the -1 only when n_past >= task.n_tokens() (no new tokens), avoiding redundant KV state restoration when there is actual work to do.
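The condition described above can be sketched roughly as follows. This is only an illustration, not the actual server code: the helper names `pos_min_thold_for` and `needs_checkpoint_restore`, the plain `int` parameters, and the `pos_min > threshold` comparison are assumptions made for this example.

```cpp
#include <algorithm>

// Hypothetical helper (not the real llama.cpp function): compute the
// threshold, subtracting 1 only when the request has no new tokens.
static int pos_min_thold_for(int n_past, int n_tokens) {
    const bool no_new_tokens = n_past >= n_tokens;
    // Reserve one token for logits only when nothing else would be evaluated.
    return std::max(0, n_past - (no_new_tokens ? 1 : 0));
}

// Assumed check for this sketch: a restore is needed when the smallest
// cached position exceeds the threshold.
static bool needs_checkpoint_restore(int pos_min, int n_past, int n_tokens) {
    return pos_min > pos_min_thold_for(n_past, n_tokens);
}
```

Under these assumptions, a request with `n_past = 100` and `n_tokens = 120` keeps the threshold at 100, while a request with `n_tokens = 100` lowers it to 99, so only the second case can trip the restore check when `pos_min` is 100.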
|
I think this PR might also fix #23589 |
|
Can you provide repro steps that demonstrate the issue on |
|
Technically, this change should be correct. I just have no idea in what situation this can happen. So will wait for further feedback with a repro before proceeding. |
|
@ggerganov I run llama-server in router mode with Qwen3.6-27B-MTP, then use OpenCode with it for coding tasks.
# router.ini, M36D is Qwen3.6-27B-MTP
[M36D]
LLAMA_ARG_JINJA = true
c = 262144
ctk=q8_0
ctv=q8_0
ctxcp = 24
ngl = 999
ub = 1024
b = 1024
parallel = 2
kv-unified = true
top-p = 0.95
top-k = 20
min-p = 0.0
temp = 0.6
presence-penalty = 0.0
no-mmproj = true
spec-type = draft-mtp
spec-draft-n-max = 2
ctkd=q8_0
# ctvd=q4_0
ctvd=q8_0
# cms = 5120
cms = 2048
What I see in the log:
|
|
You mean "forcing full prompt re-processing" in qwen35? Some cases are caused by agent-related problems. Certain agents modify system messages, so the common prefix between the new input prompt and the cached prompt becomes too short (shorter than the reusable threshold). Such issues need to be resolved on the agent side. If you are sure that the agent only appends a new message each turn, then this PR allows the request to reuse the slot directly, or to use the latest checkpoint (instead of an earlier checkpoint, as in #23589), and the prompt processing time can be reduced. As in the screenshots I posted earlier, prompt eval time drops from 8000ms to 800ms. |
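To illustrate the point about a short common prefix, here is a minimal sketch of measuring how many leading tokens a new prompt shares with the cached one. The function name `common_prefix_len` and the use of plain `int` token IDs are assumptions for this example, not the server's own code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of leading tokens shared by the cached
// prompt and the new prompt.
static std::size_t common_prefix_len(const std::vector<int> & cached,
                                     const std::vector<int> & prompt) {
    // std::mismatch stops at the first differing token or at the end
    // of the shorter sequence.
    const auto diff = std::mismatch(cached.begin(), cached.end(),
                                    prompt.begin(), prompt.end());
    return static_cast<std::size_t>(diff.first - cached.begin());
}
```

If an agent only appends messages, the shared prefix covers the whole cached prompt; if it rewrites the system message, the sequences can differ from the first token and the shared prefix drops to zero.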
|
Just compiled |
* origin/master: (57 commits)
server : disable on-device spec checkpoints (ggml-org#24108)
arg: fix double mtp downloads (ggml-org#24128)
webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132)
Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)
ui: Fixed packages (ggml-org#24119)
ui: added single line reasoning preview (ggml-org#23601)
return filter to save memory (ggml-org#24125)
convert: Fix Gemma 4 Unified conversion (ggml-org#24118)
ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209)
server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110)
agents: refactor, include more guidelines (ggml-org#24111)
webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065)
build : use umbrella Headers directory for XCFramework module map (ggml-org#23974)
server : add header to tools/server/server-http.h (ggml-org#24089)
cmake: skip cvector-generator and export-lora when CPU backend is disabled (ggml-org#24053)
fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org#24091)
readme : add status badges (ggml-org#24104)
tests : refactor test-save-load-state to accept token input (ggml-org#24073)
metal : reduce rset heartbeat from 500ms -> 5ms (ggml-org#24074)
ggml-webgpu: FlashAttention refactor + standardize quantization support (ggml-org#23834)
...


#23280 fixed a crash on hybrid attention models, but it might restore a checkpoint rather than reuse the current slot even if the input prompt fully matches the cached prefix and also has new tokens (meaning there are still tokens to be evaluated).
Overview
Avoid unnecessary checkpoint restore when the request contains new tokens beyond the cached prefix.
The pos_min_thold calculation unconditionally subtracts 1 to guard against the edge case where n_past == n_tokens() (no tokens to evaluate for logits). However, when the request has new tokens to process, this -1 is overly conservative and may trigger a redundant checkpoint restore, involving unnecessary KV state deserialization and GPU memory writes.
This change conditionally applies the -1 only when n_past >= task.n_tokens() (no new tokens), skipping unnecessary checkpoint restoration when there is actual work to do.
Requirements