server: fix checkpoints creation #22929
Tested the following way:
Hi @jacekpoplawski, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
d878621 to ea9369c
Yes, that seems in a good direction. Have you done testing that it works as expected? |
|
This needs autoparser dedicated support for split-marker detection; currently, this will assume that all autoparser models use the ChatML markers ( I'll try to submit the marker detection code ASAP. |
|
const auto message_spans = json_value(data, "message_spans", json::array());
if (message_spans.is_array()) {
    int32_t last_user_pos = -1;
You can probably use 0 as the sentinel value here, since a checkpoint at pos 0 isn't useful. Should help clean up the other logic too.
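A minimal sketch of the sentinel idea from this review comment, not the PR's actual code. The `message_span` struct and the `last_user_begin` helper are hypothetical names introduced for illustration; the real `message_spans` value is a JSON array whose exact layout is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical span record: role plus [begin, end) offsets into the prompt.
struct message_span {
    std::string role;
    size_t      begin;
    size_t      end;
};

// Returns the start of the latest user message, or 0 when there is none.
// 0 doubles as the "not found" sentinel because a checkpoint at position 0
// would hold no state worth restoring.
size_t last_user_begin(const std::vector<message_span> & spans) {
    size_t pos = 0;
    for (const auto & s : spans) {
        assert(s.begin <= s.end);
        if (s.role == "user") {
            pos = s.begin; // later user messages overwrite earlier ones
        }
    }
    return pos;
}
```

With an unsigned position and 0 as the sentinel, callers need only a single `pos > 0` check and no signed-to-`size_t` casts.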
|
if ((size_t) last_user_pos <= prompt.size()) {
    const std::string prefix = prompt.substr(0, (size_t) last_user_pos);
    const auto prefix_tokens = common_tokenize(vocab, prefix, true, true);
Just a guess, but this will probably create incorrect checkpoints for multimodal models with at least one image in the prompt.
Yes, you are right, this breaks after the first image.
Now it should be OK.
It is stable for my use case: pi, qwen 3.6 27B, 200k ctx, 24 checkpoints.
With 8 checkpoints I was able to reproduce
As @aldehir pointed out, this does not work correctly with multimodal prompts. I committed a fallback to the old mechanism for that case. Should I add a switch to enable this new mechanism as an option, or should I try to support multimodal prompts as well?
I understand that the impact of this change is significant, but the benefits are also significant: agentic coding is much more responsive now.
|
Tested, model used: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF
The message
But there is a cache miss that happened only once, and I can't reproduce it. I tried for 1h without being able to hit it again.
Great work! You saved me time and my electricity bill.
Edit:
// stop the prompt batch exactly before the latest user input, so a checkpoint
// can be created at the conversation boundary
if (checkpoint_before_last_user_token > 0 &&
    slot.prompt.n_tokens() == checkpoint_before_last_user_token) {
Just for testing, can you check that this works also with images:
-    slot.prompt.n_tokens() == checkpoint_before_last_user_token) {
+    slot.prompt.get_text_tokens().size() == checkpoint_before_last_user_token) {
And also the same change applied below on line 2752
I tried using slot.prompt.tokens.get_text_tokens().size(), but it didn’t help. I am now testing an approach that iterates over the full token sequence while skipping LLAMA_TOKEN_NULL, and it seems to work.
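The approach described in this reply, walking the full token sequence and skipping `LLAMA_TOKEN_NULL` entries, could look roughly like the sketch below. It is an illustration under assumptions, not the code in the PR: `TOKEN_NULL` is a local stand-in for `LLAMA_TOKEN_NULL` (assumed to be -1 and assumed to mark positions occupied by image embeddings), and `pos_after_n_text_tokens` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-in for LLAMA_TOKEN_NULL (assumed to be -1). In this sketch it
// marks positions that belong to an image chunk rather than to a text token.
constexpr int32_t TOKEN_NULL = -1;

// Hypothetical helper: map "the first n_text text tokens" to a position in
// the full (text + image) sequence. Returns the index right after the
// n_text-th text token, or -1 if the sequence has fewer text tokens.
int64_t pos_after_n_text_tokens(const std::vector<int32_t> & tokens, size_t n_text) {
    if (n_text == 0) {
        return 0;
    }
    size_t seen = 0;
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] == TOKEN_NULL) {
            continue; // image position: advances the index, not the text count
        }
        if (++seen == n_text) {
            return (int64_t) i + 1;
        }
    }
    return -1;
}
```

Counting only text tokens while still advancing the index for image positions keeps the boundary aligned with the server-side sequence, which is longer than the text-only tokenization whenever an image precedes the boundary.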
|
Just pulled that PR. When using Pi or OpenCode I still get |
Could you say a bit more and show some logs? I tested this fix for many hours (Qwen 3.6 27B), and |
|
@ggerganov Multimodal prompts are now supported. I will continue working on this, because the checkpoint is currently created only after the image is read for the second time, not the first time. If you are happy with the direction of my changes, I’d like to add two arguments: one to set |
I was using qwen3.5 122b. Sadly, I can't provide more logs right now.
|
@jacekpoplawski Is the |
|
Hi,
1. The On-Demand Checkpoint
For speculative KV-cache preloading, an on-demand checkpoint triggered by an abrupt TCP disconnect would probably fix it, because you can get the information of "where" to do it while it's still running.
2. The LRU Strategy & Truncation (The Core Issue)
My workflow involves a Context LRU system where rg-edit frequently modifies files within the context.
How the LRU works:
3. Current Implementation & Optimization
Currently, I maintain 64 checkpoints with a frequency of 1 checkpoint every 2k tokens.
4. Potential Optimization: Exponential Checkpoint Frequency
A more complex strategy to further optimize this would be possible: Exponential Checkpoint Frequency.
Conclusion: Before --checkpoint-every-n-tokens existed, I had already implemented it in a PR (a checkpoint at every logical batch size), and back then my PR fixed hybrids for opencode too, so I guess opencode has the same requirements under the hood. It's just less apparent because with an agent you can't tell immediately that things aren't working right when it starts to recompute from scratch.
Best regards.
Exactly — For what it's worth, the token position is stable and known ahead of time in chat assistant use cases — the system prompt doesn't change between turns, so the flag value can be set once at server launch. I'd also echo @aagit's point that periodic checkpoints ( Looking forward to the follow-up PR. |
|
Is this supposed to somehow help #22907 (comment) ? |
I see that a checkpoint on TCP disconnect was already proposed before for the timeout case. Unless there are other cons I'm missing, such an option to checkpoint on TCP disconnect would save me on average about half a second to 1 second per speculative background context preload interruption. So I'd certainly like that too. It's a smaller issue than the lack of --checkpoint-max-step, though.
* common : add common_chat_split_by_role
* cont : fix spans to reach end of message
* server: fix checkpoints creation
  - extract message_spans from chat templates
  - find the prompt token position before the latest user message
  - split prompt batching at that position
  - create a context checkpoint before the latest user input
  - avoid periodic mid-prompt checkpoints when that position is known
  - handle multimodal prompts when mapping text/template positions to server prompt tokens
  - add --checkpoint-min-step to control minimum spacing between checkpoints
* cont : clean-up
* Support autoparser detection for message barriers
* server: fix message span delimiter and update docs
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
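The "split prompt batching at that position" step in the commit message above can be sketched as follows. This is a simplified illustration, not the server's actual batching loop: `next_batch_size` and its parameters are hypothetical names, and `boundary == 0` is assumed to mean that no boundary is known.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many prompt tokens to put in the next batch.
//   n_done   - prompt tokens already processed
//   n_prompt - total prompt tokens
//   n_batch  - maximum batch size
//   boundary - token position before the latest user message (0 = unknown)
// The batch is cut short so that it ends exactly at the boundary, which lets
// a checkpoint be taken there before processing continues.
size_t next_batch_size(size_t n_done, size_t n_prompt, size_t n_batch, size_t boundary) {
    assert(n_done <= n_prompt);
    size_t n = std::min(n_batch, n_prompt - n_done);
    if (boundary > n_done && boundary < n_done + n) {
        n = boundary - n_done; // stop right before the latest user input
    }
    return n;
}
```

For a 1000-token prompt with a batch size of 512 and the boundary at 700, this yields batches of 512, 188 and 300 tokens, so the second batch ends exactly at the boundary.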
|
Since my English is not very good and my native language is Chinese, the following was translated with the help of Qwen:
Feature Request: Support mid-prompt periodic checkpoints for long-context retry resilience
Context
First, thank you for the great work on #22929! The message-boundary checkpoint logic is a smart optimization for typical chat workflows. However, I'm encountering a limitation in a production scenario with very long contexts (150k+ tokens):
Current Behavior
Desired Behavior
Use Case Details
Suggested Solutions (any would help)
Related Feedback
This aligns with concerns raised by @aagit and @ZacharyReis in #22929 about:
Questions
Thanks for considering this! 🙏 |
Adds a backstep-and-restore path that prevents `update_slots` from wiping the entire slot on every multi-turn request to QWEN35MOE / QWEN3NEXT (hybrid Gated DeltaNet + attention). The post-LCP cut is rounded back to the nearest chat-template turn boundary, the existing context-checkpoint infrastructure is reused to restore state when the recurrent backend rejects partial seq_rm, and a checkpoint is forced on every boundary token so a usable rescue point always exists. Drops per-turn cost from ~30s full re-eval to ~1s O(delta) on a 36k-token Qwen 3.6 27B test (~45x).
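The "rounded back to the nearest chat-template turn boundary" step described in this commit message might be sketched like this. It is a sketch under assumptions: `round_to_boundary` is a hypothetical name, the boundaries are assumed to be token positions sorted in ascending order, and 0 is used to mean "no usable boundary, start from scratch".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given sorted turn-boundary positions and the length of
// the common prefix (LCP) between the cached and the new prompt, return the
// largest boundary that does not exceed the LCP, or 0 if there is none.
size_t round_to_boundary(const std::vector<size_t> & boundaries, size_t n_common) {
    assert(std::is_sorted(boundaries.begin(), boundaries.end()));
    // first boundary strictly greater than n_common
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), n_common);
    if (it == boundaries.begin()) {
        return 0; // every boundary lies past the divergence point
    }
    return *(it - 1);
}
```

Rounding down rather than to the closest boundary matters here: state saved past the divergence point would include tokens that are no longer part of the prompt.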
|
@mscheurwater @ZacharyReis could you check #23808? |
ggml-org#22929 creates a context checkpoint only before the last user message, so prompts with a stable prefix and content that changes between turns lose all checkpoint cache hits (the surviving checkpoints sit past the divergence point) and re-evaluate the full prompt every turn, notably on SWA models. Create a checkpoint before every user message instead, derived from the same message_spans. prompt_get_n_before_user() becomes prompt_get_user_boundaries(); the prefill batch breaks at each boundary and a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints.
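One way to picture "a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints" is the selection sketch below. How the two limits interact in the real server is an assumption here: the sketch treats the first as a minimum token distance between kept checkpoints and the second as a cap that evicts the oldest checkpoint first. `select_checkpoints` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical helper: pick checkpoint positions from sorted user-message
// boundaries, keeping at least min_step tokens between neighbours and at
// most max_checkpoints positions (oldest dropped first).
std::vector<size_t> select_checkpoints(const std::vector<size_t> & boundaries,
                                       size_t min_step,
                                       size_t max_checkpoints) {
    assert(std::is_sorted(boundaries.begin(), boundaries.end()));
    std::deque<size_t> kept;
    for (size_t b : boundaries) {
        if (b == 0) {
            continue; // nothing to save at the very start of the prompt
        }
        if (!kept.empty() && b - kept.back() < min_step) {
            continue; // too close to the previous checkpoint
        }
        kept.push_back(b);
        if (kept.size() > max_checkpoints) {
            kept.pop_front(); // evict the oldest checkpoint
        }
    }
    return std::vector<size_t>(kept.begin(), kept.end());
}
```

Oldest-first eviction is only one possible policy; it favours recent turns, which suits prompts that diverge near the end but not those with a stable prefix and changing middle content.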
I agree, my prompt cache numbers after this merge (b9105 before, now on b9396, Qwen3.6 35B A3B) dropped quite a bit due to me not being able to set this number anymore. It is very relevant in some agentic workloads when you're sending many requests with same instructions but different data in one message and the whole thing fits in <1000 tokens. Before: Now: |
|
I don't think I can convince everyone either.
|
I'm also experiencing this issue. I use RAG and include the information in the message with a depth of 4. Because of the change made here, I lose the entire context. |
|
The potential negative impacts of this feature change are still gradually coming to light.
|
Yeah sometimes you could have a really long system prompt with an ending that dynamically changes. Checkpoints by message boundaries might not always work well. |
If the context needs to be recalculated, a checkpoint could be generated for each user message. |
|
I am currently using the following strategy to adapt to the new KV cache logic:
|
git merge's 3-way resolver did not flag two semantic duplicates in
tools/server/server-context.cpp because the merge base did not contain
either symbol. The duplicate bodies are byte-identical, so removing
the second copy of each pair is semantically equivalent.
Removed:
- `const bool near_prompt_end` declaration at line 3653 (upstream
side, e2ef8fe, "server: fix checkpoints creation" PR ggml-org#22929).
- `static uint32_t server_n_outputs_max(...)` body at lines 219-232
(upstream side, de6f727, "llama: limit max outputs of
llama_context" PR ggml-org#23861; one line modified by 5dcb711,
"speculative: fix n_outputs_max and remove draft-simple auto-enable"
PR ggml-org#23988).
Kept:
- The cache-side copies (72cfbcd), which match the
cache-optimization chain the Stage 11 work is built on.



Overview
Implemented as requested in #22826 (comment)
message_spans from chat templates
Additional information
This is another chapter in my journey toward fixing
forcing full prompt re-processing due to lack of cache data
My main goal is to increase the "responsiveness" of agentic coding in llama.cpp.
I am currently testing this with the following command:
preserve_thinking really helps; without it, the prompt history changes, so there is always some reprocessing.
Requirements