server : checkpoint before every user turn boundary#23814
Conversation
ggml-org#22929 creates a context checkpoint only before the last user message, so prompts with a stable prefix and content that changes between turns lose all checkpoint cache hits (the surviving checkpoints sit past the divergence point) and re-evaluate the full prompt every turn, notably on SWA models. Create a checkpoint before every user message instead, derived from the same message_spans. prompt_get_n_before_user() becomes prompt_get_user_boundaries(); the prefill batch breaks at each boundary and a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints.
Adds a backstep-and-restore path that prevents `update_slots` from wiping the entire slot on every multi-turn request to QWEN35MOE / QWEN3NEXT (hybrid Gated DeltaNet + attention). The post-LCP cut is rounded back to the nearest chat-template turn boundary, the existing context-checkpoint infrastructure is reused to restore state when the recurrent backend rejects partial seq_rm, and a checkpoint is forced on every boundary token so a usable rescue point always exists. Drops per-turn cost from ~30s full re-eval to ~1s O(delta) on a 36k-token Qwen 3.6 27B test (~45x).
|
@pwilkin @ggerganov added performance testing to PR, verification of no performance degradation. Let me know if this needs anything else to be ready for review. |
|
The mtmd code already exists from the previous version before this PR, so I think it won't change much. But I have to note that this part of the code is a bit over my head so I cannot tell for sure. In anyway, since we now create checkpoint for both user and assistant message boundary, wouldn't it easier to simply create the checkpoint at every EOG tokens instead? Wouldn't detecting boundary via tokenizing become an overkill solution now? |
Hmm, I think you still want to keep the last user checkpoint at all times. So when reasoning gets striped from a long-horizon agentic task, it can rollback to the last user checkpoint instead of the start (since it would have been evicted by then). I think we probably need to cache checkpoints with different eviction rules and maintain those at certain boundaries. |
|
@aldehir not quite sure if I understand your comment correctly, let's take an example: so on on the next user message, in either cases, I still don't quite sure what this PR is trying to address. to add to that, @reedmayhew18 your AI-generated comment #23814 (comment) is not welcomed here as (1) we strictly forbidden AI-written comments and (2) your result doesn't even represent all possible cases that can happen. please do not post excessive AI-generated content, that is not helpful |
@aldehir Hello, my apologies for using A.I. to assist with my responses. I have a hard time typing out long replies due to some neurological health issues I have, so it helps me make sure I have all of my thoughts laid out cleanly without any gaps. I can avoid using it moving forward if its an issue. What this patch addresses: Essentially, without this patch, my computer (and several others) ends up reprocessing tens of thousands of tokens on every turn (sometimes hundreds of thousands with vary long conversations) if checkpoints are not made on an intelligent basis. As demonstrated by the testing, any subtle change in previous messages right now has the potential to cause a full prompt re-evaluation which is extremely inefficient and wasteful on time. I have 122b-190b param models loaded sometimes and they are unusable without the patch, as forcing them to re-evaluate a prompt can sometimes take up to 30 minutes, and if its running tools, force it to re-evaluate the entire prompt multiple times per turn. It's completely miserable and unbearable. I use these models like 8+ hours a day (for personal and work) with Claude Code, Open-webui, Kai (mobile app), etc. And for example: Kai isn't the best programmed and does some timestamp adjustment in previous messages and causes the current master branch to re-evaluate the entire prompt multiple times per turn. With this patch, responses take seconds vs. several minutes with 100b+ models. Hopefully that gives a better real-world example. I'm personally shocked that people aren't using the system without some sort of patch like the one I'm suggesting. Just because I genuinely thought my computer was too slow to handle these models until I spend several dozen hours working on this. Lastly, please understand: I am fully aware A.I. was used to assist with this coding and if that's an issue, I'm more than happy to let someone else look at my suggestion and re-write it themselves in a human manner. Please understand that I am very knowledgeable about how these systems work and understand the workflow about these things, it is simply a limitation on my ability to physically type out the code from scratch. (I am in classes learning these languages, I just have very much difficulty typing/remembering syntax due to my neurological conditions. Hopefully that makes sense. Not trying to cause any issues. Just trying to help. I promise I can answer any questions you have about the code, part of my current job is being an IT Security Administrator overseeing a fleet of computers that need constant diagnosis, fixing, all this stuff. Linux is my daily driver, I dont use windows at home, just at work. Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing. Thanks and let me know if you have any questions! |
including a wall of text doesn't make you different from any other vibe coders let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch. |
I genuinely can help with this stuff and others can't. That's a huge difference regardless of the size of text. Either way, I understand the importance of conciseness.
I was responding to the code change as you replied. I was simply addressing both of your concerns separately. (Why I was using A.I. and then the code changes requested.) Sorry about the wall of text, that is how I naturally talk, which is what you're requiring here. This is where I'd typically have something help me condense my thoughts down. I'll do my best to keep responses short. |
Catch-Up to Current Master
|
I might be a bit hard here, but the truth is that this PR isn't much different than other that are fully AI-generated I don't believe patching that small It's best to open a discussion to see if the chat template system maintainers have a solution to somehow expose the boundary from the chat template engine back to the caller. I strongly believe it's feasible, and even if it's not, the logic to handle message boundaries should still eventually be in the chat.cpp, not inside server. Closing this PR since both your current impl and the proposed one #23814 (comment) sounds too overkill |
What a dickish comment |
This comment was marked as spam.
This comment was marked as spam.
|
Gotta love the elitist gatekeepers that maintain this project |
That's an insult to AI though, maybe he should have used AI for HIS response, it would have been better than what he decided to go with |
This comment was marked as spam.
This comment was marked as spam.
|
Hi @Xcc313r4n7 and @nihilistau. Thank you for your constructive feedback about how to moderate contributions to a project that neither of you have ever contributed to. If you understand this part of the code and are able to contribute a legitimate fix that you can explain in your own words, please feel free to submit a PR. |
Overview
#22929 creates a context checkpoint only before the last user message, so prompts with a stable prefix and content that changes between turns lose all checkpoint cache hits (the surviving checkpoints sit past the divergence point) and re-evaluate the full prompt every turn, notably on SWA models.
Create a checkpoint before every user message instead, derived from the same message_spans. prompt_get_n_before_user() becomes prompt_get_user_boundaries(); the prefill batch breaks at each boundary and a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints.
Additional information
Addresses the regression: #22929 (comment)
Originated from: 57f81cb
(Commit #22929 resolved some of these issues, but not all of them)
Checkpointing system working in testing: #22746 (comment)
(This PR uses the same technique, simply working alongside #22929 since it was originally designed before that commit)
Requirements
I have read and agree with the contributing guidelines: Yes
AI usage disclosure: Yes
Diagnostics, patch drafts and suggestions, automating labor
(example: running scripts and waiting for them to finish, recording results)
Human: Yes (of course!)
Code review and finalization, task management, quality assurance.
(I was consistently present, directing, and analyzing the code the entire time this patch was being worked on.)