Skip to content

server : checkpoint before every user turn boundary#23814

Closed
mfielding92 wants to merge 2 commits into
ggml-org:masterfrom
mfielding92:master
Closed

server : checkpoint before every user turn boundary#23814
mfielding92 wants to merge 2 commits into
ggml-org:masterfrom
mfielding92:master

Conversation

@mfielding92
Copy link
Copy Markdown

Overview

#22929 creates a context checkpoint only before the last user message, so prompts with a stable prefix and content that changes between turns lose all checkpoint cache hits (the surviving checkpoints sit past the divergence point) and re-evaluate the full prompt every turn, notably on SWA models.

Create a checkpoint before every user message instead, derived from the same message_spans. prompt_get_n_before_user() becomes prompt_get_user_boundaries(); the prefill batch breaks at each boundary and a checkpoint is allowed at any of them, still bounded by --checkpoint-min-step and --ctx-checkpoints.

Additional information

Addresses the regression: #22929 (comment)

Originated from: 57f81cb
(Commit #22929 resolved some of these issues, but not all of them)

Checkpointing system working in testing: #22746 (comment)
(This PR uses the same technique, simply working alongside #22929 since it was originally designed before that commit)

Requirements

  • I have read and agree with the contributing guidelines: Yes

  • AI usage disclosure: Yes
    Diagnostics, patch drafts and suggestions, automating labor
    (example: running scripts and waiting for them to finish, recording results)

  • Human: Yes (of course!)
    Code review and finalization, task management, quality assurance.
    (I was consistently present, directing, and analyzing the code the entire time this patch was being worked on.)

ggml-org#22929 creates a context checkpoint only before the last user message,
so prompts with a stable prefix and content that changes between turns
lose all checkpoint cache hits (the surviving checkpoints sit past the
divergence point) and re-evaluate the full prompt every turn, notably on
SWA models.

Create a checkpoint before every user message instead, derived from the
same message_spans. prompt_get_n_before_user() becomes
prompt_get_user_boundaries(); the prefill batch breaks at each boundary
and a checkpoint is allowed at any of them, still bounded by
--checkpoint-min-step and --ctx-checkpoints.
@mfielding92 mfielding92 requested a review from a team as a code owner May 28, 2026 11:50
mfielding92 referenced this pull request May 28, 2026
Adds a backstep-and-restore path that prevents `update_slots` from
wiping the entire slot on every multi-turn request to QWEN35MOE /
QWEN3NEXT (hybrid Gated DeltaNet + attention). The post-LCP cut
is rounded back to the nearest chat-template turn boundary, the
existing context-checkpoint infrastructure is reused to restore
state when the recurrent backend rejects partial seq_rm, and a
checkpoint is forced on every boundary token so a usable rescue
point always exists.

Drops per-turn cost from ~30s full re-eval to ~1s O(delta) on a
36k-token Qwen 3.6 27B test (~45x).
@mfielding92
Copy link
Copy Markdown
Author

@pwilkin @ggerganov added performance testing to PR, verification of no performance degradation. Let me know if this needs anything else to be ready for review.

Copy link
Copy Markdown
Member

@pwilkin pwilkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH this looks good to me (minimal changes and pretty self-contained), would prob like @ngxson to take a look wrt. to the MTMD code.

@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented May 29, 2026

The mtmd code already exists from the previous version before this PR, so I think it won't change much. But I have to note that this part of the code is a bit over my head so I cannot tell for sure.

In anyway, since we now create checkpoint for both user and assistant message boundary, wouldn't it easier to simply create the checkpoint at every EOG tokens instead? Wouldn't detecting boundary via tokenizing become an overkill solution now?

@aldehir
Copy link
Copy Markdown
Contributor

aldehir commented May 30, 2026

In anyway, since we now create checkpoint for both user and assistant message boundary, wouldn't it easier to simply create the checkpoint at every EOG tokens instead?

Hmm, I think you still want to keep the last user checkpoint at all times. So when reasoning gets striped from a long-horizon agentic task, it can rollback to the last user checkpoint instead of the start (since it would have been evicted by then). I think we probably need to cache checkpoints with different eviction rules and maintain those at certain boundaries.

Comment thread tools/server/server-context.cpp
@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented May 30, 2026

@aldehir not quite sure if I understand your comment correctly, let's take an example:

<turn>user
prompt1
<turn>assistant
response1
<turn>user
prompt2
<turn>assistant
reasoning_content
response2

so on master currently we have a checkpoint right after prompt2

on the next user message, prompt3, the reasoning_content will be removed and we rollback to prompt2 checkpoint, things look good.

in either cases, I still don't quite sure what this PR is trying to address. to add to that, @reedmayhew18 your AI-generated comment #23814 (comment) is not welcomed here as (1) we strictly forbidden AI-written comments and (2) your result doesn't even represent all possible cases that can happen. please do not post excessive AI-generated content, that is not helpful

@mfielding92
Copy link
Copy Markdown
Author

mfielding92 commented May 30, 2026

@aldehir not quite sure if I understand your comment correctly, let's take an example:

<turn>user
prompt1
<turn>assistant
response1
<turn>user
prompt2
<turn>assistant
reasoning_content
response2

so on master currently we have a checkpoint right after prompt2

on the next user message, prompt3, the reasoning_content will be removed and we rollback to prompt2 checkpoint, things look good.

in either cases, I still don't quite sure what this PR is trying to address. to add to that, @reedmayhew18 your AI-generated comment #23814 (comment) is not welcomed here as (1) we strictly forbidden AI-written comments and (2) your result doesn't even represent all possible cases that can happen. please do not post excessive AI-generated content, that is not helpful

@aldehir Hello, my apologies for using A.I. to assist with my responses. I have a hard time typing out long replies due to some neurological health issues I have, so it helps me make sure I have all of my thoughts laid out cleanly without any gaps. I can avoid using it moving forward if its an issue.


What this patch addresses:

Essentially, without this patch, my computer (and several others) ends up reprocessing tens of thousands of tokens on every turn (sometimes hundreds of thousands with vary long conversations) if checkpoints are not made on an intelligent basis.

As demonstrated by the testing, any subtle change in previous messages right now has the potential to cause a full prompt re-evaluation which is extremely inefficient and wasteful on time. I have 122b-190b param models loaded sometimes and they are unusable without the patch, as forcing them to re-evaluate a prompt can sometimes take up to 30 minutes, and if its running tools, force it to re-evaluate the entire prompt multiple times per turn. It's completely miserable and unbearable.

I use these models like 8+ hours a day (for personal and work) with Claude Code, Open-webui, Kai (mobile app), etc. And for example: Kai isn't the best programmed and does some timestamp adjustment in previous messages and causes the current master branch to re-evaluate the entire prompt multiple times per turn. With this patch, responses take seconds vs. several minutes with 100b+ models.

Hopefully that gives a better real-world example. I'm personally shocked that people aren't using the system without some sort of patch like the one I'm suggesting. Just because I genuinely thought my computer was too slow to handle these models until I spend several dozen hours working on this.


Lastly, please understand: I am fully aware A.I. was used to assist with this coding and if that's an issue, I'm more than happy to let someone else look at my suggestion and re-write it themselves in a human manner. Please understand that I am very knowledgeable about how these systems work and understand the workflow about these things, it is simply a limitation on my ability to physically type out the code from scratch. (I am in classes learning these languages, I just have very much difficulty typing/remembering syntax due to my neurological conditions.

Hopefully that makes sense. Not trying to cause any issues. Just trying to help. I promise I can answer any questions you have about the code, part of my current job is being an IT Security Administrator overseeing a fleet of computers that need constant diagnosis, fixing, all this stuff. Linux is my daily driver, I dont use windows at home, just at work. Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

Thanks and let me know if you have any questions!

@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented May 30, 2026

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders

let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

@mfielding92
Copy link
Copy Markdown
Author

mfielding92 commented May 30, 2026

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders

I genuinely can help with this stuff and others can't. That's a huge difference regardless of the size of text. Either way, I understand the importance of conciseness.

let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

I was responding to the code change as you replied. I was simply addressing both of your concerns separately. (Why I was using A.I. and then the code changes requested.)

Sorry about the wall of text, that is how I naturally talk, which is what you're requiring here. This is where I'd typically have something help me condense my thoughts down. I'll do my best to keep responses short.

Catch-Up to Current Master
@ngxson
Copy link
Copy Markdown
Contributor

ngxson commented May 30, 2026

I might be a bit hard here, but the truth is that this PR isn't much different than other that are fully AI-generated

I don't believe patching that small prompt_get_n_before_user is an acceptable solution,but sadly that's what an AI will do if you ask it to. AI cannot look at the broader picture, unfortunately. The prompt_get_n_before_user itself is already a hacky solution and we don't want to extend from that, it will become extremely difficult to refactor it in the future.

It's best to open a discussion to see if the chat template system maintainers have a solution to somehow expose the boundary from the chat template engine back to the caller. I strongly believe it's feasible, and even if it's not, the logic to handle message boundaries should still eventually be in the chat.cpp, not inside server.

Closing this PR since both your current impl and the proposed one #23814 (comment) sounds too overkill

@ngxson ngxson closed this May 30, 2026
@Xcc313r4n7
Copy link
Copy Markdown

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders

let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

What a dickish comment

@nihilistau

This comment was marked as spam.

@Xcc313r4n7
Copy link
Copy Markdown

Gotta love the elitist gatekeepers that maintain this project

@Xcc313r4n7
Copy link
Copy Markdown

Xcc313r4n7 commented Jun 3, 2026

Only including these things so you understand that I'm not just another "vibe coder" submitting PRs without any clue of what they're doing.

including a wall of text doesn't make you different from any other vibe coders
let me put this simple: either you should answer my question (in my last comments) concisely and straight to the point, or let's close this PR and let another community member to submit a better, less AI-excessive patch.

What a dickish comment

It's probably the most insane thing I have seen on a project exclusively about creating AI and to be honest this person has the same kind of understanding some AI's do. Ever tried to get an AI to understand something it just can't understand because it is not in it's training data? That's the vibe I get from this guy. He just can't compute, His breain was stuck in a vibe coded loop.

This guy moderates at least 10 Reddit channels. I guarentee it! You simply can not be that much of a dick and not be a reddit mod.

That's an insult to AI though, maybe he should have used AI for HIS response, it would have been better than what he decided to go with

@nihilistau

This comment was marked as spam.

@ddh0
Copy link
Copy Markdown
Contributor

ddh0 commented Jun 3, 2026

Hi @Xcc313r4n7 and @nihilistau. Thank you for your constructive feedback about how to moderate contributions to a project that neither of you have ever contributed to. If you understand this part of the code and are able to contribute a legitimate fix that you can explain in your own words, please feel free to submit a PR.

@nihilistau

This comment was marked as spam.

@ggml-org ggml-org locked as off-topic and limited conversation to collaborators Jun 3, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants